[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame

2018-09-14 Thread Ashish Shrowty (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16615434#comment-16615434
 ] 

Ashish Shrowty commented on SPARK-14948:


I have hit this issue and it is blocking some critical development. Any idea when 
the PR will be merged?

> Exception when joining DataFrames derived from the same DataFrame
> -
>
> Key: SPARK-14948
> URL: https://issues.apache.org/jira/browse/SPARK-14948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Saurabh Santhosh
>Priority: Major
>
> h2. The Spark analyzer throws the following exception in a specific scenario:
> h2. Exception:
> org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing 
> from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> h2. Code :
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());
> JavaRDD<Row> rdd =
>     sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
> 
> DataFrame join = aliasedDf.join(df, aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations:
> * This issue is related to the data type of the fields of the initial DataFrame (if the data type is not String, it works).
> * It works fine if the DataFrame is registered as a temporary table and a SQL query (select a.asd, b.F1 from t2 a inner join t3 b on a.F2 = b.F2) is written instead; see the sketch below.
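
A minimal Scala sketch of the SQL workaround from the second observation, assuming the same t2/t3 temporary tables registered in the Java snippet above (illustrative only; Spark 1.6-style sqlContext in spark-shell):

{code}
// Workaround sketch: express the join as SQL over the registered temporary tables
// instead of building it from columns taken off the original DataFrames.
val workaround = sqlContext.sql(
  "select a.asd, b.F1 from t2 a inner join t3 b on a.F2 = b.F2")
workaround.collect()   // per the observation above, this resolves without the AnalysisException
{code}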






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-21 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15594989#comment-15594989
 ] 

Ashish Shrowty commented on SPARK-17709:


[~sowen], [~smilegator] - I confirmed that this is not a problem in 2.0.1. 
Sorry .. forgot to come back and post my finding. Thanks for your help guys!

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2
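
For contrast, a minimal spark-shell sketch of the path the report says does work, loading the same data with spark.read.parquet() instead of spark.sql(); the path is a placeholder, not taken from the report:

{code}
import org.apache.spark.sql.functions.avg

// Hypothetical sketch: same aggregations, but the parent DataFrame is read
// directly from Parquet rather than via spark.sql(); per the report this join resolves.
val d1  = spark.read.parquet("s3://some-bucket/some-table")   // placeholder path
val df1 = d1.groupBy("key1", "key2").agg(avg("totalprice").as("avgtotalprice"))
val df2 = d1.groupBy("key1", "key2").agg(avg("itemcount").as("avgqty"))
df1.join(df2, Seq("key1", "key2")).show()
{code}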






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-15 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578200#comment-15578200
 ] 

Ashish Shrowty commented on SPARK-17709:


Oh .. sorry .. I misread. Will try with 2.0.1 later

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Comment Edited] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-14 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576141#comment-15576141
 ] 

Ashish Shrowty edited comment on SPARK-17709 at 10/14/16 6:41 PM:
--

There is a slight difference: in my case the IDs generated are the same, e.g. companyid#121 in 
both aggregates, whereas in your plan the IDs are different, companyid#5 and 
companyid#46. This is probably causing the resolution error?

I will try in the 2.0.1 branch later today.


was (Author: ashrowty):
There is a slight difference: in my case the IDs generated are the same, e.g. companyid#121 
in both aggregates, whereas in your plan the IDs are different, companyid#5 and 
companyid#46. This is probably causing the resolution error?

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Issue Comment Deleted] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-14 Thread Ashish Shrowty (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish Shrowty updated SPARK-17709:
---
Comment: was deleted

(was: There is a slight difference .. in my case it's companyid#121 in both 
relations whereas in yours it's different. Perhaps that is causing the 
resolution error?)

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-14 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576143#comment-15576143
 ] 

Ashish Shrowty commented on SPARK-17709:


There is a slight difference .. in my case it's companyid#121 in both relations 
whereas in yours it's different. Perhaps that is causing the resolution error?

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-14 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576141#comment-15576141
 ] 

Ashish Shrowty commented on SPARK-17709:


There is a slight difference: in my case the IDs generated are the same, e.g. companyid#121 
in both aggregates, whereas in your plan the IDs are different, companyid#5 and 
companyid#46. This is probably causing the resolution error?
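
For illustration (not part of the original comment), a minimal spark-shell sketch of how these attribute IDs can be inspected, assuming df1 and df2 are the two aggregated DataFrames from the report:

{code}
// The analyzed logical plans print every column with its expression ID
// (e.g. companyid#121), which is what the comparison above refers to.
println(df1.queryExecution.analyzed)
println(df2.queryExecution.analyzed)
// df1.explain(true) likewise prints the parsed, analyzed, optimized and physical plans
{code}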

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-13 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573512#comment-15573512
 ] 

Ashish Shrowty commented on SPARK-17709:


[~smilegator] I compiled with the added debug information and here is the 
output -

{code:java}
scala> val d1 = spark.sql("select * from testext2")
d1: org.apache.spark.sql.DataFrame = [productid: int, price: float ... 2 more 
fields]

scala> val df1 = 
d1.groupBy("companyid","productid").agg(sum("price").as("price"))
df1: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 1 
more field]

scala> val df2 = 
d1.groupBy("companyid","productid").agg(sum("count").as("count"))
df2: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 1 
more field]

scala> df1.join(df2, Seq("companyid", "productid")).show
org.apache.spark.sql.AnalysisException: using columns ['companyid,'productid] 
can not be resolved given input columns: [companyid, productid, price, count] ;;
'Join UsingJoin(Inner,List('companyid, 'productid))
:- Aggregate [companyid#121, productid#122], [companyid#121, productid#122, 
sum(cast(price#123 as double)) AS price#166]
:  +- Project [productid#122, price#123, count#124, companyid#121]
: +- SubqueryAlias testext2
:+- Relation[productid#122,price#123,count#124,companyid#121] parquet
+- Aggregate [companyid#121, productid#122], [companyid#121, productid#122, 
sum(cast(count#124 as bigint)) AS count#177L]
   +- Project [productid#122, price#123, count#124, companyid#121]
  +- SubqueryAlias testext2
 +- Relation[productid#122,price#123,count#124,companyid#121] parquet

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 48 elided

{code}

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-11 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566160#comment-15566160
 ] 

Ashish Shrowty commented on SPARK-17709:


Cool.. thanks. Will do this in next day or two.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-11 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566114#comment-15566114
 ] 

Ashish Shrowty commented on SPARK-17709:


I assume I would need to modify the Spark code and build the Spark libraries 
locally? I haven't done that before, but I'm willing to try. Are there any docs/links 
you can point me to that show the best way to go about this?

Thanks

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-10 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15562368#comment-15562368
 ] 

Ashish Shrowty commented on SPARK-17709:


[~dkbiswal], [~smilegator] - Hi guys .. any thoughts? I exchanged notes with 
AWS guys and they were able to replicate the same issue and believe that it 
might be a Spark issue.

Thanks,
Ashish

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Comment Edited] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-03 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543921#comment-15543921
 ] 

Ashish Shrowty edited comment on SPARK-17709 at 10/4/16 12:47 AM:
--

[~dkbiswal] Sorry Dilip .. I keep making typos .. the join was on companyid and 
productid -

scala> df1.join(df2, Seq("companyid","productid"))
org.apache.spark.sql.AnalysisException: using columns ['companyid,'productid] 
can not be resolved given input columns: [companyid, productid, avgprice, 
avgitemcount] ;
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 48 elided

Attached are the explain outputs for df1 and df2 -

scala> df1.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#53, productid#54], functions=[avg(price#56)])
+- Exchange hashpartitioning(companyid#53, productid#54, 200)
   +- *HashAggregate(keys=[companyid#53, productid#54], 
functions=[partial_avg(price#56)])
  +- *Sample 0.0, 0.5, false, 2419324063718201506
 +- *Project [companyid#53, productid#54, price#56]
+- *BatchedScan parquet 
referencedata.testproduct[productid#54,price#56,companyid#53] Format: 
ParquetFormat, InputPaths: 
s3://com.birdzi.datalake.test/testtable/companyid=100, 
s3://com.birdzi.datalake.test/testtable/co..., PushedFilters: [], ReadSchema: 
struct

scala> df2.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#53, productid#54], 
functions=[avg(cast(itemcount#57 as bigint))])
+- Exchange hashpartitioning(companyid#53, productid#54, 200)
   +- *HashAggregate(keys=[companyid#53, productid#54], 
functions=[partial_avg(cast(itemcount#57 as bigint))])
  +- *Sample 0.0, 0.5, false, -7492644014085475670
 +- *Project [companyid#53, productid#54, itemcount#57]
+- *BatchedScan parquet 
referencedata.testproduct[productid#54,itemcount#57,companyid#53] Format: 
ParquetFormat, InputPaths: 
s3://com.birdzi.datalake.test/testtable/companyid=100, 
s3://com.birdzi.datalake.test/testtable/co..., PushedFilters: [], ReadSchema: 
struct

Also the table structure for reference -
scala> spark.sql("describe referencedata.testproduct").show
+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
|           productid|      int|   null|
|                name|   string|   null|
|               price|   double|   null|
|           itemcount|      int|   null|
|           companyid|      int|   null|
|# Partition Infor...|         |       |
|          # col_name|data_type|comment|
|           companyid|      int|   null|
+--------------------+---------+-------+



was (Author: ashrowty):
[~dkbiswal] Sorry Dilip .. I keep making typos .. the join was on companyid and 
product id -

scala> df1.join(df2, Seq("companyid","productid"))
org.apache.spark.sql.AnalysisException: using columns ['companyid,'productid] 
can not be resolved given input columns: [companyid, productid, avgprice, 
avgitemcount] ;
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at 

[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-03 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543921#comment-15543921
 ] 

Ashish Shrowty commented on SPARK-17709:


[~dkbiswal] Sorry Dilip .. I keep making typos .. the join was on companyid and 
productid -

scala> df1.join(df2, Seq("companyid","productid"))
org.apache.spark.sql.AnalysisException: using columns ['companyid,'productid] 
can not be resolved given input columns: [companyid, productid, avgprice, 
avgitemcount] ;
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 48 elided

Attached are the explain outputs for df1 and df2 -

scala> df1.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#53, productid#54], functions=[avg(price#56)])
+- Exchange hashpartitioning(companyid#53, productid#54, 200)
   +- *HashAggregate(keys=[companyid#53, productid#54], 
functions=[partial_avg(price#56)])
  +- *Sample 0.0, 0.5, false, 2419324063718201506
 +- *Project [companyid#53, productid#54, price#56]
+- *BatchedScan parquet 
referencedata.testproduct[productid#54,price#56,companyid#53] Format: 
ParquetFormat, InputPaths: 
s3://com.birdzi.datalake.test/testtable/companyid=100, 
s3://com.birdzi.datalake.test/testtable/co..., PushedFilters: [], ReadSchema: 
struct

scala> df2.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#53, productid#54], 
functions=[avg(cast(itemcount#57 as bigint))])
+- Exchange hashpartitioning(companyid#53, productid#54, 200)
   +- *HashAggregate(keys=[companyid#53, productid#54], 
functions=[partial_avg(cast(itemcount#57 as bigint))])
  +- *Sample 0.0, 0.5, false, -7492644014085475670
 +- *Project [companyid#53, productid#54, itemcount#57]
+- *BatchedScan parquet 
referencedata.testproduct[productid#54,itemcount#57,companyid#53] Format: 
ParquetFormat, InputPaths: 
s3://com.birdzi.datalake.test/testtable/companyid=100, 
s3://com.birdzi.datalake.test/testtable/co..., PushedFilters: [], ReadSchema: 
struct


> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-01 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15538927#comment-15538927
 ] 

Ashish Shrowty commented on SPARK-17709:


[~dkbiswal] I just went through manual steps of creating the table in Hive 
(using EMR 5.0.0), inserting data into it, and then querying using spark .. and 
got the exception .. steps I followed - 
Step 1 - 
hive> create external table referencedata.testproduct (
> productid int,
> name string,
> price double,
> itemcount int
> ) PARTITIONED BY (companyid int)
> STORED AS PARQUET
> LOCATION 's3://com.birdzi.datalake.test/testtable'
> ;
Step 2 - Insert data -
set hive.exec.dynamic.partition.mode=nonstrict
insert into referencedata.testproduct partition(companyid) 
values(1,"p1",10.0,10,100);
insert into referencedata.testproduct partition(companyid) 
values(2,"p1",12.0,12,100);
insert into referencedata.testproduct partition(companyid) 
values(3,"p3",13.0,12,101);

Step 3 - query using spark-shell -
val d1 = spark.sql("select * from referencedata.testproduct")
val df1 = 
d1.sample(false,0.5).select("companyid","productid","price").groupBy("companyid","productid").agg(avg("price").as("avgprice"))
val df2 = 
d1.sample(false,0.5).select("companyid","productid","itemcount").groupBy("companyid","productid").agg(avg("itemcount").as("avgitemcount"))
df1.join(df2, Seq("companyid","loyaltycardnumber")) .. throws exception -
org.apache.spark.sql.AnalysisException: using columns 
['companyid,'loyaltycardnumber] can not be resolved given input columns: 
[companyid, productid, price, avgitemcount] ;
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 49 elided
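
(Note: the join keys in Step 3 contain a typo that is corrected in a later comment; per that follow-up the intended call was the one below, which still fails with the same AnalysisException.)

{code}
df1.join(df2, Seq("companyid", "productid"))   // corrected join keys, per the follow-up comment
{code}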


> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Comment Edited] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-01 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15538927#comment-15538927
 ] 

Ashish Shrowty edited comment on SPARK-17709 at 10/1/16 6:00 PM:
-

[~dkbiswal] I just went through manual steps of creating the table in Hive 
(using EMR 5.0.0), inserting data into it, and then querying using spark .. and 
got the exception .. steps I followed - 
Step 1 - 
hive> create external table referencedata.testproduct (
> productid int,
> name string,
> price double,
> itemcount int
> ) PARTITIONED BY (companyid int)
> STORED AS PARQUET
> LOCATION 's3://com.birdzi.datalake.test/testtable'
> ;
Step 2 - Insert data -
set hive.exec.dynamic.partition.mode=nonstrict
insert into referencedata.testproduct partition(companyid) 
values(1,"p1",10.0,10,100);
insert into referencedata.testproduct partition(companyid) 
values(2,"p1",12.0,12,100);
insert into referencedata.testproduct partition(companyid) 
values(3,"p3",13.0,12,101);

Step 3 - query using spark-shell -
val d1 = spark.sql("select * from referencedata.testproduct")
val df1 = 
d1.sample(false,0.5).select("companyid","productid","price").groupBy("companyid","productid").agg(avg("price").as("avgprice"))
val df2 = 
d1.sample(false,0.5).select("companyid","productid","itemcount").groupBy("companyid","productid").agg(avg("itemcount").as("avgitemcount"))
df1.join(df2, Seq("companyid","loyaltycardnumber")) .. throws exception -
org.apache.spark.sql.AnalysisException: using columns 
['companyid,'loyaltycardnumber] can not be resolved given input columns: 
[companyid, productid, price, avgitemcount] ;
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 49 elided



was (Author: ashrowty):
[~dkbiswal] I just went through manual steps of creating the table in Hive 
(using EMR 5.0.0), inserting data into it, and then querying using spark .. and 
got the exception .. steps I followed - 
Step 1 - 
hive> create external table referencedata.testproduct (
hive> create external table referencedata.testproduct (
> productid int,
> name string,
> price double,
> itemcount int
> ) PARTITIONED BY (companyid int)
> STORED AS PARQUET
> LOCATION 's3://com.birdzi.datalake.test/testtable'
> ;
Step 2 - Insert data -
set hive.exec.dynamic.partition.mode=nonstrict
insert into referencedata.testproduct partition(companyid) 
values(1,"p1",10.0,10,100);
insert into referencedata.testproduct partition(companyid) 
values(2,"p1",12.0,12,100);
insert into referencedata.testproduct partition(companyid) 
values(3,"p3",13.0,12,101);

Step 3 - query using spark-shell -
val d1 = spark.sql("select * from referencedata.testproduct")
val df1 = 
d1.sample(false,0.5).select("companyid","productid","price").groupBy("companyid","productid").agg(avg("price").as("avgprice"))
val df2 = 
d1.sample(false,0.5).select("companyid","productid","itemcount").groupBy("companyid","productid").agg(avg("itemcount").as("avgitemcount"))
df1.join(df2, Seq("companyid","loyaltycardnumber")) .. throws exception -
org.apache.spark.sql.AnalysisException: using columns 
['companyid,'loyaltycardnumber] can not be resolved given input columns: 
[companyid, productid, price, avgitemcount] ;
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 

[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537740#comment-15537740
 ] 

Ashish Shrowty commented on SPARK-17709:


Join keys are both companyid and loyaltycardnumber. Wonder why you are not 
seeing it. I tried it on a few other tables I have and it's the same behavior.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537264#comment-15537264
 ] 

Ashish Shrowty commented on SPARK-17709:


[~dkbiswal] Attached are the explain() outputs -

df1.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], 
functions=[avg(cast(itemcount#3372 as bigint))])
+- Exchange hashpartitioning(companyid#3364, loyaltycardnumber#3370, 200)
   +- *HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], 
functions=[partial_avg(cast(itemcount#3372 as bigint))])
  +- *Project [loyaltycardnumber#3370, itemcount#3372, companyid#3364]
 +- *BatchedScan parquet 
facts.storetransaction[loyaltycardnumber#3370,itemcount#3372,year#3362,month#3363,companyid#3364]
 Format: ParquetFormat, InputPaths: 
s3://com.birdzi.datalake.test/basedatasets/facts/storetransaction/2016-09-15-2012/year=2002/month...,
 PushedFilters: [], ReadSchema: struct

df2.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], 
functions=[avg(totalprice#3373)])
+- Exchange hashpartitioning(companyid#3364, loyaltycardnumber#3370, 200)
   +- *HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], 
functions=[partial_avg(totalprice#3373)])
  +- *Project [loyaltycardnumber#3370, totalprice#3373, companyid#3364]
 +- *BatchedScan parquet 
facts.storetransaction[loyaltycardnumber#3370,totalprice#3373,year#3362,month#3363,companyid#3364]
 Format: ParquetFormat, InputPaths: 
s3://com.birdzi.datalake.test/basedatasets/facts/storetransaction/2016-09-15-2012/year=2002/month...,
 PushedFilters: [], ReadSchema: 
struct


> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536694#comment-15536694
 ] 

Ashish Shrowty commented on SPARK-17709:


Sorry ... it's not really col1, it's another column .. edited it to col8

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Comment Edited] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535957#comment-15535957
 ] 

Ashish Shrowty edited comment on SPARK-17709 at 9/30/16 6:37 PM:
-

Sure .. the data is brought over into the EMR (5.0.0) HDFS cluster via sqoop. 
Once there, I issue the following commands in Hive (2.1.0) to store it in S3 -

CREATE EXTERNAL TABLE  (
   col1 bigint,
   col2 int,
   col3 string,
   
)
PARTITIONED BY (col8 int)
STORED AS PARQUET
LOCATION 's3_table_dir'

INSERT into 
SELECT col1,col2, FROM 



was (Author: ashrowty):
Sure .. the data is brought over into the EMR (5.0.0) HDFS cluster via sqoop. 
Once there, I issue the following commands in Hive (2.1.0) to store it in S3 -

CREATE EXTERNAL TABLE  (
   col1 bigint,
   col2 int,
   col3 string,
   
)
PARTITIONED BY (col1 int)
STORED AS PARQUET
LOCATION 's3_table_dir'

INSERT into 
SELECT col1,col2, FROM 


> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535957#comment-15535957
 ] 

Ashish Shrowty commented on SPARK-17709:


Sure .. the data is brought over into the EMR (5.0.0) HDFS cluster via sqoop. 
Once there, I issue the following commands in Hive (2.1.0) to store it in S3 -

CREATE EXTERNAL TABLE  (
   col1 bigint,
   col2 int,
   col3 string,
   
)
PARTITIONED BY (col1 int)
STORED AS PARQUET
LOCATION 's3_table_dir'

INSERT into 
SELECT col1,col2, FROM 


> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-29 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534518#comment-15534518
 ] 

Ashish Shrowty commented on SPARK-17709:


Dilip,

I tried your code and it works on my end too. It's only when I try to load an 
external table stored as parquet (in my case it's stored in S3). Attaching the stack 
trace in case that helps (this time I tried it on a different table, hence the 
difference in column names) -

org.apache.spark.sql.AnalysisException: using columns ['productid] can not be 
resolved given input columns: [productid, name1, name2] ;
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Updated] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-28 Thread Ashish Shrowty (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish Shrowty updated SPARK-17709:
---
Description: 
If I try to inner-join two dataframes which originated from the same initial 
dataframe that was loaded using spark.sql() call, it results in an error -

// reading from Hive .. the data is stored in Parquet format in Amazon S3
val d1 = spark.sql("select * from ")  
val df1 = d1.groupBy("key1","key2")
  .agg(avg("totalprice").as("avgtotalprice"))
val df2 = d1.groupBy("key1","key2")
  .agg(avg("itemcount").as("avgqty")) 
df1.join(df2, Seq("key1","key2")) gives error -
org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];

If the same Dataframe is initialized via spark.read.parquet(), the above code 
works. This same code above worked with Spark 1.6.2

  was:
If I try to inner-join two dataframes which originated from the same initial 
dataframe that was loaded using spark.sql() call, it results in an error -

// reading from Hive .. the data is stored in Parquet format in Amazon S3
val d1 = spark.sql("select * from ")  
val df1 = d1.groupBy("key1","key2")
  .agg(avg("totalprice").as("avgtotalprice"))
val df2 = d1.groupBy("key1","key2")
  .agg(avg("itemcount").as("avgqty")) 
df1.join(df2, Seq("key1","key2")) gives error -
org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];

If the same Dataframe is initialized via spark.read.parquet(), the above code 
works.


> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2






[jira] [Updated] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-28 Thread Ashish Shrowty (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish Shrowty updated SPARK-17709:
---
Priority: Major  (was: Critical)

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works.






[jira] [Created] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-28 Thread Ashish Shrowty (JIRA)
Ashish Shrowty created SPARK-17709:
--

 Summary: spark 2.0 join - column resolution error
 Key: SPARK-17709
 URL: https://issues.apache.org/jira/browse/SPARK-17709
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Ashish Shrowty
Priority: Critical


If I try to inner-join two dataframes which originated from the same initial 
dataframe that was loaded using spark.sql() call, it results in an error -

// reading from Hive .. the data is stored in Parquet format in Amazon S3
val d1 = spark.sql("select * from ")  
val df1 = d1.groupBy("key1","key2")
  .agg(avg("totalprice").as("avgtotalprice"))
val df2 = d1.groupBy("key1","key2")
  .agg(avg("itemcount").as("avgqty")) 
df1.join(df2, Seq("key1","key2")) gives error -
org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];

If the same Dataframe is initialized via spark.read.parquet(), the above code 
works.






[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2016-06-27 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350954#comment-15350954
 ] 

Ashish Shrowty commented on SPARK-2984:
---

Hi,

I would echo Sandeep's comments. I am seeing this same exception with Spark 
1.6.1 on EMR writing partitioned parquet files to S3. I am still experimenting 
with it, but when I turn on EMRFS, the exceptions seem to go away. Most 
probably it's the eventual consistency issue. It would be great if retries could be 
added within Spark for eventually consistent stores/filesystems.

-Ashish

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on a executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at 
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
> {noformat}
> I am running a Spark Streaming job that uses saveAsTextFiles to save results 
> into hdfs files. However, it has an exception after 20 batches
> result-140631234/_temporary/0/task_201407251119__m_03 does not 
> exist.
> {noformat}
> and
> {noformat}
>