[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-08-01 Thread Manu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897948#comment-16897948
 ] 

Manu Zhang commented on SPARK-27689:


[~lambda], [~hyukjin.kwon], [~yumwang],

This issue should be fixed by 
[PR#24960|https://github.com/apache/spark/pull/24960] 
([PR#25068|https://github.com/apache/spark/pull/25068] for branch-2.4 and 
[PR#25293|https://github.com/apache/spark/pull/25293] for 2.3), where the 
analysis of a View is deferred to the optimizer. 

Before that fix, the View was analyzed first, and JOIN would try to deduplicate 
conflicting attributes (since you are joining on the same id_person column) by 
*replacing* the ids of the View's plan in the *right* branch. That's where 
things went wrong. Take a look at the logical plan under View:
{code:java}
+- View (`schema_p`.`person_product_v`, 
[id_person#76,id_product#77,country#78,city#79,price#80,start_date#81,end_date#82])
  +- !Project [cast(id_person#103 as int) AS id_person#76, cast(id_product#104 
as int) AS id_product#77, cast(country#105 as string) AS country#78, 
cast(city#106 as string) AS city#79, cast(price#107 as decimal(38,8)) AS 
price#80, cast(start_date#108 as date) AS start_date#81, cast(end_date#109 as 
date) AS end_date#82]
+- Project [cast(id_person#7 as int) AS id_person#0, cast(id_product#8 as 
int) AS id_product#1, cast(country#9 as string) AS country#2, cast(city#10 as 
string) AS city#3, cast(price#11 as decimal(38,8)) AS price#4, 
cast(start_date#12 as date) AS start_date#5, cast(end_date#13 as date) AS 
end_date#6]
{code}
The input of the {{!Project}} node (e.g. id_person#103) didn't match the output 
of its child (e.g. id_person#0). That's because id_person#103 was substituted 
for id_person#0 in the deduplication process, while the id_person#0 defined by 
the child could not be replaced, since it is produced by an Alias rather than 
being an attribute reference (check out Alias).
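
To see why rewriting ids only in *references* leaves the plan inconsistent, here is a toy Python sketch of the deduplication step described above. The classes are hypothetical stand-ins for illustration only, not Spark's actual Catalyst types:

```python
# Toy model of attribute deduplication in a self-join (illustration only).
# The rewrite fires on *references* to an attribute (AttrRef); an Alias
# *defines* an attribute and keeps its original id. A parent node can then
# refer to ids its child no longer produces -- the "!Project" mismatch above.

class AttrRef:
    """A reference to an attribute, e.g. id_person#0."""
    def __init__(self, name, attr_id):
        self.name, self.attr_id = name, attr_id

class Alias:
    """Defines a new attribute from a child expression,
    e.g. cast(id_person#0 as int) AS id_person#76."""
    def __init__(self, child, name, attr_id):
        self.child, self.name, self.attr_id = child, name, attr_id

def rewrite(expr, id_map):
    # Deduplication rewrites attribute *references* only.
    if isinstance(expr, AttrRef):
        return AttrRef(expr.name, id_map.get(expr.attr_id, expr.attr_id))
    if isinstance(expr, Alias):
        # The alias's own id is a definition, not a reference,
        # so only its child is rewritten.
        return Alias(rewrite(expr.child, id_map), expr.name, expr.attr_id)
    return expr

# Child Project defines id_person#0; the parent Project references it.
child_out = Alias(AttrRef("id_person", 7), "id_person", 0)
parent_in = AttrRef("id_person", 0)

# Dedup decides to replace id 0 with a fresh id 103 in the right branch.
id_map = {0: 103}
new_parent = rewrite(parent_in, id_map)   # now refers to id_person#103
new_child = rewrite(child_out, id_map)    # still *defines* id_person#0

# Mismatch: the parent asks for #103, the child still produces #0.
print(new_parent.attr_id, new_child.attr_id)  # 103 0
```

Deferring the View's analysis to the optimizer avoids this state, because the ids inside the View are not yet fixed when the join's deduplication runs.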
  
 This can be reproduced with a unit test like:
{code}
test("SparkSQL failed to resolve attributes with nested self-joins on hive view") {
  withTable("hive_table") {
    withView("hive_view", "temp_view1", "temp_view2") {
      sql("CREATE TABLE hive_table AS SELECT 1 AS id")
      sql("CREATE VIEW hive_view AS SELECT id FROM hive_table")
      sql("CREATE TEMPORARY VIEW temp_view1 AS SELECT id FROM hive_view")
      sql("CREATE TEMPORARY VIEW temp_view2 AS SELECT a.id " +
        "FROM temp_view1 AS a JOIN temp_view1 AS b ON a.id = b.id")
      val df = sql("SELECT c.id FROM temp_view1 AS c JOIN temp_view2 AS d ON c.id = d.id")
      checkAnswer(df, Row(1))
    }
  }
}
{code}

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Lambda
>Priority: Major
>
> I have a python error when I execute the following code using hive views but 
> it works correctly when I run it with hive tables.
> *Hive databases:*
> {code:java}
> CREATE DATABASE schema_p LOCATION "hdfs:///tmp/schema_p";
> {code}
> *Hive tables:*
> {code:java}
> CREATE TABLE schema_p.product(
>  id_product string,
>  name string,
>  country string,
>  city string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/product';
> {code}
> {code:java}
> CREATE TABLE schema_p.person_product(
>  id_person string,
>  id_product string,
>  country string,
>  city string,
>  price string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/person_product';
> {code}
> *Hive views:*
> {code:java}
> CREATE VIEW schema_p.product_v AS SELECT CAST(id_product AS INT) AS 
> id_product, name AS name, country AS country, city AS city, CAST(start_date 
> AS DATE) AS start_date, CAST(end_date AS DATE) AS end_date FROM 
> schema_p.product;
>  
> CREATE VIEW schema_p.person_product_v AS SELECT CAST(id_person AS INT) AS 
> id_person, CAST(id_product AS INT) AS id_product, country AS country, city AS 
> city, CAST(price AS DECIMAL(38,8)) AS price, CAST(start_date AS DATE) AS 
> start_date, CAST(end_date AS DATE) AS end_date FROM schema_p.person_product;
> {code}
> *Code*:
> {code:java}
> def read_tables(sc):
>   in_dict = { 'product': 'product_v', 'person_product': 'person_product_v' }
>   data_dict = {}
>   for n, d in in_dict.iteritems():
> data_dict[n] = sc.read.table(d)
>   return data_dict
> def 

[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-07-31 Thread feiwang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897120#comment-16897120
 ] 

feiwang commented on SPARK-27689:
-

You can add a unit test to HiveSQLViewSuite.scala to reproduce it with the 
code below.
{code}
withTable("ta") {
  withView("va") {
    withView("vb") {
      withView("vc") {
        sql("CREATE TABLE ta (c1 STRING)")
        sql("CREATE VIEW va(c1) AS SELECT * FROM ta")
        sql("CREATE TEMPORARY VIEW vb AS SELECT a.c1 FROM va AS a")
        sql("CREATE TEMPORARY VIEW vc AS SELECT a.c1 FROM vb AS a JOIN vb " +
          "AS b ON a.c1 = b.c1")
        sql("SELECT a.c1 FROM vb AS a JOIN vc AS b ON a.c1 = b.c1")
      }
    }
  }
}
{code}


[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-07-30 Thread feiwang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896249#comment-16896249
 ] 

feiwang commented on SPARK-27689:
-

It seems that this failure was introduced by the PR for SPARK-18801, 
https://github.com/apache/spark/pull/16233.



[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-07-11 Thread Juan Antonio (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882716#comment-16882716
 ] 

Juan Antonio commented on SPARK-27689:
--

Any news about this? 

Regards.


[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-05-20 Thread Juan Antonio (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843686#comment-16843686
 ] 

Juan Antonio commented on SPARK-27689:
--

I have just updated the code to simplify it. I have deleted several lines and a 
table with its view. On the other hand, the line of code that causes this error is:
{quote}{{final = person_product_join.join(person_count, 'id_person', 'left')}}{quote}
If you change this line to:
{quote}{{final = person_count.join(person_product_join, 'id_person', 'right')}}{quote}
the error doesn't appear. The error seems to depend on the order of the tables 
in the join:
{quote}final = person_product_join.join(person_count, 'id_person', 'left') 
--> Error

final = person_product_join.join(person_count, 'id_person', 'right') 
--> Error

final = person_count.join(person_product_join, 'id_person', 'right') 
--> No error

final = person_count.join(person_product_join, 'id_person', 'left') 
--> No error
{quote}
Regards.


[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-05-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842124#comment-16842124
 ] 

Hyukjin Kwon commented on SPARK-27689:
--

You can remove one op at a time and try again to see if it is still 
reproducible. That can produce a narrowed-down reproducer.
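
The "remove one op and retry" loop can be sketched in a few lines of plain Python (not Spark-specific; the op names and the failure predicate below are hypothetical stand-ins for the steps in the reporter's script):

```python
# Greedy minimization: repeatedly drop any single step whose removal
# still reproduces the failure, until no more steps can be dropped.

def minimize(steps, still_fails):
    """Shrink `steps` while `still_fails(steps)` keeps returning True."""
    shrunk = list(steps)
    changed = True
    while changed:
        changed = False
        for i in range(len(shrunk)):
            candidate = shrunk[:i] + shrunk[i + 1:]
            if still_fails(candidate):
                shrunk = candidate
                changed = True
                break
    return shrunk

# Hypothetical failure predicate: suppose the bug needs both joins present.
ops = ["read_views", "groupBy", "join_product", "join_count", "join_left"]
fails = lambda s: "join_count" in s and "join_left" in s
print(minimize(ops, fails))  # ['join_count', 'join_left']
```

In practice `still_fails` would rerun the PySpark script with the remaining steps and check whether the AnalysisException is still raised.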

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Juan Antonio
>Priority: Major
>
> I have a python error when I execute the following code using hive views but 
> it works correctly when I run it with hive tables.
> *Hive databases:*
> {code:java}
> CREATE DATABASE schema_p LOCATION "hdfs:///tmp/schema_p";
> {code}
> *Hive tables:*
> {code:java}
> CREATE TABLE schema_p.person(
>  id_person string, 
>  identifier string, 
>  gender string, 
>  start_date string, 
>  end_date string)
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/person';
> {code}
> {code:java}
> CREATE TABLE schema_p.product(
>  id_product string,
>  name string,
>  country string,
>  city string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/product';
> {code}
> {code:java}
> CREATE TABLE schema_p.person_product(
>  id_person string,
>  id_product string,
>  country string,
>  city string,
>  price string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/person_product';
> {code}
> *Hive views:*
> {code:java}
> CREATE VIEW schema_p.person_v AS SELECT CAST(id_person AS INT) AS id_person, 
> CAST(identifier AS INT) AS identifier, gender AS gender, CAST(start_date AS 
> DATE) AS start_date, CAST(end_date AS DATE) AS end_date FROM schema_p.person;
>  CREATE VIEW schema_p.product_v AS SELECT CAST(id_product AS INT) AS 
> id_product, name AS name, country AS country, city AS city, CAST(start_date 
> AS DATE) AS start_date, CAST(end_date AS DATE) AS end_date FROM 
> schema_p.product;
>  CREATE VIEW schema_p.person_product_v AS SELECT CAST(id_person AS INT) AS 
> id_person, CAST(id_product AS INT) AS id_product, country AS country, city AS 
> city, CAST(price AS DECIMAL(38,8)) AS price, CAST(start_date AS DATE) AS 
> start_date, CAST(end_date AS DATE) AS end_date FROM schema_p.person_product;
> {code}
> *Code*:
> {code}
> def read_tables(sc):
>   in_dict = { 'person': 'person_v', 'product': 'product_v', 'person_product': 'person_product_v' }
>   data_dict = {}
>   for n, d in in_dict.iteritems():
> data_dict[n] = sc.read.table(d)
>   return data_dict
> def get_population(tables, ref_date_str):
>   person = tables['person']
>   product = tables['product']
>   person_product = tables['person_product']
>   person_product_join = person_product.join(product,'id_product')
>   count_prod = person_product.groupBy('id_product').agg(F.count('id_product').alias('count_prod'))
>   person_count = person_product_join.join(count_prod,'id_product')
>   final1 = person_product_join.join(person_count, 'id_person', 'left')
>   final = final1.withColumn('reference_date', F.lit(ref_date_str))
>   return final
> import pyspark.sql.functions as F
> import functools
> from pyspark.sql.functions import col
> from pyspark.sql.functions import add_months, lit, count, coalesce
> spark.sql('use schema_p')
> data_dict = read_tables(spark)
> data_dict
> population = get_population(data_dict, '2019-04-30')
> population.show(){code}
>  
> *Error:*
> {code:java}
> File "<stdin>", line 1, in <module>
>  File "<stdin>", line 10, in get_population
>  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 
> 931, in join
>  jdf = self._jdf.join(other._jdf, on, how)
>  File 
> "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py",
>  line 1160, in __call__
>  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, 
> in deco
>  raise AnalysisException(s.split(': ', 1)[1], stackTrace)
>  pyspark.sql.utils.AnalysisException: u'Resolved attribute(s) 
> id_product#124,end_date#129,city#126,price#127,start_date#128,id_person#123,country#125
>  missing from 
> city#47,price#48,start_date#49,id_product#45,end_date#50,id_person#44,country#46
>  in operator !Project [cast(id_person#123 as int) AS id_person#96, 
> cast(id_product#124
> {code}

[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-05-15 Thread Juan Antonio (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840353#comment-16840353
 ] 

Juan Antonio commented on SPARK-27689:
--

I don't know what else I can add, because I already wrote how to simulate the 
error and the message that it produces. Could you give me a clue, please?

Besides, I think the error is generated by the CheckAnalysis class, which is in 
the catalyst package.

Regards

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Juan Antonio
>Priority: Minor
>

[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-05-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840186#comment-16840186
 ] 

Hyukjin Kwon commented on SPARK-27689:
--

Please avoid setting Critical+, which is usually reserved for committers.

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Juan Antonio
>Priority: Major
>

[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-05-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840184#comment-16840184
 ] 

Hyukjin Kwon commented on SPARK-27689:
--

[~lambda], please narrow down the problem if you're not going to fix it 
yourself. Otherwise, no one can, or will be willing to, investigate.

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Juan Antonio
>Priority: Critical
>

[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-05-14 Thread Juan Antonio (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839170#comment-16839170
 ] 

Juan Antonio commented on SPARK-27689:
--

Thanks, Yuming Wang.

First of all, I executed the SQL statements directly on Hive and they run 
correctly. The problem appears only when the Spark code runs against the Hive 
views; there it fails. On the other hand, the Spark code works when run against 
the tables directly (the change is trivial: in the read_tables function, 
replace each view name with the corresponding table name by removing the _v 
suffix). Another important point about this issue is that if you use the 
DataFrame's withColumnRenamed method, or if you write the DataFrame to HDFS and 
read it back, some parts of the code start working.

To reproduce the error:

1) Create the Hive database.

2) Create the Hive tables.

3) Create the Hive views.

4) Open pyspark 2.3.0 and paste the Spark code.

I have checked other similar bugs, and this may be related to 
https://issues.apache.org/jira/browse/SPARK-25051 and 
https://issues.apache.org/jira/browse/SPARK-24865 . I also ran this code with 
Spark 2.4.3 because of SPARK-24865, but it still failed.

 

Regards.

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Juan Antonio
>Priority: Critical
>

[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-05-14 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839114#comment-16839114
 ] 

Yuming Wang commented on SPARK-27689:
-

I can't reproduce it after translating it to SQL. [~lambda], please correct me 
if I'm wrong:
{code:sql}
-- person_product_join
SELECT t1.*, t2.name
FROM person_product t1
JOIN product t2 ON t1.id_product = t2.id_product ;

-- count_prod
SELECT id_product, COUNT(id_product) AS count_prod
FROM person_product
GROUP BY id_product;

-- person_count
SELECT t3.*, t4.count_prod
FROM (
SELECT t1.*, t2.name
FROM person_product t1
JOIN product t2 ON t1.id_product = t2.id_product
) t3
JOIN (
SELECT id_product, COUNT(id_product) AS count_prod
FROM person_product
GROUP BY id_product
) t4 ON t3.id_product = t4.id_product;

-- final1
SELECT t5.*, t6.count_prod
FROM (
SELECT t1.*, t2.name
FROM person_product t1
JOIN product t2 ON t1.id_product = t2.id_product
) t5
LEFT JOIN (
SELECT t3.*, t4.count_prod
FROM (
SELECT t1.*, t2.name
FROM person_product t1
JOIN product t2 ON t1.id_product = t2.id_product
) t3
JOIN (
SELECT id_product, COUNT(id_product) AS count_prod
FROM person_product
GROUP BY id_product
) t4 ON t3.id_product = t4.id_product
) t6 ON t5.id_person = t6.id_person;

-- final
SELECT t5.*, t6.count_prod, '2019-04-30' as reference_date
FROM (
SELECT t1.*, t2.name
FROM person_product t1
JOIN product t2 ON t1.id_product = t2.id_product
) t5
LEFT JOIN (
SELECT t3.*, t4.count_prod
FROM (
SELECT t1.*, t2.name
FROM person_product t1
JOIN product t2 ON t1.id_product = t2.id_product
) t3
JOIN (
SELECT id_product, COUNT(id_product) AS count_prod
FROM person_product
GROUP BY id_product
) t4 ON t3.id_product = t4.id_product
) t6 ON t5.id_person = t6.id_person;
{code}

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Juan Antonio
>Priority: Critical
>

[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-05-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838696#comment-16838696
 ] 

Yuming Wang commented on SPARK-27689:
-

Thank you, [~lambda], I will check it.

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Juan Antonio
>Priority: Critical
>