[jira] [Commented] (SPARK-5049) ParquetTableScan always prepends the values of partition columns in output rows irrespective of the order of the partition columns in the original SELECT query

2015-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272748#comment-14272748
 ] 

Apache Spark commented on SPARK-5049:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/3990

 ParquetTableScan always prepends the values of partition columns in output 
 rows irrespective of the order of the partition columns in the original 
 SELECT query
 ---

 Key: SPARK-5049
 URL: https://issues.apache.org/jira/browse/SPARK-5049
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: Rahul Aggarwal

 This happens when ParquetTableScan is used, i.e. when 
 spark.sql.hive.convertMetastoreParquet is turned on.
 For example:
 spark-sql> set spark.sql.hive.convertMetastoreParquet=true;
 spark-sql> create table table1(a int, b int) partitioned by (p1 string, p2 int)
 ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
 STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
 OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
 spark-sql> insert into table table1 partition(p1='January', p2=1) select key,
 10 from src;
 spark-sql> select a, b, p1, p2 from table1 limit 10;
 January   1   484 10
 January   1   484 10
 January   1   484 10
 January   1   484 10
 January   1   484 10
 January   1   484 10
 January   1   484 10
 January   1   484 10
 January   1   484 10
 January   1   484 10
 The correct output should be:
 484   10  January 1
 484   10  January 1
 484   10  January 1
 484   10  January 1
 484   10  January 1
 484   10  January 1
 484   10  January 1
 484   10  January 1
 484   10  January 1
 484   10  January 1
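
 Reading the two listings together, the scan emits the partition values first
 and the data columns after them, i.e. in (p1, p2, a, b) order, while the query
 asked for (a, b, p1, p2). As a purely illustrative sketch (plain Scala,
 made-up names, no Spark APIs involved), the row values have to be permuted
 back into the requested order roughly like this:

 // Illustrative only: scanOutput is the column order the scan actually
 // produces (partition columns prepended); requested is the SELECT order.
 val scanOutput = Seq("p1", "p2", "a", "b")
 val requested  = Seq("a", "b", "p1", "p2")
 // For each requested column, find where it sits in the scan's output.
 val reorder = requested.map(c => scanOutput.indexOf(c))   // Seq(2, 3, 0, 1)
 // A raw row as emitted today: partition values first.
 val rawRow: Seq[Any] = Seq("January", 1, 484, 10)
 // Permute it into the order the query (and the schema) expects.
 val fixedRow = reorder.map(i => rawRow(i))                // Seq(484, 10, January, 1)
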
 This also leads to a schema mismatch if the query is run using HiveContext and 
 the result is a SchemaRDD.
 For example:
 scala> import org.apache.spark.sql.hive._
 scala> val hc = new HiveContext(sc)
 scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
 scala> val res = hc.sql("select a, b, p1, p2 from table1 limit 10")
 scala> res.collect
 res2: Array[org.apache.spark.sql.Row] = Array([January,1,238,10], 
 [January,1,86,10], [January,1,311,10], [January,1,27,10], [January,1,165,10], 
 [January,1,409,10], [January,1,255,10], [January,1,278,10], 
 [January,1,98,10], [January,1,484,10])
 scala res.schema
 res5: org.apache.spark.sql.StructType = 
 StructType(ArrayBuffer(StructField(a,IntegerType,true), 
 StructField(b,IntegerType,true), StructField(p1,StringType,true), 
 StructField(p2,IntegerType,true)))
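
 Until the ordering is fixed, one possible workaround (not verified here, but
 consistent with the fact that the problem only appears once the conversion is
 turned on) is to switch the conversion off, so the table is read through the
 ordinary Hive SerDe path instead of ParquetTableScan:

 scala> // Fall back to the Hive SerDe read path for this session.
 scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "false")
 scala> hc.sql("select a, b, p1, p2 from table1 limit 10").collect

 With the conversion disabled, the projected columns should come back in the
 order the query lists them, at the cost of the native Parquet scan.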





[jira] [Commented] (SPARK-5049) ParquetTableScan always prepends the values of partition columns in output rows irrespective of the order of the partition columns in the original SELECT query

2015-01-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262546#comment-14262546
 ] 

Apache Spark commented on SPARK-5049:
-

User 'rahulaggarwalguavus' has created a pull request for this issue:
https://github.com/apache/spark/pull/3870


[jira] [Commented] (SPARK-5049) ParquetTableScan always prepends the values of partition columns in output rows irrespective of the order of the partition columns in the original SELECT query

2015-01-01 Thread Rahul Aggarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262547#comment-14262547
 ] 

Rahul Aggarwal commented on SPARK-5049:
---

https://github.com/apache/spark/pull/3870
