[jira] [Assigned] (SPARK-19658) Set NumPartitions of RepartitionByExpression In Analyzer

2017-02-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19658:


Assignee: Apache Spark  (was: Xiao Li)

> Set NumPartitions of RepartitionByExpression In Analyzer
> 
>
> Key: SPARK-19658
> URL: https://issues.apache.org/jira/browse/SPARK-19658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Currently, if {{NumPartitions}} is not set, we set it from 
> {{spark.sql.shuffle.partitions}} in the Planner. However, this does not follow 
> the general resolution process. We should set it in the Analyzer so that the 
> Optimizer can use the value for optimization. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19658) Set NumPartitions of RepartitionByExpression In Analyzer

2017-02-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873522#comment-15873522
 ] 

Apache Spark commented on SPARK-19658:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16988

> Set NumPartitions of RepartitionByExpression In Analyzer
> 
>
> Key: SPARK-19658
> URL: https://issues.apache.org/jira/browse/SPARK-19658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Currently, if {{NumPartitions}} is not set, we set it from 
> {{spark.sql.shuffle.partitions}} in the Planner. However, this does not follow 
> the general resolution process. We should set it in the Analyzer so that the 
> Optimizer can use the value for optimization. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19658) Set NumPartitions of RepartitionByExpression In Analyzer

2017-02-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19658:


Assignee: Xiao Li  (was: Apache Spark)

> Set NumPartitions of RepartitionByExpression In Analyzer
> 
>
> Key: SPARK-19658
> URL: https://issues.apache.org/jira/browse/SPARK-19658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Currently, if {{NumPartitions}} is not set, we set it from 
> {{spark.sql.shuffle.partitions}} in the Planner. However, this does not follow 
> the general resolution process. We should set it in the Analyzer so that the 
> Optimizer can use the value for optimization. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19658) Set NumPartitions of RepartitionByExpression In Analyzer

2017-02-18 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19658:
---

 Summary: Set NumPartitions of RepartitionByExpression In Analyzer
 Key: SPARK-19658
 URL: https://issues.apache.org/jira/browse/SPARK-19658
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li


Currently, if {{NumPartitions}} is not set, we set it from 
{{spark.sql.shuffle.partitions}} in the Planner. However, this does not follow 
the general resolution process. We should set it in the Analyzer so that the 
Optimizer can use the value for optimization. 
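
For illustration, a minimal spark-shell sketch (not from this ticket; it assumes 
the default {{spark.sql.shuffle.partitions}} of 200) of how the partition count 
is currently filled in when it is not given explicitly:

{code}
// Expression-based repartition without an explicit count falls back to
// spark.sql.shuffle.partitions; an explicit count is honored as-is.
val df = spark.range(100).toDF("id")
df.repartition($"id").rdd.getNumPartitions      // 200, taken from the config
df.repartition(8, $"id").rdd.getNumPartitions   // 8, as requested
{code}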





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19647) Spark query hive is extremelly slow even the result data is small

2017-02-18 Thread wuchang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873490#comment-15873490
 ] 

wuchang commented on SPARK-19647:
-

Hi, I don't think this is just a question; it may also be a quite serious 
bug. But I am not sure.

> Spark query hive is extremelly slow even the result data is small
> -
>
> Key: SPARK-19647
> URL: https://issues.apache.org/jira/browse/SPARK-19647
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: wuchang
>Priority: Critical
>
> I am using Spark 2.0.0 to query a Hive table.
> My SQL is:
> select * from app.abtestmsg_v limit 10
> Yes, I want to get the first 10 records from the view app.abtestmsg_v.
> When I run this SQL in spark-shell, it is very fast, taking about 2 seconds.
> But the problem comes when I try to implement this query in my Python code.
> I am using Spark 2.0.0 and wrote a very simple PySpark program; the code is:
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import *
> import json
> hc = HiveContext(sc)
> hc.setConf("hive.exec.orc.split.strategy", "ETL")
> hc.setConf("hive.security.authorization.enabled", "false")
> zj_sql = 'select * from app.abtestmsg_v limit 10'
> zj_df = hc.sql(zj_sql)
> zj_df.collect()
> From the info log I find that although I use "limit 10" to tell Spark that I 
> just want the first 10 records, Spark still scans and reads all files of the 
> view (in my case, the source data of this view is about 100 files, each 
> about 1 GB in size). So there are nearly 100 tasks, each task reads one 
> file, and all the tasks are executed serially. It takes nearly 15 minutes to 
> finish these 100 tasks, but all I want is the first 10 records.
> So I don't know what to do or what is wrong.
> Could anybody give me some suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16931) PySpark access to data-frame bucketing api

2017-02-18 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873329#comment-15873329
 ] 

Maciej Szymkiewicz commented on SPARK-16931:


[~sowen] Is there any particular reason for "Won't Fix"? I don't want to get 
in your way, but it is an important feature and should be implemented sooner 
or later. I am happy to see this through if no one else is interested.

> PySpark access to data-frame bucketing api
> --
>
> Key: SPARK-16931
> URL: https://issues.apache.org/jira/browse/SPARK-16931
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Greg Bowyer
>
> Attached is a patch that enables bucketing for pyspark using the dataframe 
> API.
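
For reference, a rough sketch of the existing Scala {{DataFrameWriter}} bucketing 
API that this ticket asks to expose in PySpark (the DataFrame {{df}}, the column 
name and the table name below are assumptions):

{code}
// Scala side today: write a bucketed, sorted table. The attached patch would
// expose equivalent bucketBy/sortBy calls on the Python DataFrameWriter.
df.write
  .bucketBy(8, "user_id")
  .sortBy("user_id")
  .saveAsTable("bucketed_users")
{code}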



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18891) Support for specific collection types

2017-02-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18891:


Assignee: Apache Spark

> Support for specific collection types
> -
>
> Key: SPARK-18891
> URL: https://issues.apache.org/jira/browse/SPARK-18891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Critical
>
> Encoders treat all collections the same (e.g. {{Seq}} vs {{List}}), which 
> forces users to define classes with only the most generic collection type.
> An [example 
> error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]:
> {code}
> case class SpecificCollection(aList: List[Int])
> Seq(SpecificCollection(1 :: Nil)).toDS().collect()
> {code}
> {code}
> java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 98, Column 120: No applicable constructor/method found 
> for actual parameters "scala.collection.Seq"; candidates are: 
> "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)"
> {code}
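
Until specific collection types are supported, a common workaround (a sketch, 
not from this ticket) is to declare the field with the generic {{Seq}} type, 
which the built-in encoders already handle:

{code}
// Works with the existing encoders, at the cost of losing the List-specific type.
case class GenericCollection(aList: Seq[Int])
Seq(GenericCollection(1 :: Nil)).toDS().collect()
{code}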



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18891) Support for specific collection types

2017-02-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18891:


Assignee: (was: Apache Spark)

> Support for specific collection types
> -
>
> Key: SPARK-18891
> URL: https://issues.apache.org/jira/browse/SPARK-18891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> Encoders treat all collections the same (e.g. {{Seq}} vs {{List}}), which 
> forces users to define classes with only the most generic collection type.
> An [example 
> error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]:
> {code}
> case class SpecificCollection(aList: List[Int])
> Seq(SpecificCollection(1 :: Nil)).toDS().collect()
> {code}
> {code}
> java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 98, Column 120: No applicable constructor/method found 
> for actual parameters "scala.collection.Seq"; candidates are: 
> "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18891) Support for specific collection types

2017-02-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873300#comment-15873300
 ] 

Apache Spark commented on SPARK-18891:
--

User 'michalsenkyr' has created a pull request for this issue:
https://github.com/apache/spark/pull/16986

> Support for specific collection types
> -
>
> Key: SPARK-18891
> URL: https://issues.apache.org/jira/browse/SPARK-18891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> Encoders treat all collections the same (e.g. {{Seq}} vs {{List}}), which 
> forces users to define classes with only the most generic collection type.
> An [example 
> error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]:
> {code}
> case class SpecificCollection(aList: List[Int])
> Seq(SpecificCollection(1 :: Nil)).toDS().collect()
> {code}
> {code}
> java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 98, Column 120: No applicable constructor/method found 
> for actual parameters "scala.collection.Seq"; candidates are: 
> "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-02-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19122:


Assignee: Apache Spark

> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>Assignee: Apache Spark
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in that 
> order).
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if the join predicates are specified in the query in the *same* order as 
> the bucketing and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}
> The same query with the join predicates in a *different* order from the 
> bucketing and sort order leads to an extra shuffle and sort being introduced:
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(k#62, j#61, 200)
> : +- *Project [i#60, j#61, k#62]
> :+- *Filter (isnotnull(k#62) && isnotnull(j#61))
> :   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k#101, j#100, 200)
>   +- *Project [i#99, j#100, k#101]
>  +- *Filter (isnotnull(j#100) && isnotnull(k#101))
> +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}
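
As a quick check (a sketch, not from this ticket), the bucketing and sort columns 
that the join key order must currently match can be read off the detailed table 
description:

{code}
scala> hc.sql("DESCRIBE FORMATTED table1").show(100, false)
// Look for the bucket count, bucket columns and sort columns in the output.
{code}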



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-02-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19122:


Assignee: (was: Apache Spark)

> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in that 
> order).
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if the join predicates are specified in the query in the *same* order as 
> the bucketing and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}
> The same query with the join predicates in a *different* order from the 
> bucketing and sort order leads to an extra shuffle and sort being introduced:
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(k#62, j#61, 200)
> : +- *Project [i#60, j#61, k#62]
> :+- *Filter (isnotnull(k#62) && isnotnull(j#61))
> :   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k#101, j#100, 200)
>   +- *Project [i#99, j#100, k#101]
>  +- *Filter (isnotnull(j#100) && isnotnull(k#101))
> +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-02-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873287#comment-15873287
 ] 

Apache Spark commented on SPARK-19122:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/16985

> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in that 
> order).
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if the join predicates are specified in the query in the *same* order as 
> the bucketing and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}
> The same query with the join predicates in a *different* order from the 
> bucketing and sort order leads to an extra shuffle and sort being introduced:
> {code}
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(k#62, j#61, 200)
> : +- *Project [i#60, j#61, k#62]
> :+- *Filter (isnotnull(k#62) && isnotnull(j#61))
> :   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct
> +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k#101, j#100, 200)
>   +- *Project [i#99, j#100, k#101]
>  +- *Filter (isnotnull(j#100) && isnotnull(k#101))
> +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19638) Filter pushdown not working for struct fields

2017-02-18 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro closed SPARK-19638.

Resolution: Duplicate

> Filter pushdown not working for struct fields
> -
>
> Key: SPARK-19638
> URL: https://issues.apache.org/jira/browse/SPARK-19638
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Working with a dataset containing struct fields, and enabling debug logging 
> in the ES connector, I'm seeing the following behavior. The dataframe is 
> created over the ES connector and then the schema is extended with a couple 
> of column aliases, such as:
> {noformat}
> df.withColumn("f2", df("foo"))
> {noformat}
> Queries against those alias columns work as expected for fields that are 
> non-struct members.
> {noformat}
> scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
> 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters 
> [IsNotNull(foo),EqualTo(foo,1)]
> 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL 
> [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
> {noformat}
> However, try the same with an alias over a struct field, and no filters are 
> pushed down.
> {noformat}
> scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == 
> '1'").limit(1).show
> {noformat}
> In fact, this is the case even when no alias is used at all.
> {noformat}
> scala> df.where("bar.baz == '1'").limit(1).show
> {noformat}
> Basically, pushdown for structs doesn't work at all.
> Maybe this is specific to the ES connector?
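
As a quick sanity check (a sketch, not from this ticket), the physical plan also 
shows which predicates reach the data source, so the behavior above can be 
confirmed without connector debug logging:

{noformat}
scala> df.where("foo == '1'").explain(true)      // the scan node lists the pushed-down predicates
scala> df.where("bar.baz == '1'").explain(true)  // per this ticket, the struct-field filter is not pushed
{noformat}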



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19638) Filter pushdown not working for struct fields

2017-02-18 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873237#comment-15873237
 ] 

Takeshi Yamamuro edited comment on SPARK-19638 at 2/18/17 4:25 PM:
---

I found this ticket is duplicated to SPARK-17636, so I'll close as "Duplicated".


was (Author: maropu):
I found this ticket is duplicated to SPARK-17636, so I'll close as 
"Duplicated". Thanks.

> Filter pushdown not working for struct fields
> -
>
> Key: SPARK-19638
> URL: https://issues.apache.org/jira/browse/SPARK-19638
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Working with a dataset containing struct fields, and enabling debug logging 
> in the ES connector, I'm seeing the following behavior. The dataframe is 
> created over the ES connector and then the schema is extended with a couple 
> of column aliases, such as:
> {noformat}
> df.withColumn("f2", df("foo"))
> {noformat}
> Queries against those alias columns work as expected for fields that are 
> non-struct members.
> {noformat}
> scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
> 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters 
> [IsNotNull(foo),EqualTo(foo,1)]
> 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL 
> [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
> {noformat}
> However, try the same with an alias over a struct field, and no filters are 
> pushed down.
> {noformat}
> scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == 
> '1'").limit(1).show
> {noformat}
> In fact, this is the case even when no alias is used at all.
> {noformat}
> scala> df.where("bar.baz == '1'").limit(1).show
> {noformat}
> Basically, pushdown for structs doesn't work at all.
> Maybe this is specific to the ES connector?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19638) Filter pushdown not working for struct fields

2017-02-18 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873237#comment-15873237
 ] 

Takeshi Yamamuro commented on SPARK-19638:
--

I found this ticket is duplicated to SPARK-17636, so I'll close as 
"Duplicated". Thanks.

> Filter pushdown not working for struct fields
> -
>
> Key: SPARK-19638
> URL: https://issues.apache.org/jira/browse/SPARK-19638
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Working with a dataset containing struct fields, and enabling debug logging 
> in the ES connector, I'm seeing the following behavior. The dataframe is 
> created over the ES connector and then the schema is extended with a couple 
> of column aliases, such as:
> {noformat}
> df.withColumn("f2", df("foo"))
> {noformat}
> Queries against those alias columns work as expected for fields that are 
> non-struct members.
> {noformat}
> scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
> 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters 
> [IsNotNull(foo),EqualTo(foo,1)]
> 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL 
> [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
> {noformat}
> However, try the same with an alias over a struct field, and no filters are 
> pushed down.
> {noformat}
> scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == 
> '1'").limit(1).show
> {noformat}
> In fact, this is the case even when no alias is used at all.
> {noformat}
> scala> df.where("bar.baz == '1'").limit(1).show
> {noformat}
> Basically, pushdown for structs doesn't work at all.
> Maybe this is specific to the ES connector?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8510) NumPy arrays and matrices as values in sequence files

2017-02-18 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8510.
-
Resolution: Won't Fix

I am resolving this per the comments in https://github.com/apache/spark/pull/8384 
and the guidance in http://spark.apache.org/contributing.html:

{quote}
there is a clear indication that there is not support or interest in acting on 
it, then resolve as Won’t Fix
{quote}

as I take those comments as a soft yes for not supporting this.

Please reopen the JIRA, and the PR too, if anyone feels I misunderstood.

> NumPy arrays and matrices as values in sequence files
> -
>
> Key: SPARK-8510
> URL: https://issues.apache.org/jira/browse/SPARK-8510
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Peter Aberline
>Priority: Minor
>
> Using the DoubleArrayWritable as an example, I have added support for storing 
> NumPy arrays and matrices as elements of Sequence Files.
> Each value element is a discrete matrix or array. This is useful where you 
> have many matrices that you don't want to join into a single Spark DataFrame 
> to store in a Parquet file.
> There seems to be demand for this functionality:
> http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E
> I originally put this work in PR 6995, but closed it after suggestions from a 
> user to use NumPy's built in serialization. My second version is in PR 8384.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19263) DAGScheduler should avoid sending conflicting task set.

2017-02-18 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19263.

   Resolution: Fixed
 Assignee: jin xing
Fix Version/s: 1.2.0

> DAGScheduler should avoid sending conflicting task set.
> ---
>
> Key: SPARK-19263
> URL: https://issues.apache.org/jira/browse/SPARK-19263
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
> Fix For: 1.2.0
>
>
> In the current *DAGScheduler handleTaskCompletion* code, when *event.reason* is 
> *Success*, it will first do *stage.pendingPartitions -= task.partitionId*, 
> which may be a bug when *FetchFailed* happens. Consider the following:
> # Stage 0 runs and generates shuffle output data.
> # Stage 1 reads the output from stage 0 and generates more shuffle data. It 
> has two tasks: ShuffleMapTask1 and ShuffleMapTask2, and these tasks are 
> launched on executorA.
> # ShuffleMapTask1 fails to fetch blocks locally and sends a FetchFailed to 
> the driver. The driver marks executorA as lost and updates failedEpoch;
> # The driver resubmits stage 0 so the missing output can be re-generated, and 
> then once it completes, resubmits stage 1 with ShuffleMapTask1x and 
> ShuffleMapTask2x.
> # ShuffleMapTask2 (from the original attempt of stage 1) successfully 
> finishes on executorA and sends Success back to driver. This causes 
> DAGScheduler::handleTaskCompletion to remove partition 2 from 
> stage.pendingPartitions (line 1149), but it does not add the partition to the 
> set of output locations (line 1192), because the task’s epoch is less than 
> the failure epoch for the executor (because of the earlier failure on 
> executor A)
> # ShuffleMapTask1x successfully finishes on executorB, causing the driver to 
> remove partition 1 from stage.pendingPartitions. Combined with the previous 
> step, this means that there are no more pending partitions for the stage, so 
> the DAGScheduler marks the stage as finished (line 1196). However, the 
> shuffle stage is not available (line 1215) because the completion for 
> ShuffleMapTask2 was ignored because of its epoch, so the DAGScheduler 
> resubmits the stage.
> # ShuffleMapTask2x is still running, so when TaskSchedulerImpl::submitTasks 
> is called for the re-submitted stage, it throws an error, because there’s an 
> existing active task set
> To reproduce the bug:
> 1. We need to make a modification in *ShuffleBlockFetcherIterator*: check 
> whether the task's index in *TaskSetManager* and the stage attempt are both 
> equal to 0; if so, throw FetchFailedException;
> 2. Rebuild Spark and then submit the following job:
> {code}
> val rdd = sc.parallelize(List((0, 1), (1, 1), (2, 1), (3, 1), (1, 2), (0, 
> 3), (2, 1), (3, 1)), 2)
> rdd.reduceByKey {
>   (v1, v2) => {
> Thread.sleep(1)
> v1 + v2
>   }
> }.map {
>   keyAndValue => {
> (keyAndValue._1 % 2, keyAndValue._2)
>   }
> }.reduceByKey {
>   (v1, v2) => {
> Thread.sleep(1)
> v1 + v2
>   }
> }.collect
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19655.
---
Resolution: Duplicate

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> When I run a "select count( * )" query via JDBC and monitor the queries on the 
> database side, I see Spark issue "select 1" against the destination table.
> That means one "1" per row, which is not optimized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema

2017-02-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873189#comment-15873189
 ] 

Hyukjin Kwon commented on SPARK-19615:
--

Let me leave some loosely related JIRAs: SPARK-9813, SPARK-9874 and SPARK-15918

> Provide Dataset union convenience for divergent schema
> --
>
> Key: SPARK-19615
> URL: https://issues.apache.org/jira/browse/SPARK-19615
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Creating a union DataFrame over two sources that have different schema 
> definitions is surprisingly complex. Provide a version of the union method 
> that will infer a target schema as the result of merging the sources, 
> automatically extending either side with {{null}} columns for any missing 
> columns that are nullable.
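
A rough sketch of the manual schema alignment this convenience would automate 
(assuming two DataFrames {{df1}} and {{df2}}, and that every column missing from 
one side is nullable):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// One consistent column order for both sides of the union.
val allCols = (df1.columns ++ df2.columns).distinct

def aligned(df: DataFrame, other: DataFrame): DataFrame =
  df.select(allCols.map { c =>
    if (df.columns.contains(c)) col(c)
    else lit(null).cast(other.schema(c).dataType).as(c)  // fill the missing column with nulls
  }: _*)

val unioned = aligned(df1, df2).union(aligned(df2, df1))
{code}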



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema

2017-02-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873188#comment-15873188
 ] 

Hyukjin Kwon commented on SPARK-19615:
--

I remember I checked the UNION operation in other DBMSes, and the current 
behaviour is correct and compliant. Could you maybe check and leave references 
to other DBMSes, please?

> Provide Dataset union convenience for divergent schema
> --
>
> Key: SPARK-19615
> URL: https://issues.apache.org/jira/browse/SPARK-19615
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Creating a union DataFrame over two sources that have different schema 
> definitions is surprisingly complex. Provide a version of the union method 
> that will infer a target schema as the result of merging the sources, 
> automatically extending either side with {{null}} columns for any missing 
> columns that are nullable.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873184#comment-15873184
 ] 

Hyukjin Kwon commented on SPARK-19655:
--

I guess the problematic line is 
https://github.com/apache/spark/blob/5857b9ac2d9808d9b89a5b29620b5052e2beebf5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L209-L213

This is a subset of SPARK-12449 because I guess we are currently pushing down 
columns and filters via 
https://github.com/apache/spark/blob/6a9a85b84decc2cbe1a0d8791118a0f91a62aa3f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L440-L512

I don't think we are going to solve this problem separately just for 
{{count(*)}}. Could we resolve this as a duplicate?


> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> When I run a "select count( * )" query via JDBC and monitor the queries on the 
> database side, I see Spark issue "select 1" against the destination table.
> That means one "1" per row, which is not optimized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19657) start-master.sh accidentally forces the use of a loopback address in master URL

2017-02-18 Thread George Hawkins (JIRA)
George Hawkins created SPARK-19657:
--

 Summary: start-master.sh accidentally forces the use of a loopback 
address in master URL
 Key: SPARK-19657
 URL: https://issues.apache.org/jira/browse/SPARK-19657
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.1.0
 Environment: Ubuntu 16.04
Reporter: George Hawkins


{{start-master.sh}} contains the line:

{noformat}
SPARK_MASTER_HOST="`hostname -f`"
{noformat}

{{\-f}} means get the FQDN - the assumption seems to be that this will always 
return a public IP address (note that if {{start-master.sh}} didn't force the 
hostname by specifying {{--host}} then the default behavior of {{Master}} is to 
sensibly default to a public IP).

I came across this when I started a master and it output:

{noformat}
17/02/16 23:03:32 INFO Master: Starting Spark master at spark://myhostname:7077
{noformat}

But my external slaves could not connect to this URL and I was mystified when 
on the master machine (with just one public IP address) the following both 
failed:

{noformat}
$ telnet 192.168.1.133 7077
$ telnet 127.0.0.1 7077
{noformat}

{{192.168.1.133}} was the machine's public IP address and {{Master}} seemed to 
be listening on neither the public IP address nor the loopback address. However 
the following worked:

{noformat}
$ telnet myhostname 7077
{noformat}

It turns out this is a quirk of Debian and Ubuntu systems - the hostname maps 
to a loopback address but not to the well known one {{127.0.0.1}}.

If you look in {{/etc/hosts}} you see:

{noformat}
127.0.0.1   localhost
127.0.1.1   myhostname
{noformat}

I looked at this many times before I noticed that it's not the same IP address 
on both lines (I never knew that the entire {{127.0.0.0/8}} address block is 
reserved for loopback purposes - see 
[localhost|https://en.wikipedia.org/wiki/Localhost] on Wikipedia).

Why do Debian and Ubuntu do this? It seems there was once a good, documented 
reason for it: the {{127.0.1.1}} line used to always map to an FQDN, i.e. you'd 
expect to see:

{noformat}
127.0.0.1   localhost
127.0.1.1   myhostname.some.domain
{noformat}

The Debian reference manual used to include the following section:

{quote}
Some software (e.g., GNOME) expects the system hostname to be resolvable to an 
IP address with a canonical fully qualified domain name. This is really 
improper because system hostnames and domain names are two very different 
things; but there you have it. In order to support that software, it is 
necessary to ensure that the system hostname can be resolved.
{quote}

However the [hostname resolution 
section|https://www.debian.org/doc/manuals/debian-reference/ch05.en.html#_the_hostname_resolution]
 in the current reference, while still mentioning issues with software like 
GNOME, no longer says that the {{127.0.1.1}} entry will be an FQDN.

In this [bug report|https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=719621] 
you can see them discussing the change in documentation, i.e. removing the 
statement that {{127.0.1.1}} always maps to an FQDN, but there's no explanation 
of the reason for the change (the stated original purpose of this entry in 
{{/etc/hosts}} seems to be lost by this change, so it seems odd not to explain 
it).

So while it may be uncommon in a real cluster setup for a Spark master not to 
have a static IP and an FQDN, this setup is probably quite likely for people 
getting started with Spark - i.e. starting the master on their personal machine 
running Ubuntu on a network that uses DHCP. And it's quite confusing to find 
that {{start-master.sh}} has started the master on an address that isn't 
externally accessible (and it isn't immediately obvious from the master URL 
that this is the case).

The simple solution seems to be not to specify the {{--host}} argument in 
{{start-master.sh}} unless {{$SPARK_MASTER_HOST}} is non-empty. In that case 
the Spark logic (working in the Java/Scala world, where it's far easier to 
query IP addresses, check whether they're loopback addresses, etc.) already 
works out a sensible default public IP address to use.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2017-02-18 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873159#comment-15873159
 ] 

Liang-Chi Hsieh commented on SPARK-19217:
-

Native casting of a UserDefinedType from/to non-UserDefinedType types is 
lacking in current Spark SQL. It seems to make sense to me, because in some 
cases we may need to cast a UserDefinedType to the native data types that 
Spark SQL functions support.

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to 
> be an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?
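
For reference, a minimal Scala sketch of the UDF workaround mentioned above (the 
DataFrame {{df}} and the {{features}} column name are assumptions):

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Convert an ML vector column into a plain array-of-double column via a UDF.
val vecToArray = udf { (v: Vector) => v.toArray }
val converted = df.withColumn("features", vecToArray(col("features")))
{code}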



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19550) Remove reflection, docs, build elements related to Java 7

2017-02-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873148#comment-15873148
 ] 

Apache Spark commented on SPARK-19550:
--

User 'lins05' has created a pull request for this issue:
https://github.com/apache/spark/pull/16984

> Remove reflection, docs, build elements related to Java 7
> -
>
> Key: SPARK-19550
> URL: https://issues.apache.org/jira/browse/SPARK-19550
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, Spark Core
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
> Fix For: 2.2.0
>
>
> - Move external/java8-tests tests into core, streaming, sql and remove
> - Remove MaxPermGen and related options
> - Fix some reflection / TODOs around Java 8+ methods
> - Update doc references to 1.7/1.8 differences
> - Remove Java 7/8 related build profiles
> - Update some plugins for better Java 8 compatibility
> - Fix a few Java-related warnings



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873140#comment-15873140
 ] 

hosein commented on SPARK-19655:


I think I should not use Spark for my case...


> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> When I run a "select count( * )" query via JDBC and monitor the queries on the 
> database side, I see Spark issue "select 1" against the destination table.
> That means one "1" per row, which is not optimized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873131#comment-15873131
 ] 

hosein commented on SPARK-19655:


If I want to count 100 million rows, are 100 million 1s returned over the 
network just for a count?

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> When I run a "select count( * )" query via JDBC and monitor the queries on the 
> database side, I see Spark issue "select 1" against the destination table.
> That means one "1" per row, which is not optimized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873130#comment-15873130
 ] 

Herman van Hovell commented on SPARK-19655:
---

We currently only push filters and columns down into the data source. Other 
things, like aggregation, are not supported.

What you are seeing is that Spark pushes the filter and a dummy field (count 
doesn't require anything else) into the data source; the aggregation is still 
done on the Spark side.
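
Since only filters and columns are pushed down, one workaround (a sketch, not 
from this ticket; the URL, table name and credentials are placeholders) is to 
hand the JDBC source a subquery so the database itself computes the count:

{code}
val counted = spark.read
  .format("jdbc")
  .option("url", "jdbc:vertica://host:5433/db")
  .option("dbtable", "(SELECT COUNT(*) AS cnt FROM myschema.mytable) t")
  .option("user", "dbuser")
  .option("password", "secret")
  .load()
counted.show()  // a single row with the count, computed inside the database
{code}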

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> When I run a "select count( * )" query via JDBC and monitor the queries on the 
> database side, I see Spark issue "select 1" against the destination table.
> That means one "1" per row, which is not optimized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19550) Remove reflection, docs, build elements related to Java 7

2017-02-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873125#comment-15873125
 ] 

Apache Spark commented on SPARK-19550:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16983

> Remove reflection, docs, build elements related to Java 7
> -
>
> Key: SPARK-19550
> URL: https://issues.apache.org/jira/browse/SPARK-19550
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Documentation, Spark Core
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
> Fix For: 2.2.0
>
>
> - Move external/java8-tests tests into core, streaming, sql and remove
> - Remove MaxPermGen and related options
> - Fix some reflection / TODOs around Java 8+ methods
> - Update doc references to 1.7/1.8 differences
> - Remove Java 7/8 related build profiles
> - Update some plugins for better Java 8 compatibility
> - Fix a few Java-related warnings



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873121#comment-15873121
 ] 

hosein commented on SPARK-19655:


I was surprised too :)
If you have a Vertica database you can test this piece of code and monitor the 
queries in Vertica; in my experience, "select 1" appeared.

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> When I run a "select count( * )" query via JDBC and monitor the queries on the 
> database side, I see Spark issue "select 1" against the destination table.
> That means one "1" per row, which is not optimized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873118#comment-15873118
 ] 

hosein commented on SPARK-19655:


How can I get the count result from my Vertica table? Is there any optimized 
way to do that?

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> When I run a "select count( * )" query via JDBC and monitor the queries on the 
> database side, I see Spark issue "select 1" against the destination table.
> That means one "1" per row, which is not optimized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873110#comment-15873110
 ] 

hosein edited comment on SPARK-19655 at 2/18/17 11:11 AM:
--

I connect to Vertica by JDBC and downloaded its driver from this link:
https://my.vertica.com/download/vertica/client-drivers/

I supposed that if I add the JDBC driver jar file to Spark and define the JDBC 
URL in my code, Spark would work with this driver ...


was (Author: hosein_ey):
I connect  to Vertica by JDBC and downloaded it's driver from this link:
https://my.vertica.com/download/vertica/client-drivers/

I supposed if I take Spark JDBC  jar file and define JDBC url in it, Spark 
works with this driver ...

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> When I run a "select count( * )" query via JDBC and monitor the queries on the 
> database side, I see Spark issue "select 1" against the destination table.
> That means one "1" per row, which is not optimized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873114#comment-15873114
 ] 

Sean Owen commented on SPARK-19655:
---

That's more the JDBC-Vertica integration than Spark-JDBC integration for 
Vertica. It may not be relevant.
It could be a real problem then, if it's not fully pushing down the count, but 
I guess I'd be surprised. I don't know that part of the code well enough and 
don't see where it would generate a "select 1" either.

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> When I run a "select count( * )" query via JDBC and monitor the queries on the 
> database side, I see Spark issue "select 1" against the destination table.
> That means one "1" per row, which is not optimized.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873110#comment-15873110
 ] 

hosein edited comment on SPARK-19655 at 2/18/17 11:09 AM:
--

I connect  to Vertica by JDBC and downloaded it's driver from this link:
https://my.vertica.com/download/vertica/client-drivers/

I supposed if I take Spark JDBC  jar file and define JDBC url in it, Spark 
works with this driver ...


was (Author: hosein_ey):
I connect  to Vertica by JDBC and downloaded it's driver from this link:
https://my.vertica.com/download/vertica/client-drivers/

I suppose if I take Spark JDBC  jar file and define JDBC url in it, Spark works 
with this driver ...

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873110#comment-15873110
 ] 

hosein edited comment on SPARK-19655 at 2/18/17 11:07 AM:
--

I connect  to Vertica by JDBC and downloaded it's driver from this link:
https://my.vertica.com/download/vertica/client-drivers/

I suppose if I take Spark JDBC  jar file and define JDBC url in it, Spark works 
with this driver ...


was (Author: hosein_ey):
I connect  to Vertica by JDBC and downloaded it's driver from this link:
https://my.vertica.com/download/vertica/client-drivers/

I suppose if I take spark JDBC  jar file and define JDBC url in it, spark works 
with this driver ...

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873110#comment-15873110
 ] 

hosein edited comment on SPARK-19655 at 2/18/17 11:06 AM:
--

I connect  to Vertica by JDBC and downloaded it's driver from this link:
https://my.vertica.com/download/vertica/client-drivers/

I suppose if I take spark JDBC  jar file and define JDBC url in it, spark works 
with this driver ...


was (Author: hosein_ey):
I connect  to Vertica by JDBC and downloaded it's driver from this link:
https://my.vertica.com/download/vertica/client-drivers/

I suppose if I take spark JDBC  jar file and define JDBC url in spark. spark 
works with this driver ...

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873110#comment-15873110
 ] 

hosein edited comment on SPARK-19655 at 2/18/17 11:06 AM:
--

I connect  to Vertica by JDBC and downloaded it's driver from this link:
https://my.vertica.com/download/vertica/client-drivers/

I suppose if I take spark JDBC  jar file and define JDBC url in spark. spark 
works with this driver ...


was (Author: hosein_ey):
I connect  to Vertica by JDBC and downloaded it's driver from this link:
https://my.vertica.com/download/vertica/client-drivers/


> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873110#comment-15873110
 ] 

hosein commented on SPARK-19655:


I connect  to Vertica by JDBC and downloaded it's driver from this link:
https://my.vertica.com/download/vertica/client-drivers/


> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19376) CLONE - CheckAnalysis rejects TPCDS query 32

2017-02-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-19376:
---

> CLONE - CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-19376
> URL: https://issues.apache.org/jira/browse/SPARK-19376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mostafa Shahdadi
>Assignee: Nattavut Sutyanyong
>Priority: Minor
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 
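
For readers unfamiliar with the rule being discussed: the shape of query the check is designed to reject is a correlated scalar subquery whose GROUP BY contains a non-correlated column, since such a subquery is not guaranteed to produce one value per outer row. The toy sketch below (invented tables, not TPCDS) should be rejected with an error like the one quoted above; the point of this report is that q32 does not actually have this shape and is being rejected anyway:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("scalar-subquery-check").getOrCreate()
spark.range(10).selectExpr("id AS a", "id % 3 AS c").createOrReplaceTempView("t1")
spark.range(10).selectExpr("id AS b", "id % 3 AS c", "id % 2 AS d").createOrReplaceTempView("t2")

// Grouping by the non-correlated column t2.d means the scalar subquery may
// yield several rows per outer row, so the analyzer is expected to fail this
// query in CheckAnalysis.
spark.sql("""
  SELECT * FROM t1
  WHERE t1.a > (SELECT avg(t2.b) FROM t2
                WHERE t2.c = t1.c
                GROUP BY t2.d)
""").show()
{code}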

[jira] [Closed] (SPARK-19376) CLONE - CheckAnalysis rejects TPCDS query 32

2017-02-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-19376.
-
  Resolution: Invalid
   Fix Version/s: (was: 2.1.0)
Target Version/s:   (was: 2.1.0)

> CLONE - CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-19376
> URL: https://issues.apache.org/jira/browse/SPARK-19376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mostafa Shahdadi
>Assignee: Nattavut Sutyanyong
>Priority: Minor
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Commented] (SPARK-19653) `Vector` Type Should Be A First-Class Citizen In Spark SQL

2017-02-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873103#comment-15873103
 ] 

Sean Owen commented on SPARK-19653:
---

Related to https://issues.apache.org/jira/browse/SPARK-19217

> `Vector` Type Should Be A First-Class Citizen In Spark SQL
> --
>
> Key: SPARK-19653
> URL: https://issues.apache.org/jira/browse/SPARK-19653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Mike Dusenberry
>
> *Issue*: The {{Vector}} type in Spark MLlib (DataFrame-based API, informally 
> "Spark ML") should be added as a first-class citizen to Spark SQL.
> *Current Status*:  Currently, Spark MLlib adds a [{{Vector}} SQL datatype | 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.linalg.SQLDataTypes$]
>  to allow DataFrames/DataSets to use {{Vector}} columns, which is necessary 
> for MLlib algorithms.  Although this allows a DataFrame/DataSet to contain 
> vectors, it does not allow one to make complete use of the rich set of 
> features made available by Spark SQL.  For example, it is not possible to use 
> any of the SQL functions, such as {{avg}}, {{sum}}, etc. on a {{Vector}} 
> column, nor is it possible to save a DataFrame with a {{Vector}} column as a 
> CSV file.  In any of these cases, an error message is returned with a note 
> that the operator is not supported on a {{Vector}} type.
> *Benefit*: Allow users to make use of all Spark SQL features that can be 
> reasonably applied to a vector.
> *Goal*:  Move the {{Vector}} type from Spark MLlib into Spark SQL as a 
> first-class citizen.
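
The limitation is easy to reproduce; the snippet below is only an illustration (data and names invented), showing that SQL aggregates reject the {{Vector}} UDT while unpacking the vector through a UDF works:

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, udf}

val spark = SparkSession.builder().master("local[*]").appName("vector-udt-demo").getOrCreate()
import spark.implicits._

val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 2.0)),
  (1, Vectors.dense(3.0, 4.0))
)).toDF("id", "features")

// df.agg(avg($"features")) fails: avg does not accept the Vector type,
// and writing the frame as CSV fails for the same reason.

// Unpacking the vector with a UDF yields types Spark SQL does understand:
val firstElem = udf((v: Vector) => v(0))
df.agg(avg(firstElem($"features"))).show()
{code}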



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873102#comment-15873102
 ] 

Sean Owen commented on SPARK-19655:
---

There are queries issued to test the existence of tables in some dialects, but 
they issue "SELECT 1 ... LIMIT 1".
There isn't Vertica support in Spark. Is this query possibly coming from a 
third-party library?
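
For context, that existence probe is a one-off query chosen per JDBC dialect at load time, not something issued per row. A rough sketch of the idea (simplified, with made-up names, not the actual Spark dialect API):

{code}
// Sketch only: a dialect picking a cheap "does this table exist?" query.
// This runs once when the DataFrame is defined and is unrelated to the
// per-row "SELECT 1" the reporter observed during count().
trait ExistenceProbe {
  def tableExistsQuery(table: String): String
}

object GenericProbe extends ExistenceProbe {
  // Safe fallback: returns zero rows, but fails fast if the table is missing.
  override def tableExistsQuery(table: String): String = s"SELECT * FROM $table WHERE 1=0"
}

object LimitProbe extends ExistenceProbe {
  // Dialects with LIMIT support can probe with a single constant row instead.
  override def tableExistsQuery(table: String): String = s"SELECT 1 FROM $table LIMIT 1"
}
{code}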

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873093#comment-15873093
 ] 

hosein edited comment on SPARK-19655 at 2/18/17 10:37 AM:
--

I have a Vertica database with 100 million rows and I run this code in Spark:

df = (spark.read.format("jdbc")
      .option("url", vertica_jdbc_url)
      .option("dbtable", "test_table")
      .option("user", "spark_user")
      .option("password", "password")
      .load())

result = df.filter(df['id'] > 100).count()

print result

I monitor the queries on the Vertica side, and the Spark code generates this query in Vertica:

SELECT 1 FROM test_table WHERE ("id" > 100)

This query returns about 100 million rows of "1", which I think is not optimal.









was (Author: hosein_ey):
I have a Vertica database with 100 million rows and I run this code in spark:

 df = spark.read.format("jdbc").option("url" , 
vertica_jdbc_url).option("dbtable", 'test_table')
   .option("user", "spark_user").option("password" , "password").load()

result = df.filter(df['id'] > 100).count()

print result

I monitor queries in Vertica and spark code generates this query in Vertica:

SELECT 1 FROM test_table WHERE ("int_id" > 100)

this query returns about 100 million "1" and I think this is not suitable








> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873093#comment-15873093
 ] 

hosein edited comment on SPARK-19655 at 2/18/17 10:36 AM:
--

I have a Vertica database with 100 million rows and I run this code in spark:

 df = spark.read.format("jdbc").option("url" , 
vertica_jdbc_url).option("dbtable", 'test_table')
   .option("user", "spark_user").option("password" , "password").load()

result = df.filter(df['id'] > 100).count()

print result

I monitor queries in Vertica and spark code generates this query in Vertica:

SELECT 1 FROM test_table WHERE ("int_id" > 100)

this query returns about 100 million "1" and I think this is not suitable









was (Author: hosein_ey):
I have a Vertica database with 100 million rows and I run this code in spark:

  
 df = spark.read.format("jdbc").option("url" , 
vertica_jdbc_url).option("dbtable", 'test_table')
   .option("user", "spark_user").option("password" , "password").load()

result = df.filter(df['id'] > 100).count()

print result
  

I monitor queries in Vertica and spark code generates this query in Vertica:

SELECT 1 FROM test_table WHERE ("int_id" > 100)

this query returns about 100 million "1" and I think this is not suitable








> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873093#comment-15873093
 ] 

hosein edited comment on SPARK-19655 at 2/18/17 10:36 AM:
--

I have a Vertica database with 100 million rows and I run this code in spark:

  
 df = spark.read.format("jdbc").option("url" , 
vertica_jdbc_url).option("dbtable", 'test_table')
   .option("user", "spark_user").option("password" , "password").load()

result = df.filter(df['id'] > 100).count()

print result
  

I monitor queries in Vertica and spark code generates this query in Vertica:

SELECT 1 FROM test_table WHERE ("int_id" > 100)

this query returns about 100 million "1" and I think this is not suitable









was (Author: hosein_ey):
I have a Vertica database with 100 million rows and I run this code in spark:
  
 df = spark.read.format("jdbc").option("url" , 
vertica_jdbc_url).option("dbtable", 'test_table')
   .option("user", "spark_user").option("password" , "password").load()
result = df.filter(df['id'] > 100).count()
print result


I monitor queries in Vertica and spark code generates this query in Vertica:

SELECT 1 FROM test_table WHERE ("int_id" > 100)

this query returns about 100 million "1" and I think this is not suitable








> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873093#comment-15873093
 ] 

hosein commented on SPARK-19655:


I have a Vertica database with 100 million rows and I run this code in spark:
  
 df = spark.read.format("jdbc").option("url" , 
vertica_jdbc_url).option("dbtable", 'test_table')
   .option("user", "spark_user").option("password" , "password").load()
result = df.filter(df['id'] > 100).count()
print result


I monitor queries in Vertica and spark code generates this query in Vertica:

SELECT 1 FROM test_table WHERE ("int_id" > 100)

this query returns about 100 million "1" and I think this is not suitable
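
The observed query is consistent with how the JDBC source assembles its pushed-down SQL: the filter is pushed down, but (as of 2.1) the count itself is not, so when no columns are required the source still returns one constant per matching row and Spark counts the rows on its side. A rough sketch of that query construction (simplified for illustration, not the actual Spark code):

{code}
// Simplified sketch of how a JDBC scan builds its SQL text. When the plan
// needs no columns (e.g. a plain count), the column list degenerates to the
// constant "1", which matches the query seen in Vertica's monitor.
def buildJdbcScanSql(table: String, requiredColumns: Seq[String], filters: Seq[String]): String = {
  val columnList = if (requiredColumns.isEmpty) "1" else requiredColumns.mkString(", ")
  val whereClause = if (filters.isEmpty) "" else filters.mkString(" WHERE ", " AND ", "")
  s"SELECT $columnList FROM $table$whereClause"
}

// buildJdbcScanSql("test_table", Nil, Seq("""("id" > 100)"""))
//   returns: SELECT 1 FROM test_table WHERE ("id" > 100)
{code}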








> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-02-18 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873090#comment-15873090
 ] 

Nira Amit commented on SPARK-19656:
---

I also tried to do this without writing my own `AvroKey` and 
`AvroKeyInputFormat`:

{code}
JavaPairRDD records =
sc.newAPIHadoopFile("file:/path/to/file.avro",
new AvroKeyInputFormat().getClass(), new 
AvroKey().getClass(), NullWritable.class,
sc.hadoopConfiguration());
{code}

Which I think should have worked but instead results in a compilation error:
{code}
Error:(263, 36) java: incompatible types: inferred type does not conform to 
equality constraint(s)
inferred: 
org.apache.avro.mapred.AvroKey
equality constraints(s): 
org.apache.avro.mapred.AvroKey,capture#1 
of ? extends org.apache.avro.mapred.AvroKey
{code}
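
For what it's worth, the inference failure comes from the wildcard in the {{getClass}} results ({{Class<? extends ...>}}), which cannot satisfy the exact type parameters {{newAPIHadoopFile}} expects; in Java the usual way around it is an unchecked cast of the raw class literals. The Scala shape below is only a sketch of the same call (it assumes {{MyCustomClass}} is the Avro-generated class from this report), and it only fixes the compile-time error — whether records come back as the specific class or as {{GenericData.Record}} is decided by the reader schema, not by these generics:

{code}
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext

// Class literals can carry the intended type arguments directly.
def readAvroKeys(sc: SparkContext, path: String) =
  sc.newAPIHadoopFile(
    path,
    classOf[AvroKeyInputFormat[MyCustomClass]],
    classOf[AvroKey[MyCustomClass]],
    classOf[NullWritable],
    sc.hadoopConfiguration)
{code}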


> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-02-18 Thread Nira Amit (JIRA)
Nira Amit created SPARK-19656:
-

 Summary: Can't load custom type from avro file to RDD with 
newAPIHadoopFile
 Key: SPARK-19656
 URL: https://issues.apache.org/jira/browse/SPARK-19656
 Project: Spark
  Issue Type: Question
  Components: Java API
Affects Versions: 2.0.2
Reporter: Nira Amit


If I understand correctly, in scala it's possible to load custom objects from 
avro files to RDDs this way:
{code}
ctx.hadoopFile("/path/to/the/avro/file.avro",
  classOf[AvroInputFormat[MyClassInAvroFile]],
  classOf[AvroWrapper[MyClassInAvroFile]],
  classOf[NullWritable])
{code}
I'm not a scala developer, so I tried to "translate" this to java as best I 
could. I created classes that extend AvroKey and FileInputFormat:
{code}
public static class MyCustomAvroKey extends AvroKey<MyCustomClass> {};

public static class MyCustomAvroReader
        extends AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
    // with my custom schema and all the required methods...
}

public static class MyCustomInputFormat
        extends FileInputFormat<MyCustomAvroKey, NullWritable> {

    @Override
    public RecordReader<MyCustomAvroKey, NullWritable>
            createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
            throws IOException, InterruptedException {
        return new MyCustomAvroReader();
    }
}
...
JavaPairRDD<MyCustomAvroKey, NullWritable> records =
        sc.newAPIHadoopFile("file:/path/to/datafile.avro",
                MyCustomInputFormat.class, MyCustomAvroKey.class,
                NullWritable.class,
                sc.hadoopConfiguration());
MyCustomClass first = records.first()._1.datum();
System.out.println("Got a result, some custom field: " + first.getSomeCustomField());
{code}
This compiles fine, but using a debugger I can see that `first._1.datum()` 
actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
instance.
And indeed, when the following line executes:
{code}
MyCustomClass first = records.first()._1.datum();
{code}
I get an exception:
{code}
java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot 
be cast to my.package.containing.MyCustomClass
{code}
Am I doing it wrong? Or is this not possible in Java?
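
A note on the runtime cast failure: with Avro, which concrete class {{datum()}} returns is decided by the datum reader (and reader schema) the record reader was built with, not by the Java generics, which are erased. The sketch below is not the reporter's code, just an illustration of the distinction (it assumes only Avro on the classpath):

{code}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.io.DatumReader
import org.apache.avro.specific.SpecificDatumReader

// A reader built as GenericDatumReader always produces GenericData.Record,
// so a later cast to the generated class fails exactly as in the stack trace
// above. A SpecificDatumReader built from the generated class's schema
// produces instances of that class instead.
def datumReaderFor[T](readerSchema: Schema, specific: Boolean): DatumReader[T] =
  if (specific) new SpecificDatumReader[T](readerSchema)
  else new GenericDatumReader[T](readerSchema)
{code}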



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873087#comment-15873087
 ] 

Sean Owen edited comment on SPARK-19655 at 2/18/17 10:14 AM:
-

What is the problem? selecting "1" _is_ an optimization, or at least, should 
make no difference at all. EDIT: I assume you mean count(1), but, if not, can 
you provide an example of which DB that affects? because of course that's not 
even an aggregate function then.


was (Author: srowen):
What is the problem? selecting "1" _is_ an optimization, or at least, should 
make no difference at all.

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873087#comment-15873087
 ] 

Sean Owen commented on SPARK-19655:
---

What is the problem? selecting "1" _is_ an optimization, or at least, should 
make no difference at all.

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19376) CLONE - CheckAnalysis rejects TPCDS query 32

2017-02-18 Thread Mostafa Shahdadi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mostafa Shahdadi updated SPARK-19376:
-
Priority: Minor  (was: Blocker)

> CLONE - CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-19376
> URL: https://issues.apache.org/jira/browse/SPARK-19376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mostafa Shahdadi
>Assignee: Nattavut Sutyanyong
>Priority: Minor
> Fix For: 2.1.0
>
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Closed] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed

2017-02-18 Thread Armin Braun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armin Braun closed SPARK-19592.
---
Resolution: Won't Fix

> Duplication in Test Configuration Relating to SparkConf Settings Should be 
> Removed
> --
>
> Key: SPARK-19592
> URL: https://issues.apache.org/jira/browse/SPARK-19592
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0, 2.2.0
> Environment: Applies to all Environments
>Reporter: Armin Braun
>Priority: Minor
>
> This configuration for Surefire and Scalatest is duplicated in the parent POM as 
> well as the SBT build.
> While this duplication cannot be removed in general, I think it can at least be 
> removed for all system properties that simply result in a SparkConf setting.
> Instead of having lines like 
> {code}
> <spark.ui.enabled>false</spark.ui.enabled>
> {code}
> twice in the pom.xml
> and once in SBT as
> {code}
> javaOptions in Test += "-Dspark.ui.enabled=false",
> {code}
> it would be a lot cleaner to simply have a 
> {code}
> var conf: SparkConf 
> {code}
> field in 
> {code}
> org.apache.spark.SparkFunSuite
> {code}
>  that has SparkConf set up with all the shared configuration that 
> `systemProperties` currently provide. Obviously this cannot be done straight 
> away given that
> many subclasses of the parent suite do this, so I think it would be best to 
> simply add a method to the parent that provides this configuration for now
> and start refactoring away duplication in other suite setups from there, step 
> by step, until the system properties can be removed from the pom.xml and the SBT build.
> This makes the build a lot easier to maintain and makes tests more readable 
> by making the environment setup more explicit in the code.
> (also it would allow running more tests straight from the IDE which is always 
> a nice thing imo)
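
As a rough illustration of the proposal (names and the exact settings are examples, not an existing Spark trait), the shared configuration could live in one place and be mixed into test suites:

{code}
import org.apache.spark.SparkConf

// Sketch of the shared hook the description asks for: one place owning the
// settings that are currently duplicated as -D system properties in pom.xml
// and in the SBT build. Suites would start from sharedTestConf instead of
// relying on the externally injected system properties.
trait SharedTestConf {
  protected def sharedTestConf: SparkConf = new SparkConf()
    .set("spark.ui.enabled", "false")               // mirrors -Dspark.ui.enabled=false
    .set("spark.ui.showConsoleProgress", "false")   // example of another shared test setting
}
{code}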



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hosein updated SPARK-19655:
---
Summary: select count(*) , requests 1 for each row  (was: select count(*) , 
requests 1 foreach row)

> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count(*) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19655) select count(*) , requests 1 for each row

2017-02-18 Thread hosein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hosein updated SPARK-19655:
---
Description: 
when I want query select count( * ) by JDBC and monitor queries in database 
side, I see spark requests: select 1 for destination table
it means 1 for each row and it is not optimized

  was:
when I want query select count(*) by JDBC and monitor queries in database side, 
I see spark requests: select 1 for destination table
it means 1 for each row and it is not optimized


> select count(*) , requests 1 for each row
> -
>
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: hosein
>Priority: Minor
>
> when I want query select count( * ) by JDBC and monitor queries in database 
> side, I see spark requests: select 1 for destination table
> it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19655) select count(*) , requests 1 foreach row

2017-02-18 Thread hosein (JIRA)
hosein created SPARK-19655:
--

 Summary: select count(*) , requests 1 foreach row
 Key: SPARK-19655
 URL: https://issues.apache.org/jira/browse/SPARK-19655
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: hosein
Priority: Minor


when I want query select count(*) by JDBC and monitor queries in database side, 
I see spark requests: select 1 for destination table
it means 1 for each row and it is not optimized



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org