[jira] [Assigned] (SPARK-16849) Improve subquery execution by deduplicating the subqueries with the same results

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16849:


Assignee: (was: Apache Spark)

> Improve subquery execution by deduplicating the subqueries with the same 
> results
> 
>
> Key: SPARK-16849
> URL: https://issues.apache.org/jira/browse/SPARK-16849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> The subqueries in Spark SQL will be run even if they have the same physical 
> plan and produce the same results. We should be able to deduplicate 
> subqueries that are referred to many times within a query.






[jira] [Assigned] (SPARK-16849) Improve subquery execution by deduplicating the subqueries with the same results

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16849:


Assignee: Apache Spark

> Improve subquery execution by deduplicating the subqueries with the same 
> results
> 
>
> Key: SPARK-16849
> URL: https://issues.apache.org/jira/browse/SPARK-16849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> The subqueries in Spark SQL will be run even if they have the same physical 
> plan and produce the same results. We should be able to deduplicate 
> subqueries that are referred to many times within a query.






[jira] [Commented] (SPARK-16849) Improve subquery execution by deduplicating the subqueries with the same results

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403394#comment-15403394
 ] 

Apache Spark commented on SPARK-16849:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/14452

> Improve subquery execution by deduplicating the subqueries with the same 
> results
> 
>
> Key: SPARK-16849
> URL: https://issues.apache.org/jira/browse/SPARK-16849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> The subqueries in Spark SQL will be run even if they have the same physical 
> plan and produce the same results. We should be able to deduplicate 
> subqueries that are referred to many times within a query.






[jira] [Created] (SPARK-16849) Improve subquery execution by deduplicating the subqueries with the same results

2016-08-01 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-16849:
---

 Summary: Improve subquery execution by deduplicating the 
subqueries with the same results
 Key: SPARK-16849
 URL: https://issues.apache.org/jira/browse/SPARK-16849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


The subqueries in Spark SQL will be run even if they have the same physical 
plan and produce the same results. We should be able to deduplicate subqueries 
that are referred to many times within a query.
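
A minimal sketch (illustrative only; the table names t1/t2 are made up) of a 
query where the same uncorrelated scalar subquery appears several times with 
an identical plan and could be computed once and reused:

{code}
// Sketch: the scalar subquery over t2 has the same physical plan in all three
// places, so deduplication would let it run once instead of three times.
val deduplicatable = spark.sql("""
  SELECT a,
         (SELECT max(b) FROM t2) AS max_b,
         (SELECT max(b) FROM t2) AS max_b_again
  FROM t1
  WHERE a > (SELECT max(b) FROM t2)
""")
{code}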






[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403380#comment-15403380
 ] 

Xiao Li commented on SPARK-16842:
-

For each table, we just need to issue one query, and that query returns an 
empty table. Normally, that is very cheap for most DBMSs.
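
For context, a sketch of the kind of metadata-only query this refers to 
(assumptions: the usual {{WHERE 1=0}} trick; the connection string and table 
name are illustrative):

{code}
// Sketch: run a query that returns zero rows and read the column metadata
// from the (empty) result set. Connection details are illustrative.
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
try {
  val rs = conn.createStatement().executeQuery("SELECT * FROM people WHERE 1=0")
  val md = rs.getMetaData
  for (i <- 1 to md.getColumnCount) {
    println(s"${md.getColumnName(i)}: ${md.getColumnTypeName(i)}")
  }
} finally {
  conn.close()
}
{code}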

> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema is different from 
> the inferred schema, it is handled differently by each data source:
> - For JSON and CSV
>   it is generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow such compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and 
> {{TimestampType}} for JSON and CSV, which arguably accept a permissive schema.
> In short, JSON and CSV do not have complete schema information written in the 
> data, whereas ORC and Parquet do.
> So we might have to just disallow a user-given schema for Parquet and ORC. 
> Actually, we can almost never give a different schema for ORC and Parquet, if 
> my understanding is correct.
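
As a concrete illustration of the motivation above (a sketch; the column names 
and path are made up), a user-given schema lets the CSV reader produce 
{{DateType}} and {{TimestampType}} columns directly:

{code}
// Sketch: supply the schema instead of relying on inference, so date and
// timestamp columns come back typed. Column names and the path are made up.
import org.apache.spark.sql.types._

val userSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("event_date", DateType),
  StructField("event_ts", TimestampType)
))

val events = spark.read
  .option("header", "true")
  .schema(userSchema)
  .csv("/tmp/events.csv")
{code}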






[jira] [Comment Edited] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403376#comment-15403376
 ] 

Hyukjin Kwon edited comment on SPARK-16842 at 8/2/16 5:21 AM:
--

hm.. don't we make another connection and then run a query to fetch metadata 
for reading the schema (separate from the query that fetches the data)? This 
might be as much overhead as touching a file.


was (Author: hyukjin.kwon):
hm.. don't we make a connection and then run a query to fetch metadata for 
reading the schema? This might be as much overhead as touching a file.

> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema is different from 
> the inferred schema, it is handled differently by each data source:
> - For JSON and CSV
>   it is generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow such compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and 
> {{TimestampType}} for JSON and CSV, which arguably accept a permissive schema.
> In short, JSON and CSV do not have complete schema information written in the 
> data, whereas ORC and Parquet do.
> So we might have to just disallow a user-given schema for Parquet and ORC. 
> Actually, we can almost never give a different schema for ORC and Parquet, if 
> my understanding is correct.






[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403376#comment-15403376
 ] 

Hyukjin Kwon commented on SPARK-16842:
--

hm.. don't we make a connection and then run a query to fetch metadata for 
reading the schema? This might be as much overhead as touching a file.

> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema is different from 
> the inferred schema, it is handled differently by each data source:
> - For JSON and CSV
>   it is generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow such compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and 
> {{TimestampType}} for JSON and CSV, which arguably accept a permissive schema.
> In short, JSON and CSV do not have complete schema information written in the 
> data, whereas ORC and Parquet do.
> So we might have to just disallow a user-given schema for Parquet and ORC. 
> Actually, we can almost never give a different schema for ORC and Parquet, if 
> my understanding is correct.






[jira] [Commented] (SPARK-16848) Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403364#comment-15403364
 ] 

Apache Spark commented on SPARK-16848:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14451

> Make jdbc() and read.format("jdbc") consistently throwing exception for 
> user-specified schema
> -
>
> Key: SPARK-16848
> URL: https://issues.apache.org/jira/browse/SPARK-16848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently,
> {code}
> spark.read.schema(StructType(Seq())).jdbc(...).show()
> {code}
> does not throw an exception, whereas
> {code}
> spark.read.schema(StructType(Seq())).option(...).format("jdbc").load().show()
> {code}
> does as below:
> {code}
> jdbc does not allow user-specified schemas.;
> org.apache.spark.sql.AnalysisException: jdbc does not allow user-specified 
> schemas.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at 
> org.apache.spark.sql.jdbc.JDBCSuite$$anonfun$17.apply$mcV$sp(JDBCSuite.scala:351)
> {code}
> It'd make sense to throw the exception consistently when the user specifies a 
> schema.






[jira] [Assigned] (SPARK-16848) Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16848:


Assignee: Apache Spark

> Make jdbc() and read.format("jdbc") consistently throwing exception for 
> user-specified schema
> -
>
> Key: SPARK-16848
> URL: https://issues.apache.org/jira/browse/SPARK-16848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently,
> {code}
> spark.read.schema(StructType(Seq())).jdbc(...).show()
> {code}
> does not throw an exception, whereas
> {code}
> spark.read.schema(StructType(Seq())).option(...).format("jdbc").load().show()
> {code}
> does as below:
> {code}
> jdbc does not allow user-specified schemas.;
> org.apache.spark.sql.AnalysisException: jdbc does not allow user-specified 
> schemas.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at 
> org.apache.spark.sql.jdbc.JDBCSuite$$anonfun$17.apply$mcV$sp(JDBCSuite.scala:351)
> {code}
> It'd make sense to throw the exception consistently when the user specifies a 
> schema.






[jira] [Assigned] (SPARK-16848) Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16848:


Assignee: (was: Apache Spark)

> Make jdbc() and read.format("jdbc") consistently throwing exception for 
> user-specified schema
> -
>
> Key: SPARK-16848
> URL: https://issues.apache.org/jira/browse/SPARK-16848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently,
> {code}
> spark.read.schema(StructType(Seq())).jdbc(...).show()
> {code}
> does not throw an exception, whereas
> {code}
> spark.read.schema(StructType(Seq())).option(...).format("jdbc").load().show()
> {code}
> does as below:
> {code}
> jdbc does not allow user-specified schemas.;
> org.apache.spark.sql.AnalysisException: jdbc does not allow user-specified 
> schemas.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at 
> org.apache.spark.sql.jdbc.JDBCSuite$$anonfun$17.apply$mcV$sp(JDBCSuite.scala:351)
> {code}
> It'd make sense to throw the exception consistently when the user specifies a 
> schema.






[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403358#comment-15403358
 ] 

Xiao Li commented on SPARK-16842:
-

I heard of a case: at one big Internet company, their use case could generate 
many small Parquet files. They complained that the performance is slow.

> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema is different from 
> the inferred schema, it is handled differently by each data source:
> - For JSON and CSV
>   it is generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow such compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and 
> {{TimestampType}} for JSON and CSV, which arguably accept a permissive schema.
> In short, JSON and CSV do not have complete schema information written in the 
> data, whereas ORC and Parquet do.
> So we might have to just disallow a user-given schema for Parquet and ORC. 
> Actually, we can almost never give a different schema for ORC and Parquet, if 
> my understanding is correct.






[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403354#comment-15403354
 ] 

Xiao Li commented on SPARK-16842:
-

The overhead of schema parsing in JDBC is small, right?

> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema is different from 
> the inferred schema, it is handled differently by each data source:
> - For JSON and CSV
>   it is generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow such compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and 
> {{TimestampType}} for JSON and CSV, which arguably accept a permissive schema.
> In short, JSON and CSV do not have complete schema information written in the 
> data, whereas ORC and Parquet do.
> So we might have to just disallow a user-given schema for Parquet and ORC. 
> Actually, we can almost never give a different schema for ORC and Parquet, if 
> my understanding is correct.






[jira] [Created] (SPARK-16848) Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema

2016-08-01 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-16848:


 Summary: Make jdbc() and read.format("jdbc") consistently throwing 
exception for user-specified schema
 Key: SPARK-16848
 URL: https://issues.apache.org/jira/browse/SPARK-16848
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Trivial


Currently,

{code}
spark.read.schema(StructType(Seq())).jdbc(...).show()
{code}

does not throw an exception, whereas

{code}
spark.read.schema(StructType(Seq())).option(...).format("jdbc").load().show()
{code}

does as below:

{code}
jdbc does not allow user-specified schemas.;
org.apache.spark.sql.AnalysisException: jdbc does not allow user-specified 
schemas.;
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at 
org.apache.spark.sql.jdbc.JDBCSuite$$anonfun$17.apply$mcV$sp(JDBCSuite.scala:351)
{code}

It'd make sense to throw the exception consistently when the user specifies a 
schema.
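
A sketch of the desired, consistent behavior (an assumption about the intended 
outcome, not the actual patch; the URL and table name are illustrative):

{code}
import java.util.Properties
import org.apache.spark.sql.types.StructType

val emptySchema = StructType(Seq())

// Should also throw org.apache.spark.sql.AnalysisException, matching the
// format("jdbc") path below:
spark.read.schema(emptySchema).jdbc("jdbc:h2:mem:testdb", "TEST.PEOPLE", new Properties()).show()

// Already throws: "jdbc does not allow user-specified schemas."
spark.read.schema(emptySchema)
  .format("jdbc")
  .option("url", "jdbc:h2:mem:testdb")
  .option("dbtable", "TEST.PEOPLE")
  .load()
  .show()
{code}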






[jira] [Assigned] (SPARK-16847) Prevent to potentially read corrupt statstics on binary in Parquet via VectorizedReader

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16847:


Assignee: Apache Spark

> Prevent to potentially read corrupt statstics on binary in Parquet via 
> VectorizedReader
> ---
>
> Key: SPARK-16847
> URL: https://issues.apache.org/jira/browse/SPARK-16847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> It is still possible to read corrupt Parquet statistics.
> This problem was found in PARQUET-251, and we previously disabled filter 
> pushdown on binary columns in Spark.
> We re-enabled it after upgrading Parquet, but it seems there is a potential 
> incompatibility for Parquet files written by older Spark versions.






[jira] [Assigned] (SPARK-16847) Prevent to potentially read corrupt statstics on binary in Parquet via VectorizedReader

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16847:


Assignee: (was: Apache Spark)

> Prevent to potentially read corrupt statstics on binary in Parquet via 
> VectorizedReader
> ---
>
> Key: SPARK-16847
> URL: https://issues.apache.org/jira/browse/SPARK-16847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It is still possible to read corrupt Parquet statistics.
> This problem was found in PARQUET-251, and we previously disabled filter 
> pushdown on binary columns in Spark.
> We re-enabled it after upgrading Parquet, but it seems there is a potential 
> incompatibility for Parquet files written by older Spark versions.






[jira] [Commented] (SPARK-16847) Prevent to potentially read corrupt statstics on binary in Parquet via VectorizedReader

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403337#comment-15403337
 ] 

Apache Spark commented on SPARK-16847:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14450

> Prevent to potentially read corrupt statstics on binary in Parquet via 
> VectorizedReader
> ---
>
> Key: SPARK-16847
> URL: https://issues.apache.org/jira/browse/SPARK-16847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It is still possible to read corrupt Parquet statistics.
> This problem was found in PARQUET-251, and we previously disabled filter 
> pushdown on binary columns in Spark.
> We re-enabled it after upgrading Parquet, but it seems there is a potential 
> incompatibility for Parquet files written by older Spark versions.






[jira] [Assigned] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16843:


Assignee: (was: Apache Spark)

> Select features according to a percentile of the highest scores of 
> ChiSqSelector
> 
>
> Key: SPARK-16843
> URL: https://issues.apache.org/jira/browse/SPARK-16843
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Peng Meng
>Priority: Minor
> Fix For: 2.1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It would be handy to add a percentile Param to ChiSqSelector, as in the 
> scikit-learn one: 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html
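
A hypothetical usage sketch of what such a Param could look like (the 
{{selectorType}}/{{percentile}} setters below are assumptions mirroring 
scikit-learn's {{SelectPercentile}}, not an existing API):

{code}
// Hypothetical sketch only: setSelectorType/setPercentile are the proposed
// additions; the other setters already exist on ml.feature.ChiSqSelector.
import org.apache.spark.ml.feature.ChiSqSelector

val selector = new ChiSqSelector()
  .setSelectorType("percentile")   // assumed new Param
  .setPercentile(0.1)              // assumed new Param: keep the top 10% by chi-squared score
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")
{code}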






[jira] [Commented] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1540#comment-1540
 ] 

Apache Spark commented on SPARK-16843:
--

User 'mpjlu' has created a pull request for this issue:
https://github.com/apache/spark/pull/14449

> Select features according to a percentile of the highest scores of 
> ChiSqSelector
> 
>
> Key: SPARK-16843
> URL: https://issues.apache.org/jira/browse/SPARK-16843
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Peng Meng
>Priority: Minor
> Fix For: 2.1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It would be handy to add a percentile Param to ChiSqSelector, as in the 
> scikit-learn one: 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html






[jira] [Assigned] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16843:


Assignee: Apache Spark

> Select features according to a percentile of the highest scores of 
> ChiSqSelector
> 
>
> Key: SPARK-16843
> URL: https://issues.apache.org/jira/browse/SPARK-16843
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Peng Meng
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It would be handy to add a percentile Param to ChiSqSelector, as in the 
> scikit-learn one: 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html






[jira] [Updated] (SPARK-16847) Prevent to potentially read corrupt statstics on binary in Parquet via VectorizedReader

2016-08-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-16847:
-
Summary: Prevent to potentially read corrupt statstics on binary in Parquet 
via VectorizedReader  (was: Do not read Parquet corrupt statstics on binary via 
VectorizedReader when it is corrupt)

> Prevent to potentially read corrupt statstics on binary in Parquet via 
> VectorizedReader
> ---
>
> Key: SPARK-16847
> URL: https://issues.apache.org/jira/browse/SPARK-16847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It is still possible to read corrupt Parquet statistics.
> This problem was found in PARQUET-251, and we previously disabled filter 
> pushdown on binary columns in Spark.
> We re-enabled it after upgrading Parquet, but it seems there is a potential 
> incompatibility for Parquet files written by older Spark versions.






[jira] [Updated] (SPARK-16847) Do not read Parquet corrupt statstics on binary via VectorizedReader when it is corrupt

2016-08-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-16847:
-
Summary: Do not read Parquet corrupt statstics on binary via 
VectorizedReader when it is corrupt  (was: Do not read Parquet corrupt 
statstics on binary )

> Do not read Parquet corrupt statstics on binary via VectorizedReader when it 
> is corrupt
> ---
>
> Key: SPARK-16847
> URL: https://issues.apache.org/jira/browse/SPARK-16847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It is still possible to read corrupt Parquet statistics.
> This problem was found in PARQUET-251, and we previously disabled filter 
> pushdown on binary columns in Spark.
> We re-enabled it after upgrading Parquet, but it seems there is a potential 
> incompatibility for Parquet files written by older Spark versions.






[jira] [Created] (SPARK-16847) Do not read Parquet corrupt statstics on binary

2016-08-01 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-16847:


 Summary: Do not read Parquet corrupt statstics on binary 
 Key: SPARK-16847
 URL: https://issues.apache.org/jira/browse/SPARK-16847
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Minor


It is still possible to read corrupt Parquet statistics.

This problem was found in PARQUET-251, and we previously disabled filter 
pushdown on binary columns in Spark.

We re-enabled it after upgrading Parquet, but it seems there is a potential 
incompatibility for Parquet files written by older Spark versions.
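
Until this is addressed, one possible mitigation (a workaround sketch, 
assuming the existing {{spark.sql.parquet.filterPushdown}} setting; not the 
eventual fix) is to keep Parquet filter pushdown off so the possibly corrupt 
binary min/max statistics are never consulted:

{code}
// Workaround sketch: disable Parquet filter pushdown so binary column
// statistics written by older writers are not used for row-group skipping.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
val df = spark.read.parquet("/path/to/old-spark-output")  // illustrative path
{code}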







[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403315#comment-15403315
 ] 

Hyukjin Kwon commented on SPARK-16842:
--

If we don't support schema compatibility but should support a user-specified 
schema (e.g., throwing an exception at execution time if the schema is wrong), 
we could let JDBC accept a schema as well, because there is also an overhead 
to parsing the schema for JDBC.

> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema is different from 
> the inferred schema, it is handled differently by each data source:
> - For JSON and CSV
>   it is generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow such compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and 
> {{TimestampType}} for JSON and CSV, which arguably accept a permissive schema.
> In short, JSON and CSV do not have complete schema information written in the 
> data, whereas ORC and Parquet do.
> So we might have to just disallow a user-given schema for Parquet and ORC. 
> Actually, we can almost never give a different schema for ORC and Parquet, if 
> my understanding is correct.






[jira] [Comment Edited] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403308#comment-15403308
 ] 

Hyukjin Kwon edited comment on SPARK-16842 at 8/2/16 3:56 AM:
--

Thanks for your feedback. Yeah, but I think it might not be very 
time-consuming (though it still is) since we will touch a single file (for 
Parquet and ORC) in most cases, whereas JSON and CSV need an entire scan, as 
you already know.

Also, this overhead in the case of Parquet and ORC would be almost constant 
regardless of the number of files or their size.

I am personally supportive of allowing schema compatibility (and I did open a 
PR), but I saw some opinions and comments which *I assume* imply not 
supporting schema compatibility.

In that case, this one might be an option.



was (Author: hyukjin.kwon):
Thanks for your feedback. Yeah, but I think it might not be very 
time-consuming (though it still is) since we will touch a single file (for 
Parquet and ORC) in most cases, whereas JSON and CSV need an entire scan, as 
you already know.

Also, this overhead in the case of Parquet and ORC would be almost constant 
regardless of the number of files or their size.

I am personally supportive of allowing schema compatibility (and I did open a 
PR), but I saw some opinions and comments which *I assume* infer not 
supporting schema compatibility.

In that case, this one might be an option.


> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema is different from 
> the inferred schema, it is handled differently by each data source:
> - For JSON and CSV
>   it is generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow such compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and 
> {{TimestampType}} for JSON and CSV, which arguably accept a permissive schema.
> In short, JSON and CSV do not have complete schema information written in the 
> data, whereas ORC and Parquet do.
> So we might have to just disallow a user-given schema for Parquet and ORC. 
> Actually, we can almost never give a different schema for ORC and Parquet, if 
> my understanding is correct.






[jira] [Comment Edited] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403308#comment-15403308
 ] 

Hyukjin Kwon edited comment on SPARK-16842 at 8/2/16 3:56 AM:
--

Thanks for your feedback. Yeah, but I think it might not be very 
time-consuming (though it still is) since we will touch a single file (for 
Parquet and ORC) in most cases, whereas JSON and CSV need an entire scan, as 
you already know.

Also, this overhead in the case of Parquet and ORC would be almost constant 
regardless of the number of files or their size.

I am personally supportive of allowing schema compatibility (and I did open a 
PR), but I saw some opinions and comments which *I assume* infer not 
supporting schema compatibility.

In that case, this one might be an option.



was (Author: hyukjin.kwon):
Thanks for your feedback. Yeah, but I think it might not be very 
time-consuming (though it still is) since we will touch a single file (for 
Parquet and ORC) in most cases, whereas JSON and CSV need an entire scan, as 
you already know.

So, this overhead in the case of Parquet and ORC would be almost constant 
regardless of the number of files or their size.

I am personally supportive of allowing schema compatibility (and I did open a 
PR), but I saw some opinions and comments which *I assume* infer not 
supporting schema compatibility.

In that case, this one might be an option.


> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema is different from 
> the inferred schema, it is handled differently by each data source:
> - For JSON and CSV
>   it is generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow such compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and 
> {{TimestampType}} for JSON and CSV, which arguably accept a permissive schema.
> In short, JSON and CSV do not have complete schema information written in the 
> data, whereas ORC and Parquet do.
> So we might have to just disallow a user-given schema for Parquet and ORC. 
> Actually, we can almost never give a different schema for ORC and Parquet, if 
> my understanding is correct.






[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403308#comment-15403308
 ] 

Hyukjin Kwon commented on SPARK-16842:
--

Thanks for your feedback. Yeah, but I think it might not be very 
time-consuming (though it still is) since we will touch a single file (for 
Parquet and ORC) in most cases, whereas JSON and CSV need an entire scan, as 
you already know.

So, this overhead in the case of Parquet and ORC would be almost constant 
regardless of the number of files or their size.

I am personally supportive of allowing schema compatibility (and I did open a 
PR), but I saw some opinions and comments which *I assume* infer not 
supporting schema compatibility.

In that case, this one might be an option.


> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema is different from 
> the inferred schema, it is handled differently by each data source:
> - For JSON and CSV
>   it is generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow such compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and 
> {{TimestampType}} for JSON and CSV, which arguably accept a permissive schema.
> In short, JSON and CSV do not have complete schema information written in the 
> data, whereas ORC and Parquet do.
> So we might have to just disallow a user-given schema for Parquet and ORC. 
> Actually, we can almost never give a different schema for ORC and Parquet, if 
> my understanding is correct.






[jira] [Issue Comment Deleted] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-16842:
-
Comment: was deleted

(was: Thanks for your feedback. Yeah, but I think it might not be very 
time-consuming (though it still is) since we will touch a single file (for 
Parquet and ORC) in most cases, whereas JSON and CSV need an entire scan, as 
you already know.

So, this overhead in the case of Parquet and ORC would be almost constant 
regardless of the number of files or their size.

I am personally supportive of allowing schema compatibility (and I did open a 
PR), but I saw some opinions and comments which *I assume* infer not 
supporting schema compatibility.

In that case, this one might be an option.
)

> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema is different from 
> the inferred schema, it is handled differently by each data source:
> - For JSON and CSV
>   it is generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow such compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and 
> {{TimestampType}} for JSON and CSV, which arguably accept a permissive schema.
> In short, JSON and CSV do not have complete schema information written in the 
> data, whereas ORC and Parquet do.
> So we might have to just disallow a user-given schema for Parquet and ORC. 
> Actually, we can almost never give a different schema for ORC and Parquet, if 
> my understanding is correct.






[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403309#comment-15403309
 ] 

Hyukjin Kwon commented on SPARK-16842:
--

Thanks for your feedback. Yeah, but I think it might not be very 
time-consuming (though it still is) since we will touch a single file (for 
Parquet and ORC) in most cases, whereas JSON and CSV need an entire scan, as 
you already know.

So, this overhead in the case of Parquet and ORC would be almost constant 
regardless of the number of files or their size.

I am personally supportive of allowing schema compatibility (and I did open a 
PR), but I saw some opinions and comments which *I assume* infer not 
supporting schema compatibility.

In that case, this one might be an option.


> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema is different from 
> the inferred schema, it is handled differently by each data source:
> - For JSON and CSV
>   it is generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow such compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and 
> {{TimestampType}} for JSON and CSV, which arguably accept a permissive schema.
> In short, JSON and CSV do not have complete schema information written in the 
> data, whereas ORC and Parquet do.
> So we might have to just disallow a user-given schema for Parquet and ORC. 
> Actually, we can almost never give a different schema for ORC and Parquet, if 
> my understanding is correct.






[jira] [Closed] (SPARK-15939) Clarify ml.linalg usage

2016-08-01 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng closed SPARK-15939.

Resolution: Not A Problem

> Clarify ml.linalg usage
> ---
>
> Key: SPARK-15939
> URL: https://issues.apache.org/jira/browse/SPARK-15939
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Trivial
>
> 1. Update the comments in {{pyspark.ml}} to note that it uses {{ml.linalg}}, 
> not {{mllib.linalg}}.
> 2. Rename {{MLlibTestCase}} to {{MLTestCase}} in {{ml.tests.py}}.






[jira] [Updated] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector

2016-08-01 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-16843:
--
Fix Version/s: (was: 2.0.1)
   2.1.0

> Select features according to a percentile of the highest scores of 
> ChiSqSelector
> 
>
> Key: SPARK-16843
> URL: https://issues.apache.org/jira/browse/SPARK-16843
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Peng Meng
>Priority: Minor
> Fix For: 2.1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It would be handy to add a percentile Param to ChiSqSelector, as in the 
> scikit-learn one: 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html






[jira] [Updated] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector

2016-08-01 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-16843:
--
Target Version/s:   (was: 2.0.1)

> Select features according to a percentile of the highest scores of 
> ChiSqSelector
> 
>
> Key: SPARK-16843
> URL: https://issues.apache.org/jira/browse/SPARK-16843
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Peng Meng
> Fix For: 2.0.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It would be handy to add a percentile Param to ChiSqSelector, as in the 
> scikit-learn one: 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html






[jira] [Updated] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector

2016-08-01 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-16843:
--
Priority: Minor  (was: Major)

> Select features according to a percentile of the highest scores of 
> ChiSqSelector
> 
>
> Key: SPARK-16843
> URL: https://issues.apache.org/jira/browse/SPARK-16843
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Peng Meng
>Priority: Minor
> Fix For: 2.0.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It would be handy to add a percentile Param to ChiSqSelector, as in the 
> scikit-learn one: 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html






[jira] [Updated] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector

2016-08-01 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-16843:
--
Affects Version/s: (was: 2.0.0)
   2.1.0

> Select features according to a percentile of the highest scores of 
> ChiSqSelector
> 
>
> Key: SPARK-16843
> URL: https://issues.apache.org/jira/browse/SPARK-16843
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Peng Meng
>Priority: Minor
> Fix For: 2.0.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It would be handy to add a percentile Param to ChiSqSelector, as in the 
> scikit-learn one: 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html






[jira] [Created] (SPARK-16846) read.csv() option: "inferSchema" don't work

2016-08-01 Thread hejie (JIRA)
hejie created SPARK-16846:
-

 Summary: read.csv()  option:  "inferSchema" don't work
 Key: SPARK-16846
 URL: https://issues.apache.org/jira/browse/SPARK-16846
 Project: Spark
  Issue Type: Bug
Reporter: hejie


I use the code below to read a file and get a DataFrame. When the column count 
is only 20, the inferSchema parameter works well. But when the count is up to 
400, it doesn't work, and I have to specify the schema manually. The code is:


 val df = 
spark.read.schema(schema).options(Map("header"->"true","quote"->",","inferSchema"->"true")).csv("/Users/ss/Documents/traindata/traindataAllNumber.csv")
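
A minimal sketch of reading with inference alone, under the assumption that 
{{inferSchema}} only takes effect when no explicit schema is supplied (so 
passing {{.schema(schema)}} as above makes Spark use that schema and skip 
inference):

{code}
// Sketch: let Spark infer the column types instead of supplying a schema.
// Assumption: inferSchema is only honored when .schema(...) is not set.
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/Users/ss/Documents/traindata/traindataAllNumber.csv")
inferred.printSchema()
{code}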







[jira] [Updated] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions

2016-08-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16818:

Fix Version/s: 2.0.1

> Exchange reuse incorrectly reuses scans over different sets of partitions
> -
>
> Key: SPARK-16818
> URL: https://issues.apache.org/jira/browse/SPARK-16818
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> This happens because the file scan operator does not take into account 
> partition pruning in its implementation of `sameResult()`. As a result, 
> executions may be incorrect on self-joins over the same base file relation. 
> Here's a minimal test case to reproduce:
> {code}
> spark.conf.set("spark.sql.exchange.reuse", true)  // defaults to true in 
> 2.0
> withTempPath { path =>
>   val tempDir = path.getCanonicalPath
>   spark.range(10)
> .selectExpr("id % 2 as a", "id % 3 as b", "id as c")
> .write
> .partitionBy("a")
> .parquet(tempDir)
>   val df = spark.read.parquet(tempDir)
>   val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum")
>   val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum")
>   checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, 
> 10, 5) :: Nil)
> }
> {code}
> When exchange reuse is on, the result is
> {code}
> +---+--+--+
> |  b|sum(c)|sum(c)|
> +---+--+--+
> |  0| 6| 6|
> |  1| 4| 4|
> |  2|10|10|
> +---+--+--+
> {code}
> The correct result is
> {code}
> +---+--+--+
> |  b|sum(c)|sum(c)|
> +---+--+--+
> |  0| 6|12|
> |  1| 4| 8|
> |  2|10| 5|
> +---+--+--+
> {code}






[jira] [Commented] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-08-01 Thread Sylvain Zimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403262#comment-15403262
 ] 

Sylvain Zimmer commented on SPARK-16826:


[~srowen] what about this? 
https://github.com/sylvinus/spark/commit/98119a08368b1cd1faf3f25a32910ad6717c5c02

The tests seem to pass, and I don't think it uses the problematic code paths in 
java.net.URL (except for getFile, but that could probably be fixed easily).
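
A small sketch of the contention and of the direction the linked commit takes 
(an assumption based on the stack trace, not the committed code):

{code}
// Sketch: constructing java.net.URL goes through URL.getURLStreamHandler,
// which reads a synchronized java.util.Hashtable and serializes threads,
// whereas java.net.URI parses purely in memory with no shared lock.
import java.net.{URI, URL}

val raw = "https://spark.apache.org/docs/latest/index.html"

val hostViaUrl = new URL(raw).getHost  // contended path under many threads
val hostViaUri = new URI(raw).getHost  // lock-free alternative for host extraction
{code}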

> java.util.Hashtable limits the throughput of PARSE_URL()
> 
>
> Key: SPARK-16826
> URL: https://issues.apache.org/jira/browse/SPARK-16826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.(URL.java:599)
> java.net.URL.(URL.java:490)
> java.net.URL.(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16579) Add a spark install function

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403256#comment-15403256
 ] 

Apache Spark commented on SPARK-16579:
--

User 'junyangq' has created a pull request for this issue:
https://github.com/apache/spark/pull/14448

> Add a spark install function
> 
>
> Key: SPARK-16579
> URL: https://issues.apache.org/jira/browse/SPARK-16579
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Junyang Qian
>
> As described in the design doc we need to introduce a function to install 
> Spark in case the user directly downloads SparkR from CRAN.
> To do that we can introduce a install_spark function that takes in the 
> following arguments
> {code}
> hadoop_version
> url_to_use # defaults to apache
> local_dir # defaults to a cache dir
> {code} 
> Furthermore, I think we can automatically run this from sparkR.init if we 
> find Spark home and the JARs missing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403255#comment-15403255
 ] 

Xiao Li commented on SPARK-16842:
-

When users specify the schema, we do not need to discover the schema, right? 
Schema discovery could be very time consuming. Thus, IMO, it is still 
reasonable to let users specify the schema. 
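
For example, a minimal sketch of the case being argued for (column names and 
path are hypothetical):

{code}
// Minimal sketch: supply the schema up front so the reader can skip schema
// discovery across many files. Column names and the path are made up.
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("event_time", TimestampType),
  StructField("payload", StringType)))
val df = spark.read.schema(schema).parquet("/data/events")
{code}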

> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct,
> If the user-given schema is different with the inferred schema, it is handled 
> differently for each datasource.
> - For JSON and CSV
>   it is kind of permissive generally (for example, compatibility among 
> numeric types).
> - For ORC and Parquet
>   Generally it is strict to types. So they don't allow the compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278)
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing the user-given schema, we can use some types such as {{DateType}} 
> and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably 
> permissive schema.
> To cut this short, JSON and CSV do not have the complete schema information 
> written in the data whereas Orc and Parquet do. 
> So, we might have to just disallow giving user-given schema for Parquet and 
> Orc. Actually, we can't give a different schema for Orc and Parquet almost at 
> all times if my understanding it correct. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14559) Netty RPC didn't check channel is active before sending message

2016-08-01 Thread Tao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403247#comment-15403247
 ] 

Tao Wang commented on SPARK-14559:
--

Hi [~zsxwing], sadly the application has ended now, so I can't get the thread 
info :(

But I can be sure the AM was fine at the time (even though both attempts 
eventually failed because too many executors failed).

Another point: after AM attempt 1 failed, attempt 2 started at 11:30, but the 
RegisterClusterManager message was not handled by the driver until around 18:30.

The dispatcher thread that handles the RegisterClusterManager message in the 
thread pool (40 threads in total) was busy the whole time, while some other 
threads were idle.

So we suspect the message-dispatching logic has a corner case we need to cover.

This is all we can get from the logs. If you need other information, I will try 
to find it in the logs, which are all we have :(

> Netty RPC didn't check channel is active before sending message
> ---
>
> Key: SPARK-14559
> URL: https://issues.apache.org/jira/browse/SPARK-14559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1
> Environment: spark1.6.1 hadoop2.2.0 jdk1.8.0_65
>Reporter: cen yuhai
>
> I have a long-running service. After running for several hours, it threw 
> these exceptions. I found that before sending an RPC request by calling the sendRpc 
> method in TransportClient, there is no check whether the channel is 
> still open or active.
> java.nio.channels.ClosedChannelException
>  4865 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 5635696155204230556 to 
> bigdata-arch-hdp407.bh.diditaxi.com/10.234.23.107:55197: java.nio.
>   channels.ClosedChannelException
>  4866 java.nio.channels.ClosedChannelException
>  4867 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 7319486003318455703 to 
> bigdata-arch-hdp1235.bh.diditaxi.com/10.168.145.239:36439: java.nio.
>   channels.ClosedChannelException
>  4868 java.nio.channels.ClosedChannelException
>  4869 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 9041854451893215954 to 
> bigdata-arch-hdp1398.bh.diditaxi.com/10.248.117.216:26801: java.nio.
>   channels.ClosedChannelException
>  4870 java.nio.channels.ClosedChannelException
>  4871 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 6046473497871624501 to 
> bigdata-arch-hdp948.bh.diditaxi.com/10.118.114.81:41903: java.nio.  
>   channels.ClosedChannelException
>  4872 java.nio.channels.ClosedChannelException
>  4873 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 9085605650438705047 to 
> bigdata-arch-hdp1126.bh.diditaxi.com/10.168.146.78:27023: java.nio.
>   channels.ClosedChannelException



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-01 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403238#comment-15403238
 ] 

Sean Zhong edited comment on SPARK-16320 at 8/2/16 2:22 AM:


[~maver1ck] Can you check whether the PR works for you?



was (Author: clockfly):
[~loziniak] Can you check whether the PR works for you?


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-01 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403238#comment-15403238
 ] 

Sean Zhong commented on SPARK-16320:


[~loziniak] Can you check whether the PR works for you?


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-08-01 Thread Sylvain Zimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403236#comment-15403236
 ] 

Sylvain Zimmer commented on SPARK-16826:


Sorry I can't be more helpful on the Java side... But I think there must be 
some high-quality URL parsing code somewhere in the Apache foundation already 
:-)

> java.util.Hashtable limits the throughput of PARSE_URL()
> 
>
> Key: SPARK-16826
> URL: https://issues.apache.org/jira/browse/SPARK-16826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.(URL.java:599)
> java.net.URL.(URL.java:490)
> java.net.URL.(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-08-01 Thread hejie (JIRA)
hejie created SPARK-16845:
-

 Summary: 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
 Key: SPARK-16845
 URL: https://issues.apache.org/jira/browse/SPARK-16845
 Project: Spark
  Issue Type: Bug
  Components: Java API, ML, MLlib
Affects Versions: 2.0.0
Reporter: hejie


I have a wide table (400 columns). When I try fitting the training data on all 
columns, the following fatal error occurs.


... 46 more
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
at org.codehaus.janino.CodeContext.write(CodeContext.java:854)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-08-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403222#comment-15403222
 ] 

Sean Owen commented on SPARK-16826:
---

URI.toURL just follows the same code path. Does URI itself parse all the same 
fields? Didn't think so because URIs are a superset of URLs.

Definitely open to suggestions. Anything that can parse the same fields 
respectably is OK.
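
For reference, a quick sketch of what java.net.URI does expose, compared with 
parse_url's HOST/PATH/QUERY/REF/PROTOCOL/FILE/AUTHORITY/USERINFO parts (FILE has 
no single accessor on URI and would need to be assembled from path and query):

{code}
// Quick sketch: the accessors java.net.URI provides out of the box.
import java.net.URI
val u = new URI("http://user@example.com:8080/a/b?x=1#frag")
(u.getScheme, u.getAuthority, u.getUserInfo, u.getHost, u.getPath, u.getQuery, u.getFragment)
// => (http, user@example.com:8080, user, example.com, /a/b, x=1, frag)
{code}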

> java.util.Hashtable limits the throughput of PARSE_URL()
> 
>
> Key: SPARK-16826
> URL: https://issues.apache.org/jira/browse/SPARK-16826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.(URL.java:599)
> java.net.URL.(URL.java:490)
> java.net.URL.(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16844) Generate code for sort based aggregation

2016-08-01 Thread yucai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403202#comment-15403202
 ] 

yucai commented on SPARK-16844:
---

We are working on whole-stage code generation for sort-based aggregation.
A PR and test report will be sent soon.
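
For context, a hedged sketch of a query that currently takes the sort-based path 
(an immutable aggregation buffer type such as a string max forces SortAggregate 
instead of HashAggregate), which is the kind of plan this codegen work would cover:

{code}
// Hedged sketch: max over a string column falls back to SortAggregate in 2.0,
// so its physical plan is a place to look for the new whole-stage codegen.
import org.apache.spark.sql.functions.max
val df = spark.range(100).selectExpr("id % 10 as k", "cast(id as string) as v")
df.groupBy("k").agg(max("v")).explain()
{code}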

> Generate code for sort based aggregation
> 
>
> Key: SPARK-16844
> URL: https://issues.apache.org/jira/browse/SPARK-16844
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: yucai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16844) Generate code for sort based aggregation

2016-08-01 Thread yucai (JIRA)
yucai created SPARK-16844:
-

 Summary: Generate code for sort based aggregation
 Key: SPARK-16844
 URL: https://issues.apache.org/jira/browse/SPARK-16844
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: yucai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector

2016-08-01 Thread Peng Meng (JIRA)
Peng Meng created SPARK-16843:
-

 Summary: Select features according to a percentile of the highest 
scores of ChiSqSelector
 Key: SPARK-16843
 URL: https://issues.apache.org/jira/browse/SPARK-16843
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 2.0.0
Reporter: Peng Meng
 Fix For: 2.0.1


It would be handy to add a percentile Param to ChiSqSelector, as in the 
scikit-learn one: 
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html
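
For context, a minimal sketch of the existing Scala API, which currently exposes 
only a fixed numTopFeatures; the proposed percentile Param would select a 
fraction of the features instead:

{code}
// Minimal sketch of the current API (Spark 2.0): only numTopFeatures today.
import org.apache.spark.ml.feature.ChiSqSelector
val selector = new ChiSqSelector()
  .setNumTopFeatures(50)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")
{code}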




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-08-01 Thread Sylvain Zimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403165#comment-15403165
 ] 

Sylvain Zimmer edited comment on SPARK-16826 at 8/2/16 1:15 AM:


[~srowen] thanks for the pointers! 

I'm parsing every hyperlink found in Common Crawl, so there are billions of 
unique ones, no way around it.

Wouldn't it be possible to switch to another implementation with an API similar 
to java.net.URL? As I understand it we never need the URLStreamHandler in the 
first place anyway?

I'm not a Java expert but what about {{java.net.URI}} or 
{{org.apache.catalina.util.URL}} for instance?



was (Author: sylvinus):
[~srowen] thanks for the pointers! 

I'm parsing every hyperlink found in Common Crawl, so there are billions of 
unique ones, no way around it.

Wouldn't it be possible to switch to another implementation with an API similar 
to java.net.URL? As I understand it we never need the URLStreamHandler in the 
first place anyway?

I'm not a Java expert but what about {java.net.URI} or 
{org.apache.catalina.util.URL} for instance?


> java.util.Hashtable limits the throughput of PARSE_URL()
> 
>
> Key: SPARK-16826
> URL: https://issues.apache.org/jira/browse/SPARK-16826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.(URL.java:599)
> java.net.URL.(URL.java:490)
> java.net.URL.(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-08-01 Thread Sylvain Zimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403165#comment-15403165
 ] 

Sylvain Zimmer commented on SPARK-16826:


[~srowen] thanks for the pointers! 

I'm parsing every hyperlink found in Common Crawl, so there are billions of 
unique ones, no way around it.

Wouldn't it be possible to switch to another implementation with an API similar 
to java.net.URL? As I understand it we never need the URLStreamHandler in the 
first place anyway?

I'm not a Java expert but what about {java.net.URI} or 
{org.apache.catalina.util.URL} for instance?


> java.util.Hashtable limits the throughput of PARSE_URL()
> 
>
> Key: SPARK-16826
> URL: https://issues.apache.org/jira/browse/SPARK-16826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.(URL.java:599)
> java.net.URL.(URL.java:490)
> java.net.URL.(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-16842:
-
Description: 
If my understanding is correct,

If the user-given schema is different with the inferred schema, it is handled 
differently for each datasource.

- For JSON and CSV
  it is kind of permissive generally (for example, compatibility among numeric 
types).

- For ORC and Parquet
  Generally it is strict to types. So they don't allow the compatibility 
(except for very few cases, e.g. for Parquet, 
https://github.com/apache/spark/pull/14272 and 
https://github.com/apache/spark/pull/14278)

- For Text
  it only supports {{StringType}}.

- For JDBC
  it does not take user-given schema since it does not implement 
{{SchemaRelationProvider}}.

By allowing the user-given schema, we can use some types such as {{DateType}} 
and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably permissive 
schema.

To cut this short, JSON and CSV do not have the complete schema information 
written in the data whereas Orc and Parquet do. 

So, we might have to just disallow giving user-given schema for Parquet and 
Orc. Actually, we can't give a different schema for Orc and Parquet almost at 
all times if my understanding it correct. 

  was:
If my understanding is correct,

If the user-given schema is different with the inferred schema, it is handled 
differently for each datasource.

- For JSON and CSV
  it is kind of permissive generally (for example, compatibility among numeric 
types).

- For ORC and Parquet
  Generally it is strict to types. So they don't allow the compatibility 
(except for very few cases, e.g. for Parquet, 
https://github.com/apache/spark/pull/14272 and 
https://github.com/apache/spark/pull/14278)

- For Text
  it only supports {{StringType}}.

- For JDBC
  it does not take user-given schema since it does not implement 
{{SchemaRelationProvider}}.

By allowing the user-given schema, we can use some types such as {{DateType}} 
and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably permissive 
schema.

To cut this short, JSON and CSV do not have the complete schema information 
written in the data whereas Orc and Parquet do. 

So, we might have to just disallow giving user-given schema. Actually, we can't 
give a different schema for Orc and Parquet almost at all times if my 
understanding it correct. 


> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct,
> If the user-given schema is different with the inferred schema, it is handled 
> differently for each datasource.
> - For JSON and CSV
>   it is kind of permissive generally (for example, compatibility among 
> numeric types).
> - For ORC and Parquet
>   Generally it is strict to types. So they don't allow the compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278)
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing the user-given schema, we can use some types such as {{DateType}} 
> and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably 
> permissive schema.
> To cut this short, JSON and CSV do not have the complete schema information 
> written in the data whereas Orc and Parquet do. 
> So, we might have to just disallow giving user-given schema for Parquet and 
> Orc. Actually, we can't give a different schema for Orc and Parquet almost at 
> all times if my understanding it correct. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-16842:
-
Description: 
If my understanding is correct,

If the user-given schema is different with the inferred schema, it is handled 
differently for each datasource.

- For JSON and CSV
  it is kind of permissive generally (for example, compatibility among numeric 
types).

- For ORC and Parquet
  Generally it is strict to types. So they don't allow the compatibility 
(except for very few cases, e.g. for Parquet, 
https://github.com/apache/spark/pull/14272 and 
https://github.com/apache/spark/pull/14278)

- For Text
  it only supports {{StringType}}.

- For JDBC
  it does not take user-given schema since it does not implement 
{{SchemaRelationProvider}}.

By allowing the user-given schema, we can use some types such as {{DateType}} 
and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably permissive 
schema.

To cut this short, JSON and CSV do not have the complete schema information 
written in the data whereas Orc and Parquet do. 

So, we might have to just disallow giving user-given schema. Actually, we can't 
give a different schema for Orc and Parquet almost at all times if my 
understanding it correct. 

  was:
If my understanding is correct,

If the user-given schema is different with the inferred schema, it is handled 
differently for each datasource.

- For JSON and CSV
  it is kind of permissive generally (for example, compatibility among numeric 
types).

- For ORC and Parquet
  Generally it is strict to types. So they don't allow the compatibility 
(except for very few cases, e.g. for Parquet, 
https://github.com/apache/spark/pull/14272 and 
https://github.com/apache/spark/pull/14278)

- For Text
  it only supports `StringType`.

- For JDBC
  it does not take user-given schema since it does not implement 
`SchemaRelationProvider`.

By allowing the user-given schema, we can use some types such as {{DateType}} 
and {{TimestampType}} for JSON and CSV. CSV and JSON allows arguably permissive 
schema.

To cut this short, JSON and CSV do not have the complete schema information 
written in the data whereas Orc and Parquet do. 

So, we might have to just disallow giving user-given schema. Actually, we can't 
give schemas for Orc and Parquet almost at all times if my understanding it 
correct. 


> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct,
> If the user-given schema is different with the inferred schema, it is handled 
> differently for each datasource.
> - For JSON and CSV
>   it is kind of permissive generally (for example, compatibility among 
> numeric types).
> - For ORC and Parquet
>   Generally it is strict to types. So they don't allow the compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278)
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing the user-given schema, we can use some types such as {{DateType}} 
> and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably 
> permissive schema.
> To cut this short, JSON and CSV do not have the complete schema information 
> written in the data whereas Orc and Parquet do. 
> So, we might have to just disallow giving user-given schema. Actually, we 
> can't give a different schema for Orc and Parquet almost at all times if my 
> understanding it correct. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403139#comment-15403139
 ] 

Hyukjin Kwon commented on SPARK-16842:
--

Let me cc [~liancheng], [~smilegator], [~dongjoon] and [~cloud_fan], who I think 
are related to this JIRA.

> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct,
> If the user-given schema is different with the inferred schema, it is handled 
> differently for each datasource.
> - For JSON and CSV
>   it is kind of permissive generally (for example, compatibility among 
> numeric types).
> - For ORC and Parquet
>   Generally it is strict to types. So they don't allow the compatibility 
> (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278)
> - For Text
>   it only supports `StringType`.
> - For JDBC
>   it does not take user-given schema since it does not implement 
> `SchemaRelationProvider`.
> By allowing the user-given schema, we can use some types such as {{DateType}} 
> and {{TimestampType}} for JSON and CSV. CSV and JSON allows arguably 
> permissive schema.
> To cut this short, JSON and CSV do not have the complete schema information 
> written in the data whereas Orc and Parquet do. 
> So, we might have to just disallow giving user-given schema. Actually, we 
> can't give schemas for Orc and Parquet almost at all times if my 
> understanding it correct. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403137#comment-15403137
 ] 

Apache Spark commented on SPARK-16445:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14447

> Multilayer Perceptron Classifier wrapper in SparkR
> --
>
> Key: SPARK-16445
> URL: https://issues.apache.org/jira/browse/SPARK-16445
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xin Ren
>
> Follow instructions in SPARK-16442 and implement multilayer perceptron 
> classifier wrapper in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16828) remove MaxOf and MinOf

2016-08-01 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16828.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14434
[https://github.com/apache/spark/pull/14434]

> remove MaxOf and MinOf
> --
>
> Key: SPARK-16828
> URL: https://issues.apache.org/jira/browse/SPARK-16828
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-01 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-16842:


 Summary: Concern about disallowing user-given schema for Parquet 
and ORC
 Key: SPARK-16842
 URL: https://issues.apache.org/jira/browse/SPARK-16842
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon


If my understanding is correct,

If the user-given schema is different with the inferred schema, it is handled 
differently for each datasource.

- For JSON and CSV
  it is kind of permissive generally (for example, compatibility among numeric 
types).

- For ORC and Parquet
  Generally it is strict to types. So they don't allow the compatibility 
(except for very few cases, e.g. for Parquet, 
https://github.com/apache/spark/pull/14272 and 
https://github.com/apache/spark/pull/14278)

- For Text
  it only supports `StringType`.

- For JDBC
  it does not take user-given schema since it does not implement 
`SchemaRelationProvider`.

By allowing the user-given schema, we can use some types such as {{DateType}} 
and {{TimestampType}} for JSON and CSV. CSV and JSON allows arguably permissive 
schema.

To cut this short, JSON and CSV do not have the complete schema information 
written in the data whereas Orc and Parquet do. 

So, we might have to just disallow giving user-given schema. Actually, we can't 
give schemas for Orc and Parquet almost at all times if my understanding it 
correct. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-01 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403116#comment-15403116
 ] 

Charles Allen commented on SPARK-16798:
---

Yep, still happens:

{code}
16/08/02 00:41:17 INFO HadoopRDD: Input split: REDACTED.gz:0+7389144
16/08/02 00:41:17 INFO TorrentBroadcast: Started reading broadcast variable 0
16/08/02 00:41:17 INFO TransportClientFactory: Successfully created connection 
to /<> after 1 ms (0 ms spent in bootstraps)
16/08/02 00:41:17 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 18.2 KB, free 3.6 GB)
16/08/02 00:41:17 INFO TorrentBroadcast: Reading broadcast variable 0 took 34 ms
16/08/02 00:41:17 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 209.2 KB, free 3.6 GB)
16/08/02 00:41:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
mapreduce.task.id
16/08/02 00:41:18 INFO deprecation: mapred.task.id is deprecated. Instead, use 
mapreduce.task.attempt.id
16/08/02 00:41:18 INFO deprecation: mapred.task.is.map is deprecated. Instead, 
use mapreduce.task.ismap
16/08/02 00:41:18 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
16/08/02 00:41:18 INFO deprecation: mapred.job.id is deprecated. Instead, use 
mapreduce.job.id
16/08/02 00:41:18 INFO NativeS3FileSystem: Opening 'REDACTED.gz' for reading
16/08/02 00:41:18 INFO CodecPool: Got brand-new decompressor [.gz]
16/08/02 00:41:19 ERROR Executor: Exception in task 11.0 in stage 0.0 (TID 11)
java.lang.IllegalArgumentException: bound must be positive
at java.util.Random.nextInt(Random.java:388)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Commented] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException

2016-08-01 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403110#comment-15403110
 ] 

Miao Wang commented on SPARK-16802:
---

With the latest code, it should have been fixed. I re-ran the test code for 10+ 
minutes.

> joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
> 
>
> Key: SPARK-16802
> URL: https://issues.apache.org/jira/browse/SPARK-16802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Davies Liu
>Priority: Critical
>
> Hello!
> This is a little similar to 
> [SPARK-16740|https://issues.apache.org/jira/browse/SPARK-16740] (should I 
> have reopened it?).
> I would recommend to give another full review to {{HashedRelation.scala}}, 
> particularly the new {{LongToUnsafeRowMap}} code. I've had a few other errors 
> that I haven't managed to reproduce so far, as well as what I suspect could 
> be memory leaks (I have a query in a loop OOMing after a few iterations 
> despite not caching its results).
> Here is the script to reproduce the ArrayIndexOutOfBoundsException on the 
> current 2.0 branch:
> {code}
> import os
> import random
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> def randlong():
> return random.randint(-9223372036854775808, 9223372036854775807)
> while True:
> l1, l2 = randlong(), randlong()
> # Sample values that crash:
> # l1, l2 = 4661454128115150227, -5543241376386463808
> print "Testing with %s, %s" % (l1, l2)
> data1 = [(l1, ), (l2, )]
> data2 = [(l1, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> sqlc.registerDataFrameAsTable(df1, "df1")
> sqlc.registerDataFrameAsTable(df2, "df2")
> result_df = sqlc.sql("""
> SELECT
> df1.id1
> FROM df1
> LEFT OUTER JOIN df2 ON df1.id1 = df2.id2
> """)
> print result_df.collect()
> {code}
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 1728150825
>   at 
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463)
>   at 
> org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898)

[jira] [Commented] (SPARK-16832) CrossValidator and TrainValidationSplit are not random without seed

2016-08-01 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403067#comment-15403067
 ] 

Bryan Cutler commented on SPARK-16832:
--

The default seed value is a constant; it is assigned in this trait 
[here|https://github.com/apache/spark/blob/v2.0.0/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L310].
In order to get the behavior you want, you would need to explicitly set the 
seed to a unique value for each run, as you mentioned.
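
For example, a minimal PySpark sketch of that workaround (a cut-down version of the 
dataset from the description; it assumes an existing SparkSession named {{spark}}, 
and the seed values are arbitrary):

{code}
import pyspark.ml.tuning
import pyspark.ml.regression
import pyspark.ml.evaluation
from pyspark.ml.linalg import Vectors

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([1.0]), 1.0)] * 1000,
    ["features", "label"]).cache()

paramGrid = pyspark.ml.tuning.ParamGridBuilder().build()

# Pass a different, explicit seed on each run so the internal random split changes.
for seed in (11, 42, 2016):
    tvs = pyspark.ml.tuning.TrainValidationSplit(
        estimator=pyspark.ml.regression.LinearRegression(),
        estimatorParamMaps=paramGrid,
        evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
        trainRatio=0.8,
        seed=seed)
    model = tvs.fit(dataset)
    print(seed, model.validationMetrics)
{code}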

> CrossValidator and TrainValidationSplit are not random without seed
> ---
>
> Key: SPARK-16832
> URL: https://issues.apache.org/jira/browse/SPARK-16832
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>Priority: Minor
>
> Repeatedly running CrossValidator or TrainValidationSplit without an explicit 
> seed parameter does not change results. It is supposed to be seeded with a 
> random seed, but it seems instead to be seeded with some constant. (If the seed 
> is explicitly provided, the two classes behave as expected.)
> {code}
> dataset = spark.createDataFrame(
>   [(Vectors.dense([0.0]), 0.0),
>(Vectors.dense([0.4]), 1.0),
>(Vectors.dense([0.5]), 0.0),
>(Vectors.dense([0.6]), 1.0),
>(Vectors.dense([1.0]), 1.0)] * 1000,
>   ["features", "label"]).cache()
> paramGrid = pyspark.ml.tuning.ParamGridBuilder().build()
> tvs = 
> pyspark.ml.tuning.TrainValidationSplit(estimator=pyspark.ml.regression.LinearRegression(),
>  
>estimatorParamMaps=paramGrid,
>
> evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
>trainRatio=0.8)
> model = tvs.fit(dataset)
> print(model.validationMetrics)
> for folds in (3, 5, 10):
>   cv = 
> pyspark.ml.tuning.CrossValidator(estimator=pyspark.ml.regression.LinearRegression(),
>  
>   estimatorParamMaps=paramGrid, 
>   
> evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
>   numFolds=folds
>  )
>   cvModel = cv.fit(dataset)
>   print(folds, cvModel.avgMetrics)
> {code}
> This code produces identical results upon repeated calls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16841:


Assignee: Apache Spark

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>Assignee: Apache Spark
>
> When reading a Parquet table, Spark adds row-level metrics like recordsRead and 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient: when the Parquet vectorized reader is 
> not used, updating these metrics may take 20% of the read time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403053#comment-15403053
 ] 

Apache Spark commented on SPARK-16841:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14446

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading a Parquet table, Spark adds row-level metrics like recordsRead and 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient: when the Parquet vectorized reader is 
> not used, updating these metrics may take 20% of the read time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16841:


Assignee: (was: Apache Spark)

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading a Parquet table, Spark adds row-level metrics like recordsRead and 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient: when the Parquet vectorized reader is 
> not used, updating these metrics may take 20% of the read time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-01 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-16841:
---
Summary: Improves the row level metrics performance when reading Parquet 
table  (was: Improve the row level metrics performance when reading Parquet 
table)

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading a Parquet table, Spark adds row-level metrics like recordsRead and 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient: when the Parquet vectorized reader is 
> not used, updating these metrics may take 20% of the read time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16841) Improve the row level metrics performance when reading Parquet table

2016-08-01 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-16841:
--

 Summary: Improve the row level metrics performance when reading 
Parquet table
 Key: SPARK-16841
 URL: https://issues.apache.org/jira/browse/SPARK-16841
 Project: Spark
  Issue Type: Improvement
Reporter: Sean Zhong


When reading a Parquet table, Spark adds row-level metrics like recordsRead and 
bytesRead 
(https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).

The implementation is not very efficient: when the Parquet vectorized reader is not 
used, updating these metrics may take 20% of the read time.







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException

2016-08-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-16802:
--

Assignee: Davies Liu

> joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
> 
>
> Key: SPARK-16802
> URL: https://issues.apache.org/jira/browse/SPARK-16802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Davies Liu
>Priority: Critical
>
> Hello!
> This is a little similar to 
> [SPARK-16740|https://issues.apache.org/jira/browse/SPARK-16740] (should I 
> have reopened it?).
> I would recommend giving another full review to {{HashedRelation.scala}}, 
> particularly the new {{LongToUnsafeRowMap}} code. I've had a few other errors 
> that I haven't managed to reproduce so far, as well as what I suspect could 
> be memory leaks (I have a query in a loop OOMing after a few iterations 
> despite not caching its results).
> Here is the script to reproduce the ArrayIndexOutOfBoundsException on the 
> current 2.0 branch:
> {code}
> import os
> import random
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> def randlong():
> return random.randint(-9223372036854775808, 9223372036854775807)
> while True:
> l1, l2 = randlong(), randlong()
> # Sample values that crash:
> # l1, l2 = 4661454128115150227, -5543241376386463808
> print "Testing with %s, %s" % (l1, l2)
> data1 = [(l1, ), (l2, )]
> data2 = [(l1, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> sqlc.registerDataFrameAsTable(df1, "df1")
> sqlc.registerDataFrameAsTable(df2, "df2")
> result_df = sqlc.sql("""
> SELECT
> df1.id1
> FROM df1
> LEFT OUTER JOIN df2 ON df1.id1 = df2.id2
> """)
> print result_df.collect()
> {code}
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 1728150825
>   at 
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463)
>   at 
> org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at 

[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402961#comment-15402961
 ] 

Apache Spark commented on SPARK-16320:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14445

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions), and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created a script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the same 
> source.
> I attached some VisualVM profiles there.
> The most interesting ones are from the queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16320:


Assignee: (was: Apache Spark)

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions), and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created a script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the same 
> source.
> I attached some VisualVM profiles there.
> The most interesting ones are from the queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16320:


Assignee: Apache Spark

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>Priority: Critical
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions), and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created a script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the same 
> source.
> I attached some VisualVM profiles there.
> The most interesting ones are from the queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16840) Please save the aggregate term frequencies as part of the NaiveBayesModel

2016-08-01 Thread Barry Becker (JIRA)
Barry Becker created SPARK-16840:


 Summary: Please save the aggregate term frequencies as part of the 
NaiveBayesModel
 Key: SPARK-16840
 URL: https://issues.apache.org/jira/browse/SPARK-16840
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.0.0, 1.6.2
Reporter: Barry Becker


I would like to visualize the structure of the NaiveBayes model in order to get 
additional insight into the patterns in the data. In order to do that, I need 
the frequencies for each feature value per label.

This exact information is computed in the NaiveBayes.run method (see the 
"aggregated" variable), but then discarded when creating the model. Pi and 
theta are computed from the aggregated frequency counts, but surprisingly 
those counts are not needed to apply the model. Including these aggregated counts 
would not add much to the model size, and they could be very useful for some 
applications of the model.

{code}
  def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
 :
// Aggregates term frequencies per label.
val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, 
DenseVector)](
  createCombiner = (v: Vector) => {
:
  },
:
new NaiveBayesModel(labels, pi, theta, modelType) // <- please include 
"aggregated" here.
  }
{code}
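
As a stop-gap, here is a rough PySpark sketch (not part of MLlib, and not the 
ticket's proposal) that recomputes the per-label aggregated feature counts from the 
same training RDD, at the cost of an extra pass over the data; it assumes an 
existing SparkContext named {{sc}}:

{code}
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# data: the same RDD[LabeledPoint] that would be passed to NaiveBayes.train
data = sc.parallelize([
    LabeledPoint(0.0, Vectors.dense([1.0, 0.0, 3.0])),
    LabeledPoint(1.0, Vectors.dense([0.0, 2.0, 1.0])),
    LabeledPoint(1.0, Vectors.dense([0.0, 1.0, 1.0]))])

# label -> (document count, summed term frequencies per feature)
aggregated = (data
              .map(lambda p: (p.label, (1, p.features.toArray())))
              .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
              .collectAsMap())

for label, (count, freqs) in aggregated.items():
    print(label, count, freqs)
{code}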



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16839) CleanupAliases may leave redundant aliases at end of analysis state

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402920#comment-15402920
 ] 

Apache Spark commented on SPARK-16839:
--

User 'eyalfa' has created a pull request for this issue:
https://github.com/apache/spark/pull/1

> CleanupAliases may leave redundant aliases at end of analysis state
> ---
>
> Key: SPARK-16839
> URL: https://issues.apache.org/jira/browse/SPARK-16839
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Eyal Farago
>Priority: Minor
>  Labels: alias, analysis, analyzers, sql, struct
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> [SPARK-9634] [SPARK-9323] [SQL] introduced CleanupReferences, which removes 
> unnecessary Aliases while keeping required ones such as the top-level Projection 
> and struct attributes. This mechanism is implemented by maintaining a boolean 
> flag during a top-down expression transformation; I found a case where this 
> mechanism leaves redundant aliases in the tree (within a right sibling of a 
> create_struct node).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16839) CleanupAliases may leave redundant aliases at end of analysis state

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16839:


Assignee: (was: Apache Spark)

> CleanupAliases may leave redundant aliases at end of analysis state
> ---
>
> Key: SPARK-16839
> URL: https://issues.apache.org/jira/browse/SPARK-16839
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Eyal Farago
>Priority: Minor
>  Labels: alias, analysis, analyzers, sql, struct
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> [SPARK-9634] [SPARK-9323] [SQL] introduced CleanupReferences, which removes 
> unnecessary Aliases while keeping required ones such as the top-level Projection 
> and struct attributes. This mechanism is implemented by maintaining a boolean 
> flag during a top-down expression transformation; I found a case where this 
> mechanism leaves redundant aliases in the tree (within a right sibling of a 
> create_struct node).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16839) CleanupAliases may leave redundant aliases at end of analysis state

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16839:


Assignee: Apache Spark

> CleanupAliases may leave redundant aliases at end of analysis state
> ---
>
> Key: SPARK-16839
> URL: https://issues.apache.org/jira/browse/SPARK-16839
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Eyal Farago
>Assignee: Apache Spark
>Priority: Minor
>  Labels: alias, analysis, analyzers, sql, struct
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> [SPARK-9634] [SPARK-9323] [SQL] introduced CleanupReferences, which removes 
> unnecessary Aliases while keeping required ones such as the top-level Projection 
> and struct attributes. This mechanism is implemented by maintaining a boolean 
> flag during a top-down expression transformation; I found a case where this 
> mechanism leaves redundant aliases in the tree (within a right sibling of a 
> create_struct node).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16839) CleanupAliases may leave redundant aliases at end of analysis state

2016-08-01 Thread Eyal Farago (JIRA)
Eyal Farago created SPARK-16839:
---

 Summary: CleanupAliases may leave redundant aliases at end of 
analysis state
 Key: SPARK-16839
 URL: https://issues.apache.org/jira/browse/SPARK-16839
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 1.6.1
Reporter: Eyal Farago
Priority: Minor


[SPARK-9634] [SPARK-9323] [SQL] introduced CleanupReferences, which removes 
unnecessary Aliases while keeping required ones such as the top-level Projection 
and struct attributes. This mechanism is implemented by maintaining a boolean 
flag during a top-down expression transformation; I found a case where this 
mechanism leaves redundant aliases in the tree (within a right sibling of a 
create_struct node).
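
The ticket does not include a reproduction, but here is a hypothetical PySpark 
sketch of the shape of query involved (aliased children inside a struct next to an 
aliased sibling expression), where the analyzed plan can be inspected for leftover 
aliases; it assumes an existing SparkSession named {{spark}}:

{code}
from pyspark.sql import functions as F

df = spark.range(10).select(F.col("id").alias("a"), (F.col("id") * 2).alias("b"))

# A struct whose children carry aliases, next to another aliased expression.
out = df.select(
    F.struct(F.col("a").alias("x"), F.col("b").alias("y")).alias("s"),
    (F.col("a") + F.col("b")).alias("a_plus_b"))

# explain(True) prints the analyzed plan, where redundant Alias nodes would show up.
out.explain(True)
{code}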



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15869) HTTP 500 and NPE on streaming batch details page

2016-08-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-15869.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.1.0
   2.0.1

> HTTP 500 and NPE on streaming batch details page
> 
>
> Key: SPARK-15869
> URL: https://issues.apache.org/jira/browse/SPARK-15869
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Shixiong Zhu
> Fix For: 2.0.1, 2.1.0
>
>
> When I'm trying to show details of streaming batch I'm getting NPE.
> Sample link:
> http://127.0.0.1:4040/streaming/batch/?id=146555370
> Error:
> {code}
> HTTP ERROR 500
> Problem accessing /streaming/batch/. Reason:
> Server Error
> Caused by:
> java.lang.NullPointerException
>   at 
> scala.collection.convert.Wrappers$JCollectionWrapper.iterator(Wrappers.scala:59)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:320)
>   at scala.collection.AbstractTraversable.groupBy(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.ui.BatchPage.generateJobTable(BatchPage.scala:273)
>   at org.apache.spark.streaming.ui.BatchPage.render(BatchPage.scala:358)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:83)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16834) TrainValildationSplit and direct evaluation produce different scores

2016-08-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402847#comment-15402847
 ] 

Sean Owen commented on SPARK-16834:
---

Hm, I see. Is it due to the bug you found in 
https://issues.apache.org/jira/browse/SPARK-16831 causing the metrics to almost 
always be too large for the CrossValidationModel?

Why use different data sets in both cases? To make this a direct comparison, 
train both on the same data and evaluate on the same set.
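
For illustration, a rough sketch of such a direct comparison on a toy dataset 
similar to the one in the description (assuming an existing SparkSession named 
{{spark}}); note that TrainValidationSplit still performs its own internal split, 
so the two metrics are only expected to agree statistically, not exactly:

{code}
import pyspark.ml.tuning
import pyspark.ml.regression
import pyspark.ml.evaluation
from pyspark.ml.linalg import Vectors

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([1.0]), 1.0)] * 1000,
    ["features", "label"]).cache()

lr = pyspark.ml.regression.LinearRegression()
evaluator = pyspark.ml.evaluation.RegressionEvaluator()
paramGrid = pyspark.ml.tuning.ParamGridBuilder().build()

# Manual evaluation on an explicit 50/50 split.
train, val = dataset.randomSplit([0.5, 0.5], seed=42)
manual_metric = evaluator.evaluate(lr.fit(train).transform(val))

# TrainValidationSplit over the same full dataset with the same ratio;
# it performs its own internal 50/50 split, so agreement is statistical, not exact.
tvs = pyspark.ml.tuning.TrainValidationSplit(
    estimator=lr, estimatorParamMaps=paramGrid,
    evaluator=evaluator, trainRatio=0.5, seed=42)
tvs_metric = tvs.fit(dataset).validationMetrics[0]

print(manual_metric, tvs_metric)
{code}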

> TrainValildationSplit and direct evaluation produce different scores
> 
>
> Key: SPARK-16834
> URL: https://issues.apache.org/jira/browse/SPARK-16834
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>
> The two segments of code below are supposed to do the same thing: one is 
> using TrainValidationSplit, the other performs the same evaluation manually. 
> However, their results are statistically different (in my case, in a loop of 
> 20, I regularly get ~19 True values). 
> Unfortunately, I didn't find the bug in the source code.
> {code}
> dataset = spark.createDataFrame(
>   [(Vectors.dense([0.0]), 0.0),
>(Vectors.dense([0.4]), 1.0),
>(Vectors.dense([0.5]), 0.0),
>(Vectors.dense([0.6]), 1.0),
>(Vectors.dense([1.0]), 1.0)] * 1000,
>   ["features", "label"]).cache()
> paramGrid = pyspark.ml.tuning.ParamGridBuilder().build()
> # note that test is NEVER used in this code
> # I create it only to utilize randomSplit
> for i in range(20):
>   train, test = dataset.randomSplit([0.8, 0.2])
>   tvs = 
> pyspark.ml.tuning.TrainValidationSplit(estimator=pyspark.ml.regression.LinearRegression(),
>  
>  estimatorParamMaps=paramGrid,
>  
> evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
>  trainRatio=0.5)
>   model = tvs.fit(train)
>   train, val, test = dataset.randomSplit([0.4, 0.4, 0.2])
>   lr=pyspark.ml.regression.LinearRegression()
>   evaluator=pyspark.ml.evaluation.RegressionEvaluator()
>   lrModel = lr.fit(train)
>   predicted = lrModel.transform(val)
>   print(model.validationMetrics[0] < evaluator.evaluate(predicted))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data

2016-08-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16548.
---
Resolution: Won't Fix

> java.io.CharConversionException: Invalid UTF-32 character  prevents me from 
> querying my data
> 
>
> Key: SPARK-16548
> URL: https://issues.apache.org/jira/browse/SPARK-16548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Egor Pahomov
>Priority: Minor
>
> Basically, when I query my json data I get 
> {code}
> java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above 
> 10)  at char #192, byte #771)
>   at 
> com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
>   at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142)
> {code}
> I do not like it. If you cannot process one JSON document among 100500, please 
> return null instead of failing everything. I have a dirty one-line fix, and I 
> understand how I can make it more reasonable. What is our position - what 
> behaviour do we want?
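
A possible user-side workaround (not from this ticket; the column and field names 
are hypothetical) is to parse the JSON in a Python UDF that returns null for 
malformed documents, instead of relying on {{get_json_object}}:

{code}
import json
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def safe_json_get(raw, field):
    # Returns None (null) for malformed JSON instead of failing the whole query.
    try:
        value = json.loads(raw).get(field)
    except Exception:
        return None
    return None if value is None else unicode(value)

extract_id = udf(lambda raw: safe_json_get(raw, "id"), StringType())
# df.withColumn("id", extract_id(df["json_col"]))
{code}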



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16495) Add ADMM optimizer in mllib package

2016-08-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16495.
---
Resolution: Later

> Add ADMM optimizer in mllib package
> ---
>
> Key: SPARK-16495
> URL: https://issues.apache.org/jira/browse/SPARK-16495
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: zunwen you
>
> Alternating Direction Method of Multipliers (ADMM) is well suited to 
> distributed convex optimization, and in particular to large-scale problems 
> arising in statistics, machine learning, and related areas.
> Details can be found in [S. Boyd's 
> paper](http://www.stanford.edu/~boyd/papers/admm_distr_stats.html).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16465) Add nonnegative flag to mllib ALS

2016-08-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16465.
---
Resolution: Won't Fix

> Add nonnegative flag to mllib ALS
> -
>
> Key: SPARK-16465
> URL: https://issues.apache.org/jira/browse/SPARK-16465
> Project: Spark
>  Issue Type: New Feature
>Reporter: Roberto Pagliari
>Priority: Minor
>
> Currently, this flag is available in ml, not in mllib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16801) clearThreshold does not work for SparseVector

2016-08-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16801.
---
Resolution: Not A Problem

> clearThreshold does not work for SparseVector
> -
>
> Key: SPARK-16801
> URL: https://issues.apache.org/jira/browse/SPARK-16801
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.2
>Reporter: Rahul Shah
>Priority: Minor
>
> The LogisticRegression model of the mllib library gives seemingly random results 
> when passed a SparseVector instead of a DenseVector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16774) Fix use of deprecated TimeStamp constructor (also providing incorrect results)

2016-08-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16774:
--
Assignee: holdenk
Priority: Minor  (was: Major)

> Fix use of deprecated TimeStamp constructor (also providing incorrect results)
> --
>
> Key: SPARK-16774
> URL: https://issues.apache.org/jira/browse/SPARK-16774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> The Timestamp constructor we use inside of DateTimeUtils has been deprecated 
> since JDK 1.1 - while Java does take a long time to remove deprecated 
> functionality, we might as well address this. Additionally, it does not handle 
> DST boundaries correctly all the time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7445) StringIndexer should handle binary labels properly

2016-08-01 Thread Ruben Janssen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402786#comment-15402786
 ] 

Ruben Janssen edited comment on SPARK-7445 at 8/1/16 8:57 PM:
--

I'd be interested to work on this.

Before I start however, just to clarify:
'Another option is to allow users to provide a list or labels and we use the 
ordering.' I think you mean 'Another option is to allow users to provide a list 
OF labels and use THAT GIVEN ordering.'?

If that is the case, I would advocate the latter because having binary labels 
does not necessarily imply that we have negatives and positives. We could have 
"left"/"right" for example. This would also be more flexible for the users and 
does not have to be limited to just binary labels. 


was (Author: rubenjanssen):
I'd be interested to work on this.

Before I start however, just to clarify:
'Another option is to allow users to provide a list or labels and we use the 
ordering.' I think you mean 'Another option is to allow users to provide a list 
OF labels and use THAT GIVEN ordering.'?

If that is the case, I would advocate that because having binary labels does 
not necessarily imply that we have negatives and positives. We could have 
"left"/"right" for example. This would also be more flexible for the users and 
does not have to be limited to just binary labels. 

> StringIndexer should handle binary labels properly
> --
>
> Key: SPARK-7445
> URL: https://issues.apache.org/jira/browse/SPARK-7445
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> StringIndexer orders labels by their counts. However, for binary labels, we 
> should really map negatives to 0 and positive to 1. So can put special rules 
> for binary labels:
> 1. "+1"/"-1", "1"/"-1", "1"/"0"
> 2. "yes"/"no"
> 3. "true"/"false"
> Another option is to allow users to provide a list or labels and we use the 
> ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16774) Fix use of deprecated TimeStamp constructor (also providing incorrect results)

2016-08-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16774.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14398
[https://github.com/apache/spark/pull/14398]

> Fix use of deprecated TimeStamp constructor (also providing incorrect results)
> --
>
> Key: SPARK-16774
> URL: https://issues.apache.org/jira/browse/SPARK-16774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
> Fix For: 2.0.1, 2.1.0
>
>
> The Timestamp constructor we use inside of DateTimeUtils has been deprecated 
> since JDK 1.1 - while Java does take a long time to remove deprecated 
> functionality, we might as well address this. Additionally, it does not handle 
> DST boundaries correctly all the time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7445) StringIndexer should handle binary labels properly

2016-08-01 Thread Ruben Janssen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402786#comment-15402786
 ] 

Ruben Janssen commented on SPARK-7445:
--

I'd be interested to work on this.

Before I start however, just to clarify:
'Another option is to allow users to provide a list or labels and we use the 
ordering.' I think you mean 'Another option is to allow users to provide a list 
OF labels and use THAT GIVEN ordering.'?

If that is the case, I would advocate that because having binary labels does 
not necessarily imply that we have negatives and positives. We could have 
"left"/"right" for example. This would also be more flexible for the users and 
does not have to be limited to just binary labels. 
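
For reference, the current frequency-based ordering described in the issue (quoted 
below) can be seen with a toy PySpark example, assuming an existing SparkSession 
named {{spark}}:

{code}
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
    [("yes",), ("yes",), ("yes",), ("no",)], ["label"])

indexed = StringIndexer(inputCol="label", outputCol="labelIndex").fit(df).transform(df)
indexed.show()
# "yes" is the most frequent label, so it is mapped to 0.0 and "no" to 1.0,
# the opposite of the usual positive=1 / negative=0 convention.
{code}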

> StringIndexer should handle binary labels properly
> --
>
> Key: SPARK-7445
> URL: https://issues.apache.org/jira/browse/SPARK-7445
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> StringIndexer orders labels by their counts. However, for binary labels, we 
> should really map negatives to 0 and positive to 1. So can put special rules 
> for binary labels:
> 1. "+1"/"-1", "1"/"-1", "1"/"0"
> 2. "yes"/"no"
> 3. "true"/"false"
> Another option is to allow users to provide a list or labels and we use the 
> ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-01 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402739#comment-15402739
 ] 

Davies Liu commented on SPARK-16700:


There are two separate problems here:

1) Spark 2.0 enforces data type checking when creating a DataFrame; it's safer 
but slower. It makes sense to have a flag for that (on by default).

2) A Row object is similar to a named tuple (not a dict): the columns are ordered. 
When it's created dict-style (with keyword arguments), we have no way to know the 
order of the columns, so they are sorted by name, and then they do not match the 
schema provided. We should check the schema (order of columns) when creating a 
DataFrame from an RDD of Row (we currently assume they match).
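
A small sketch of point 2, with toy column names (assuming an existing SparkSession 
named {{spark}}):

{code}
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType

# Keyword arguments are sorted by name, so the resulting Row is (a=2, z=1).
r = Row(z=1, a=2)
print(r)

# A schema that lists z before a no longer lines up with the Row's field order,
# so without an order check the values can land under the wrong columns.
schema = StructType([StructField("z", LongType()),
                     StructField("a", LongType())])
spark.createDataFrame([r], schema).show()
{code}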

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python dict type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type 
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15869) HTTP 500 and NPE on streaming batch details page

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15869:


Assignee: Apache Spark

> HTTP 500 and NPE on streaming batch details page
> 
>
> Key: SPARK-15869
> URL: https://issues.apache.org/jira/browse/SPARK-15869
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>
> When I'm trying to show details of streaming batch I'm getting NPE.
> Sample link:
> http://127.0.0.1:4040/streaming/batch/?id=146555370
> Error:
> {code}
> HTTP ERROR 500
> Problem accessing /streaming/batch/. Reason:
> Server Error
> Caused by:
> java.lang.NullPointerException
>   at 
> scala.collection.convert.Wrappers$JCollectionWrapper.iterator(Wrappers.scala:59)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:320)
>   at scala.collection.AbstractTraversable.groupBy(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.ui.BatchPage.generateJobTable(BatchPage.scala:273)
>   at org.apache.spark.streaming.ui.BatchPage.render(BatchPage.scala:358)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:83)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15869) HTTP 500 and NPE on streaming batch details page

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15869:


Assignee: (was: Apache Spark)

> HTTP 500 and NPE on streaming batch details page
> 
>
> Key: SPARK-15869
> URL: https://issues.apache.org/jira/browse/SPARK-15869
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> When I'm trying to show details of streaming batch I'm getting NPE.
> Sample link:
> http://127.0.0.1:4040/streaming/batch/?id=146555370
> Error:
> {code}
> HTTP ERROR 500
> Problem accessing /streaming/batch/. Reason:
> Server Error
> Caused by:
> java.lang.NullPointerException
>   at 
> scala.collection.convert.Wrappers$JCollectionWrapper.iterator(Wrappers.scala:59)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:320)
>   at scala.collection.AbstractTraversable.groupBy(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.ui.BatchPage.generateJobTable(BatchPage.scala:273)
>   at org.apache.spark.streaming.ui.BatchPage.render(BatchPage.scala:358)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:83)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15869) HTTP 500 and NPE on streaming batch details page

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402730#comment-15402730
 ] 

Apache Spark commented on SPARK-15869:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/14443

> HTTP 500 and NPE on streaming batch details page
> 
>
> Key: SPARK-15869
> URL: https://issues.apache.org/jira/browse/SPARK-15869
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> When I'm trying to show details of streaming batch I'm getting NPE.
> Sample link:
> http://127.0.0.1:4040/streaming/batch/?id=146555370
> Error:
> {code}
> HTTP ERROR 500
> Problem accessing /streaming/batch/. Reason:
> Server Error
> Caused by:
> java.lang.NullPointerException
>   at 
> scala.collection.convert.Wrappers$JCollectionWrapper.iterator(Wrappers.scala:59)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:320)
>   at scala.collection.AbstractTraversable.groupBy(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.ui.BatchPage.generateJobTable(BatchPage.scala:273)
>   at org.apache.spark.streaming.ui.BatchPage.render(BatchPage.scala:358)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:83)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
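
The trace above points at a null Java collection being wrapped and then iterated
during the groupBy in BatchPage.generateJobTable. As a hedged illustration of the
failure signature only (not the actual UI code), wrapping a null
java.util.Collection with JavaConverters produces the same NPE:

{code}
import scala.collection.JavaConverters._

// Hypothetical repro of the failure signature; BatchPage itself is not shown here.
val javaIds: java.util.Collection[Integer] = null   // a null collection coming from the Java side
val wrapped = javaIds.asScala                       // wrapping null does not fail yet
wrapped.groupBy(identity)                           // NPE in JCollectionWrapper.iterator, as in the trace
{code}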



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16792) Dataset containing a Case Class with a List type causes a CompileException (converting sequence to list)

2016-08-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-16792:
-
Component/s: (was: Spark Core)
 SQL

> Dataset containing a Case Class with a List type causes a CompileException 
> (converting sequence to list)
> 
>
> Key: SPARK-16792
> URL: https://issues.apache.org/jira/browse/SPARK-16792
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jamie Hutton
>Priority: Critical
>
> The issue occurs when we run a .map over a dataset containing a case class 
> with a List in it. A self-contained test case is below:
> case class TestCC(key: Int, letters: List[String]) //List causes the issue - 
> a Seq/Array works fine
> /*simple test data*/
> val ds1 = sc.makeRDD(Seq(
> (List("D")),
> (List("S","H")),
> (List("F","H")),
> (List("D","L","L"))
> )).map(x=>(x.length,x)).toDF("key","letters").as[TestCC]
> //This will fail
> val test1=ds1.map{_.key}
> test1.show
> Error: 
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 72, Column 70: No applicable constructor/method found 
> for actual parameters "int, scala.collection.Seq"; candidates are: 
> "TestCC(int, scala.collection.immutable.List)"
> It seems that the List is internally converted to a Seq, but it can't be 
> converted back.
> If you change the List[String] to Seq[String] or Array[String] the issue 
> doesn't appear.
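
For readability, here is the reporter's reproduction above reformatted as a
spark-shell snippet (assuming Spark 2.0.0 and the usual spark-shell `sc`/`spark`
bindings); it is the same code, not a new test case:

{code}
import spark.implicits._

case class TestCC(key: Int, letters: List[String]) // List triggers the issue; Seq/Array work fine

// simple test data
val ds1 = sc.makeRDD(Seq(
  List("D"),
  List("S", "H"),
  List("F", "H"),
  List("D", "L", "L")
)).map(x => (x.length, x)).toDF("key", "letters").as[TestCC]

// This fails at codegen time: the generated code passes a scala.collection.Seq
// to a constructor that expects scala.collection.immutable.List
val test1 = ds1.map(_.key)
test1.show()
{code}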



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14559) Netty RPC didn't check channel is active before sending message

2016-08-01 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402712#comment-15402712
 ] 

Shixiong Zhu commented on SPARK-14559:
--

[~WangTao] Could you check the AM process? Looks like it's down. If it's still 
alive, could you provide the thread dump, please?

> Netty RPC didn't check channel is active before sending message
> ---
>
> Key: SPARK-14559
> URL: https://issues.apache.org/jira/browse/SPARK-14559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1
> Environment: spark1.6.1 hadoop2.2.0 jdk1.8.0_65
>Reporter: cen yuhai
>
> I have a long-running service. After running for several hours, it threw 
> these exceptions. I found that before sending an RPC request by calling the 
> sendRpc method in TransportClient, there is no check of whether the channel 
> is still open or active.
> java.nio.channels.ClosedChannelException
>  4865 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 5635696155204230556 to 
> bigdata-arch-hdp407.bh.diditaxi.com/10.234.23.107:55197: java.nio.
>   channels.ClosedChannelException
>  4866 java.nio.channels.ClosedChannelException
>  4867 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 7319486003318455703 to 
> bigdata-arch-hdp1235.bh.diditaxi.com/10.168.145.239:36439: java.nio.
>   channels.ClosedChannelException
>  4868 java.nio.channels.ClosedChannelException
>  4869 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 9041854451893215954 to 
> bigdata-arch-hdp1398.bh.diditaxi.com/10.248.117.216:26801: java.nio.
>   channels.ClosedChannelException
>  4870 java.nio.channels.ClosedChannelException
>  4871 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 6046473497871624501 to 
> bigdata-arch-hdp948.bh.diditaxi.com/10.118.114.81:41903: java.nio.  
>   channels.ClosedChannelException
>  4872 java.nio.channels.ClosedChannelException
>  4873 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 9085605650438705047 to 
> bigdata-arch-hdp1126.bh.diditaxi.com/10.168.146.78:27023: java.nio.
>   channels.ClosedChannelException
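
A minimal sketch of the kind of pre-send guard the reporter is asking about,
written directly against the Netty Channel API; this is illustrative only and
is not the actual TransportClient code (which lives in Spark's network-common
module and may handle this differently):

{code}
import java.io.IOException
import io.netty.channel.Channel

// Hypothetical helper, not Spark code: refuse to send on a closed channel
// instead of letting the write fail later with ClosedChannelException.
def sendIfActive(channel: Channel, message: AnyRef): Unit = {
  if (channel == null || !channel.isActive) {
    throw new IOException("Channel is closed or inactive; refusing to send RPC")
  }
  channel.writeAndFlush(message)
}
{code}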



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16836) Hive date/time function error

2016-08-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16836:

Description: 
Previously available hive functions for date/time are not available in Spark 
2.0 (e.g. current_date, current_timestamp). These functions work in Spark 1.6.2 
with HiveContext.

Example (from spark-shell):
{noformat}
scala> spark.sql("select current_date")
org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given 
input columns: []; line 1 pos 7
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
  ... 48 elided

{noformat}


  was:
Previously available hive functions for date/time are not available in Spark 
2.0 (e.g. current_date, current_timestamp). These functions work in Spark 1.6.2 
with HiveContext.

Example (from spark-shell):

scala> spark.sql("select current_date")
org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given 
input columns: []; line 1 pos 7
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 

[jira] [Comment Edited] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-01 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402676#comment-15402676
 ] 

Charles Allen edited comment on SPARK-16798 at 8/1/16 7:30 PM:
---

Minor update. Due to library collisions I have to change how some of the 
tagging works internally. I'm cutting an internal-only (MMX) release of 
https://github.com/metamx/spark/commit/13650fc58e1fcf2cf2a26ba11c819185ae1acc1f 
with a new tag/version to prevent potential version conflicts in our 
infrastructure. I didn't want to mess with it over the weekend, so the new 
build is making its way through now.


was (Author: drcrallen):
Minor update. Due to library collisions I have to change around how some of the 
tagging works internally. I'm cutting an internal-only release of 
https://github.com/metamx/spark/commit/13650fc58e1fcf2cf2a26ba11c819185ae1acc1f 
with a new tag/version to prevent potential version conflicts in our 
infrastructure. Didn't want to mess with it over the weekend so new build is 
making its way through now.

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
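
The immediate failure is well defined even though the root cause in the
coalesce path is not settled in this thread: java.util.Random.nextInt rejects
non-positive bounds, so a zero or negative bound reaching the call at
RDD.scala:445 reproduces the message above. Hedged illustration:

{code}
// java.util.Random.nextInt(bound) throws for bound <= 0; a zero target
// partition count reaching the coalesce code path would produce exactly the
// error above (an assumption about the trigger, not confirmed in this thread).
new java.util.Random().nextInt(0)
// java.lang.IllegalArgumentException: bound must be positive
{code}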



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-01 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402676#comment-15402676
 ] 

Charles Allen commented on SPARK-16798:
---

Minor update. Due to library collisions I have to change around how some of the 
tagging works internally. I'm cutting an internal-only release of 
https://github.com/metamx/spark/commit/13650fc58e1fcf2cf2a26ba11c819185ae1acc1f 
with a new tag/version to prevent potential version conflicts in our 
infrastructure. Didn't want to mess with it over the weekend so new build is 
making its way through now.

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16836) Hive date/time function error

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16836:


Assignee: Apache Spark

> Hive date/time function error
> -
>
> Key: SPARK-16836
> URL: https://issues.apache.org/jira/browse/SPARK-16836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jesse Lord
>Assignee: Apache Spark
>Priority: Minor
>
> Previously available hive functions for date/time are not available in Spark 
> 2.0 (e.g. current_date, current_timestamp). These functions work in Spark 
> 1.6.2 with HiveContext.
> Example (from spark-shell):
> scala> spark.sql("select current_date")
> org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given 
> input columns: []; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   ... 48 elided
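
For context, only the bare ANSI keyword forms fail to resolve here; the
parenthesized function calls still went through the function registry in
Spark 2.0.0 and could serve as a stopgap. This reflects observed behavior at
the time, not the eventual fix:

{code}
// Workaround sketch (hedged): the parenthesized forms resolve in 2.0.0.
spark.sql("select current_date(), current_timestamp()").show()

// The DataFrame API equivalents are also available:
import org.apache.spark.sql.functions.{current_date, current_timestamp}
spark.range(1).select(current_date(), current_timestamp()).show()
{code}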



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16836) Hive date/time function error

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402635#comment-15402635
 ] 

Apache Spark commented on SPARK-16836:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/14442

> Hive date/time function error
> -
>
> Key: SPARK-16836
> URL: https://issues.apache.org/jira/browse/SPARK-16836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jesse Lord
>Priority: Minor
>
> Previously available hive functions for date/time are not available in Spark 
> 2.0 (e.g. current_date, current_timestamp). These functions work in Spark 
> 1.6.2 with HiveContext.
> Example (from spark-shell):
> scala> spark.sql("select current_date")
> org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given 
> input columns: []; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   ... 48 elided



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16836) Hive date/time function error

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16836:


Assignee: (was: Apache Spark)

> Hive date/time function error
> -
>
> Key: SPARK-16836
> URL: https://issues.apache.org/jira/browse/SPARK-16836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jesse Lord
>Priority: Minor
>
> Previously available hive functions for date/time are not available in Spark 
> 2.0 (e.g. current_date, current_timestamp). These functions work in Spark 
> 1.6.2 with HiveContext.
> Example (from spark-shell):
> scala> spark.sql("select current_date")
> org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given 
> input columns: []; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   ... 48 elided



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors

2016-08-01 Thread Tom Magrino (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Magrino updated SPARK-16837:

Description: 
Right now, the constructors for the TimeWindow expression in Catalyst 
incorrectly use the windowDuration in place of the slideDuration. This will 
cause incorrect windowing semantics after time window expressions are analyzed 
by Catalyst.

Relevant code is here: 
https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54

  was:
Right now, the constructors for the TimeWindow expression in Catalyst 
incorrectly uses the windowDuration in place of the slideDuration.  This will 
cause incorrect windowing semantics the after time window expressions are 
analyzed by Catalyst.

Relevant code is here: 
https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54


> TimeWindow incorrectly drops slideDuration in constructors
> --
>
> Key: SPARK-16837
> URL: https://issues.apache.org/jira/browse/SPARK-16837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tom Magrino
>
> Right now, the constructors for the TimeWindow expression in Catalyst 
> incorrectly use the windowDuration in place of the slideDuration. This will 
> cause incorrect windowing semantics after time window expressions are 
> analyzed by Catalyst.
> Relevant code is here: 
> https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54
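
To make the described bug concrete, here is an illustrative sketch (a
simplified shape, not the verbatim TimeWindow.scala source) of the constructor
pattern in question; the slide argument is what gets silently replaced by the
window duration:

{code}
// Illustrative only -- simplified shape of the bug described above.
case class Window(windowDuration: Long, slideDuration: Long, startTime: Long)

object Window {
  // Buggy shape: the second field is built from `window` instead of `slide`,
  // so the slide duration is silently dropped.
  def buggy(window: Long, slide: Long, start: Long): Window =
    Window(window, window, start) // should pass `slide` here

  // Intended shape.
  def fixed(window: Long, slide: Long, start: Long): Window =
    Window(window, slide, start)
}
{code}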



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16837:


Assignee: (was: Apache Spark)

> TimeWindow incorrectly drops slideDuration in constructors
> --
>
> Key: SPARK-16837
> URL: https://issues.apache.org/jira/browse/SPARK-16837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tom Magrino
>
> Right now, the constructors for the TimeWindow expression in Catalyst 
> incorrectly use the windowDuration in place of the slideDuration. This will 
> cause incorrect windowing semantics after time window expressions are 
> analyzed by Catalyst.
> Relevant code is here: 
> https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors

2016-08-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402600#comment-15402600
 ] 

Apache Spark commented on SPARK-16837:
--

User 'tmagrino' has created a pull request for this issue:
https://github.com/apache/spark/pull/14441

> TimeWindow incorrectly drops slideDuration in constructors
> --
>
> Key: SPARK-16837
> URL: https://issues.apache.org/jira/browse/SPARK-16837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tom Magrino
>
> Right now, the constructors for the TimeWindow expression in Catalyst 
> incorrectly use the windowDuration in place of the slideDuration. This will 
> cause incorrect windowing semantics after time window expressions are 
> analyzed by Catalyst.
> Relevant code is here: 
> https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors

2016-08-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16837:


Assignee: Apache Spark

> TimeWindow incorrectly drops slideDuration in constructors
> --
>
> Key: SPARK-16837
> URL: https://issues.apache.org/jira/browse/SPARK-16837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tom Magrino
>Assignee: Apache Spark
>
> Right now, the constructors for the TimeWindow expression in Catalyst 
> incorrectly use the windowDuration in place of the slideDuration. This will 
> cause incorrect windowing semantics after time window expressions are 
> analyzed by Catalyst.
> Relevant code is here: 
> https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16768) pyspark calls incorrect version of logistic regression

2016-08-01 Thread Colin Beckingham (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402599#comment-15402599
 ] 

Colin Beckingham commented on SPARK-16768:
--

Sean said "If you mean the calling stack trace..." - well the information came 
from the PySpark Spark Jobs browser page, from the "Completed Jobs" in the 
"Description" column. In 1.6.2 the page is wonderfully reassuringly informing 
me that it is using L-BFGS, and in 2.1 it seems to be doing something else. 
Maybe it is not, it is doing exactly as required, in which case I stand 
corrected and now know how the 2.1 page should be interpreted differently. No 
problem.

> pyspark calls incorrect version of logistic regression
> --
>
> Key: SPARK-16768
> URL: https://issues.apache.org/jira/browse/SPARK-16768
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
> Environment: Linux openSUSE Leap 42.1 Gnome
>Reporter: Colin Beckingham
>
> A PySpark call with Spark 1.6.2 to "LogisticRegressionWithLBFGS.train()" runs 
> "treeAggregate at LBFGS.scala:218", but the same command in PySpark with 
> Spark 2.1 runs "treeAggregate at LogisticRegression.scala:1092". This 
> non-optimized version is much slower and produces a different answer from 
> LBFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16775) Reduce internal warnings from deprecated accumulator API

2016-08-01 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402592#comment-15402592
 ] 

holdenk commented on SPARK-16775:
-

Yes, my plan is to replace it with the new API in all of the places where I 
can - but in the places where that isn't reasonable (like places where the old 
accumulator API depends on itself), do some sleight of hand with private 
internal methods to make the warnings go away.

> Reduce internal warnings from deprecated accumulator API
> 
>
> Key: SPARK-16775
> URL: https://issues.apache.org/jira/browse/SPARK-16775
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Spark Core, SQL
>Reporter: holdenk
>
> Deprecating the old accumulator API added a large number of warnings - many 
> of these could be fixed with a bit of refactoring to offer a non-deprecated 
> internal class while still preserving the external deprecation warnings.
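
For readers following along, a hedged sketch of the migration being discussed,
assuming an existing SparkContext `sc`: the deprecated accumulator call next to
its AccumulatorV2-based replacement introduced in 2.0.

{code}
// Deprecated API (the source of the warnings being discussed):
val oldAcc = sc.accumulator(0, "old-style counter")

// AccumulatorV2-based replacement in Spark 2.0:
val newAcc = sc.longAccumulator("new-style counter")
sc.parallelize(1 to 100).foreach(x => newAcc.add(x))
println(newAcc.value)   // 5050
{code}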



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


