[jira] [Assigned] (SPARK-16849) Improve subquery execution by deduplicating the subqueries with the same results
[ https://issues.apache.org/jira/browse/SPARK-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16849:
Assignee: (was: Apache Spark)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16849) Improve subquery execution by deduplicating the subqueries with the same results
[ https://issues.apache.org/jira/browse/SPARK-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16849:
Assignee: Apache Spark
[jira] [Commented] (SPARK-16849) Improve subquery execution by deduplicating the subqueries with the same results
[ https://issues.apache.org/jira/browse/SPARK-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403394#comment-15403394 ]

Apache Spark commented on SPARK-16849:
User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/14452
[jira] [Created] (SPARK-16849) Improve subquery execution by deduplicating the subqueries with the same results
Liang-Chi Hsieh created SPARK-16849:
---
Summary: Improve subquery execution by deduplicating the subqueries with the same results
Key: SPARK-16849
URL: https://issues.apache.org/jira/browse/SPARK-16849
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Liang-Chi Hsieh

Subqueries in Spark SQL are executed even when they have the same physical plan and produce the same results. We should be able to deduplicate subqueries that are referenced multiple times within a query.
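The proposed deduplication can be sketched outside Spark as a small stand-alone example (illustrative Python, not Spark code; all names are hypothetical): key each subquery by a canonicalized form of its plan and execute each distinct plan only once, reusing the cached result for duplicates.

```python
# Illustrative sketch of subquery deduplication (not Spark's implementation).
# Plans are represented as SQL strings; whitespace/case normalization stands in
# for Spark's real plan canonicalization.
def execute_subqueries(subqueries, run_plan):
    """Return one result per subquery, in order, running each distinct plan once."""
    cache = {}
    results = []
    for plan in subqueries:
        key = " ".join(plan.lower().split())  # canonicalize: case + whitespace
        if key not in cache:
            cache[key] = run_plan(plan)       # execute only on first sight
        results.append(cache[key])
    return results

calls = []
def fake_run(plan):
    """Pretend executor that records how many plans actually ran."""
    calls.append(plan)
    return 42  # pretend scalar subquery result

# Two textually different but equivalent subqueries: executed once, used twice.
out = execute_subqueries(["SELECT max(x) FROM t", "select  max(x)  from t"], fake_run)
```

Here both subqueries map to the same canonical key, so `fake_run` is invoked a single time even though two results are returned.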
[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403380#comment-15403380 ]

Xiao Li commented on SPARK-16842:
For each table, we just need to issue one query, and that query returns an empty table. Normally this is very cheap for most DBMSs.

> Concern about disallowing user-given schema for Parquet and ORC
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Hyukjin Kwon
>
> If my understanding is correct, a user-given schema that differs from the
> inferred schema is handled differently by each datasource:
> - For JSON and CSV, it is generally permissive (for example, compatibility
> among numeric types is allowed).
> - For ORC and Parquet, it is generally strict about types, so compatibility
> is not allowed (except for a few cases, e.g. for Parquet,
> https://github.com/apache/spark/pull/14272 and
> https://github.com/apache/spark/pull/14278).
> - For Text, only {{StringType}} is supported.
> - For JDBC, a user-given schema is not accepted, since the source does not
> implement {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} and
> {{TimestampType}} for JSON and CSV, which arguably allow a permissive schema.
> In short, JSON and CSV do not have complete schema information written in
> the data, whereas ORC and Parquet do. So we might have to disallow a
> user-given schema for Parquet and ORC; in practice, we can almost never give
> a different schema for ORC and Parquet anyway, if my understanding is
> correct.
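The "one query returning an empty table" trick can be demonstrated with any DBMS: a zero-row query still carries column metadata. A minimal sketch using Python's built-in sqlite3 (the table and columns are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")

# WHERE 1=0 returns no rows, but the cursor still exposes column metadata,
# so resolving a schema this way is cheap for most DBMSs.
cur = conn.execute("SELECT * FROM people WHERE 1=0")
columns = [d[0] for d in cur.description]
rows = cur.fetchall()
```

The query does no row I/O, yet `columns` holds the full column list, which is the metadata a reader needs to build a schema.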
[jira] [Comment Edited] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403376#comment-15403376 ]

Hyukjin Kwon edited comment on SPARK-16842 at 8/2/16 5:21 AM:
hm.. don't we make another connection and then run a query to fetch metadata for reading the schema (a separate query from the one fetching data)? This might be as much overhead as touching a file.

was (Author: hyukjin.kwon): hm.. don't we make a connection and then run a query to fetch metadata for reading schema? This might be an overhead as much as touching a file.
[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403376#comment-15403376 ]

Hyukjin Kwon commented on SPARK-16842:
hm.. don't we make a connection and then run a query to fetch metadata for reading the schema? This might be as much overhead as touching a file.
[jira] [Commented] (SPARK-16848) Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema
[ https://issues.apache.org/jira/browse/SPARK-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403364#comment-15403364 ]

Apache Spark commented on SPARK-16848:
User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/14451
[jira] [Assigned] (SPARK-16848) Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema
[ https://issues.apache.org/jira/browse/SPARK-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16848:
Assignee: Apache Spark
[jira] [Assigned] (SPARK-16848) Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema
[ https://issues.apache.org/jira/browse/SPARK-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16848:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403358#comment-15403358 ]

Xiao Li commented on SPARK-16842:
I heard of one case: at a big Internet company, the use case could generate many small Parquet files, and they complained that performance was slow.
[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403354#comment-15403354 ]

Xiao Li commented on SPARK-16842:
The overhead of schema parsing in JDBC is small, right?
[jira] [Created] (SPARK-16848) Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema
Hyukjin Kwon created SPARK-16848:
Summary: Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema
Key: SPARK-16848
URL: https://issues.apache.org/jira/browse/SPARK-16848
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Trivial

Currently,
{code}
spark.read.schema(StructType(Seq())).jdbc(...).show()
{code}
does not throw an exception, whereas
{code}
spark.read.schema(StructType(Seq())).option(...).format("jdbc").load().show()
{code}
does, as below:
{code}
jdbc does not allow user-specified schemas.;
org.apache.spark.sql.AnalysisException: jdbc does not allow user-specified schemas.;
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at org.apache.spark.sql.jdbc.JDBCSuite$$anonfun$17.apply$mcV$sp(JDBCSuite.scala:351)
{code}
It would make sense to throw the exception consistently when the user specifies a schema.
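One way to get the consistency the issue asks for is to route every reader entry point through a single validation step. The sketch below is illustrative Python, not Spark's actual implementation; the function names (`resolve_relation`, `jdbc_shortcut`, `format_load`) are hypothetical stand-ins for the two code paths quoted above.

```python
# Illustrative sketch: both reader entry points share one validation choke point,
# so they fail (or succeed) identically for a user-specified schema.
class AnalysisException(Exception):
    pass

def resolve_relation(source, user_schema):
    """Single choke point: validates the user-specified schema for every path."""
    if source == "jdbc" and user_schema is not None:
        raise AnalysisException(f"{source} does not allow user-specified schemas.")
    return {"source": source, "schema": user_schema}

def jdbc_shortcut(user_schema):
    # Stands in for spark.read.schema(...).jdbc(...)
    return resolve_relation("jdbc", user_schema)

def format_load(fmt, user_schema):
    # Stands in for spark.read.schema(...).format("jdbc").load()
    return resolve_relation(fmt, user_schema)
```

Because both helpers delegate to `resolve_relation`, there is no way for one entry point to silently accept a schema that the other rejects.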
[jira] [Assigned] (SPARK-16847) Prevent to potentially read corrupt statstics on binary in Parquet via VectorizedReader
[ https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16847:
Assignee: Apache Spark
[jira] [Assigned] (SPARK-16847) Prevent to potentially read corrupt statstics on binary in Parquet via VectorizedReader
[ https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16847:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-16847) Prevent to potentially read corrupt statstics on binary in Parquet via VectorizedReader
[ https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403337#comment-15403337 ]

Apache Spark commented on SPARK-16847:
User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/14450
[jira] [Assigned] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16843:
Assignee: (was: Apache Spark)

> Select features according to a percentile of the highest scores of ChiSqSelector
>
> Key: SPARK-16843
> URL: https://issues.apache.org/jira/browse/SPARK-16843
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Affects Versions: 2.1.0
> Reporter: Peng Meng
> Priority: Minor
> Fix For: 2.1.0
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> It would be handy to add a percentile Param to ChiSqSelector, as in the
> scikit-learn one:
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html
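The percentile variant mirrors scikit-learn's SelectPercentile: instead of keeping a fixed number of features, keep the top p% of features ranked by score. A plain-Python sketch of the selection rule (the scores are made-up chi-square statistics, and `select_percentile` is a hypothetical helper, not a Spark or scikit-learn API):

```python
import math

# Illustrative sketch of percentile-based feature selection: keep the
# top `percentile` percent of features, ranked by their selector scores.
def select_percentile(scores, percentile):
    """Return the (sorted) indices of the highest-scoring features."""
    n_keep = max(1, math.ceil(len(scores) * percentile / 100.0))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

scores = [0.1, 8.2, 3.5, 0.7, 5.9]   # hypothetical chi-square scores, one per feature
top40 = select_percentile(scores, 40)  # keep the best 40% = 2 of 5 features
```

Rounding up with `ceil` (and the `max(1, ...)` floor) guarantees at least one feature survives even for very small percentiles, which is the usual convention for this kind of Param.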
[jira] [Commented] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1540#comment-1540 ]

Apache Spark commented on SPARK-16843:
User 'mpjlu' has created a pull request for this issue: https://github.com/apache/spark/pull/14449
[jira] [Assigned] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16843:
Assignee: Apache Spark
[jira] [Updated] (SPARK-16847) Prevent to potentially read corrupt statstics on binary in Parquet via VectorizedReader
[ https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-16847:
Summary: Prevent to potentially read corrupt statstics on binary in Parquet via VectorizedReader (was: Do not read Parquet corrupt statstics on binary via VectorizedReader when it is corrupt)
[jira] [Updated] (SPARK-16847) Do not read Parquet corrupt statstics on binary via VectorizedReader when it is corrupt
[ https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-16847:
Summary: Do not read Parquet corrupt statstics on binary via VectorizedReader when it is corrupt (was: Do not read Parquet corrupt statstics on binary )
[jira] [Created] (SPARK-16847) Do not read Parquet corrupt statstics on binary
Hyukjin Kwon created SPARK-16847:
Summary: Do not read Parquet corrupt statstics on binary
Key: SPARK-16847
URL: https://issues.apache.org/jira/browse/SPARK-16847
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Minor

It is still possible to read corrupt Parquet statistics. This problem was found in PARQUET-251, and we previously disabled filter pushdown on binary columns in Spark. We re-enabled it after upgrading Parquet, but there is a potential incompatibility with Parquet files written by older Spark versions.
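Why the statistics are "corrupt" can be illustrated concretely. PARQUET-251 concerns binary min/max statistics computed with a signed byte ordering instead of the correct unsigned lexicographic one; a filter that trusts such stats can wrongly prune a row group. The sketch below (illustrative Python, not Parquet or Spark code) emulates the signed ordering and shows the resulting bad pruning decision:

```python
# Illustrative sketch of the PARQUET-251 problem: binary min/max stats
# written under a *signed* byte ordering mislead an (unsigned) pushdown filter.
def signed_min_max(values):
    """Buggy stats writer: compares bytes as signed values.
    Emulated by flipping the top bit before Python's unsigned bytes compare."""
    key = lambda b: bytes(x ^ 0x80 for x in b)
    return min(values, key=key), max(values, key=key)

values = [b"a", b"\xe9"]           # 0x61 and 0xe9; unsigned order: b"a" < b"\xe9"
lo, hi = signed_min_max(values)    # signed order flips them: max becomes b"a"

# A pushdown filter `col > b"a"` trusts the recorded max and prunes the whole
# row group -- silently dropping the matching b"\xe9" row. Hence the fix:
# refuse to use such stats rather than read them.
would_prune = not (hi > b"a")
```

Under the correct unsigned ordering the max is `b"\xe9"`, the filter would not prune, and the row would be returned; the mismatch between writer ordering and reader ordering is exactly the incompatibility the issue describes.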
[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403315#comment-15403315 ]

Hyukjin Kwon commented on SPARK-16842:
If we don't support schema compatibility but should still support a user-specified schema (for example, by throwing an exception during execution if the schema is wrong), we could enable JDBC to accept a schema as well, because there is an overhead to parsing the schema for JDBC too.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403308#comment-15403308 ] Hyukjin Kwon edited comment on SPARK-16842 at 8/2/16 3:56 AM: -- Thanks for your feedback. Yea, but I think it might not be very heavy time consuming (yea but still it is) since we will touch a single file (for Parquet and ORC) in most cases whereas JSON and CSV needs a entire scan as you already know. Also, this overhead in case of Parquet and ORC would be almost constant regardless of number of files or size. I am personally supportive to allowing schema compatibility (and I did open a PR) but I saw some opinions and comments which *I assume* imply not supporting schema compatibility. In that case, this one might be an option. was (Author: hyukjin.kwon): Thanks for your feedback. Yea, but I think it might not be very heavy time consuming (yea but still it is) since we will touch a single file (for Parquet and ORC) in most cases whereas JSON and CSV needs a entire scan as you already know. Also, this overhead in case of Parquet and ORC would be almost constant regardless of number of files or size. I am personally supportive to allowing schema compatibility (and I did open a PR) but I saw some opinions and comments which *I assume* infers not supporting schema compatibility. In that case, this one might be an option. > Concern about disallowing user-given schema for Parquet and ORC > --- > > Key: SPARK-16842 > URL: https://issues.apache.org/jira/browse/SPARK-16842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > If my understanding is correct, > If the user-given schema is different with the inferred schema, it is handled > differently for each datasource. > - For JSON and CSV > it is kind of permissive generally (for example, compatibility among > numeric types). > - For ORC and Parquet > Generally it is strict to types. 
So they don't allow the compatibility > (except for very few cases, e.g. for Parquet, > https://github.com/apache/spark/pull/14272 and > https://github.com/apache/spark/pull/14278) > - For Text > it only supports {{StringType}}. > - For JDBC > it does not take user-given schema since it does not implement > {{SchemaRelationProvider}}. > By allowing the user-given schema, we can use some types such as {{DateType}} > and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably > permissive schema. > To cut this short, JSON and CSV do not have the complete schema information > written in the data whereas Orc and Parquet do. > So, we might have to just disallow giving user-given schema for Parquet and > Orc. Actually, we can't give a different schema for Orc and Parquet almost at > all times if my understanding it correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403308#comment-15403308 ] Hyukjin Kwon edited comment on SPARK-16842 at 8/2/16 3:56 AM: -- Thanks for your feedback. Yea, but I think it might not be very heavy time consuming (yea but still it is) since we will touch a single file (for Parquet and ORC) in most cases whereas JSON and CSV needs a entire scan as you already know. Also, this overhead in case of Parquet and ORC would be almost constant regardless of number of files or size. I am personally supportive to allowing schema compatibility (and I did open a PR) but I saw some opinions and comments which *I assume* infers not supporting schema compatibility. In that case, this one might be an option. was (Author: hyukjin.kwon): Thanks for your feedback. Yea, but I think it might not be very heavy time consuming (yea but still it is) since we will touch a single file (for Parquet and ORC) in most cases whereas JSON and CSV needs a entire scan as you already know. So, this overhead in case of Parquet and ORC would be almost constant regardless of number of files or size. I am personally supportive to allowing schema compatibility (and I did open a PR) but I saw some opinions and comments which *I assume* infers not supporting schema compatibility. In that case, this one might be an option. > Concern about disallowing user-given schema for Parquet and ORC > --- > > Key: SPARK-16842 > URL: https://issues.apache.org/jira/browse/SPARK-16842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > If my understanding is correct, > If the user-given schema is different with the inferred schema, it is handled > differently for each datasource. > - For JSON and CSV > it is kind of permissive generally (for example, compatibility among > numeric types). > - For ORC and Parquet > Generally it is strict to types. 
So they don't allow the compatibility > (except for very few cases, e.g. for Parquet, > https://github.com/apache/spark/pull/14272 and > https://github.com/apache/spark/pull/14278) > - For Text > it only supports {{StringType}}. > - For JDBC > it does not take user-given schema since it does not implement > {{SchemaRelationProvider}}. > By allowing the user-given schema, we can use some types such as {{DateType}} > and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably > permissive schema. > To cut this short, JSON and CSV do not have the complete schema information > written in the data whereas Orc and Parquet do. > So, we might have to just disallow giving user-given schema for Parquet and > Orc. Actually, we can't give a different schema for Orc and Parquet almost at > all times if my understanding it correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403308#comment-15403308 ] Hyukjin Kwon commented on SPARK-16842: -- Thanks for your feedback. Yea, but I think it might not be very heavy time consuming (yea but still it is) since we will touch a single file (for Parquet and ORC) in most cases whereas JSON and CSV needs a entire scan as you already know. So, this overhead in case of Parquet and ORC would be almost constant regardless of number of files or size. I am personally supportive to allowing schema compatibility (and I did open a PR) but I saw some opinions and comments which *I assume* infers not supporting schema compatibility. In that case, this one might be an option. > Concern about disallowing user-given schema for Parquet and ORC > --- > > Key: SPARK-16842 > URL: https://issues.apache.org/jira/browse/SPARK-16842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > If my understanding is correct, > If the user-given schema is different with the inferred schema, it is handled > differently for each datasource. > - For JSON and CSV > it is kind of permissive generally (for example, compatibility among > numeric types). > - For ORC and Parquet > Generally it is strict to types. So they don't allow the compatibility > (except for very few cases, e.g. for Parquet, > https://github.com/apache/spark/pull/14272 and > https://github.com/apache/spark/pull/14278) > - For Text > it only supports {{StringType}}. > - For JDBC > it does not take user-given schema since it does not implement > {{SchemaRelationProvider}}. > By allowing the user-given schema, we can use some types such as {{DateType}} > and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably > permissive schema. > To cut this short, JSON and CSV do not have the complete schema information > written in the data whereas Orc and Parquet do. 
> So, we might have to just disallow giving user-given schema for Parquet and > Orc. Actually, we can't give a different schema for Orc and Parquet almost at > all times if my understanding it correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-16842: - Comment: was deleted (was: Thanks for your feedback. Yea, but I think it might not be very heavy time consuming (yea but still it is) since we will touch a single file (for Parquet and ORC) in most cases whereas JSON and CSV needs a entire scan as you already know. So, this overhead in case of Parquet and ORC would be almost constant regardless of number of files or size. I am personally supportive to allowing schema compatibility (and I did open a PR) but I saw some opinions and comments which *I assume* infers not supporting schema compatibility. In that case, this one might be an option. ) > Concern about disallowing user-given schema for Parquet and ORC > --- > > Key: SPARK-16842 > URL: https://issues.apache.org/jira/browse/SPARK-16842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > If my understanding is correct, > If the user-given schema is different with the inferred schema, it is handled > differently for each datasource. > - For JSON and CSV > it is kind of permissive generally (for example, compatibility among > numeric types). > - For ORC and Parquet > Generally it is strict to types. So they don't allow the compatibility > (except for very few cases, e.g. for Parquet, > https://github.com/apache/spark/pull/14272 and > https://github.com/apache/spark/pull/14278) > - For Text > it only supports {{StringType}}. > - For JDBC > it does not take user-given schema since it does not implement > {{SchemaRelationProvider}}. > By allowing the user-given schema, we can use some types such as {{DateType}} > and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably > permissive schema. > To cut this short, JSON and CSV do not have the complete schema information > written in the data whereas Orc and Parquet do. 
> So, we might have to just disallow giving user-given schema for Parquet and > Orc. Actually, we can't give a different schema for Orc and Parquet almost at > all times if my understanding it correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403309#comment-15403309 ] Hyukjin Kwon commented on SPARK-16842: -- Thanks for your feedback. Yea, but I think it might not be very heavy time consuming (yea but still it is) since we will touch a single file (for Parquet and ORC) in most cases whereas JSON and CSV needs a entire scan as you already know. So, this overhead in case of Parquet and ORC would be almost constant regardless of number of files or size. I am personally supportive to allowing schema compatibility (and I did open a PR) but I saw some opinions and comments which *I assume* infers not supporting schema compatibility. In that case, this one might be an option. > Concern about disallowing user-given schema for Parquet and ORC > --- > > Key: SPARK-16842 > URL: https://issues.apache.org/jira/browse/SPARK-16842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > If my understanding is correct, > If the user-given schema is different with the inferred schema, it is handled > differently for each datasource. > - For JSON and CSV > it is kind of permissive generally (for example, compatibility among > numeric types). > - For ORC and Parquet > Generally it is strict to types. So they don't allow the compatibility > (except for very few cases, e.g. for Parquet, > https://github.com/apache/spark/pull/14272 and > https://github.com/apache/spark/pull/14278) > - For Text > it only supports {{StringType}}. > - For JDBC > it does not take user-given schema since it does not implement > {{SchemaRelationProvider}}. > By allowing the user-given schema, we can use some types such as {{DateType}} > and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably > permissive schema. > To cut this short, JSON and CSV do not have the complete schema information > written in the data whereas Orc and Parquet do. 
> So, we might have to just disallow giving user-given schema for Parquet and > Orc. Actually, we can't give a different schema for Orc and Parquet almost at > all times if my understanding it correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15939) Clarify ml.linalg usage
[ https://issues.apache.org/jira/browse/SPARK-15939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng closed SPARK-15939. Resolution: Not A Problem > Clarify ml.linalg usage > --- > > Key: SPARK-15939 > URL: https://issues.apache.org/jira/browse/SPARK-15939 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: zhengruifeng >Priority: Trivial > > 1, update comments in {{pyspark.ml}} that it use {{ml.linalg}} not > {{mllib.linalg}} > 2, rename {{MLlibTestCase}} to {{MLTestCase}} in {{ml.tests.py}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-16843: -- Fix Version/s: (was: 2.0.1) 2.1.0 > Select features according to a percentile of the highest scores of > ChiSqSelector > > > Key: SPARK-16843 > URL: https://issues.apache.org/jira/browse/SPARK-16843 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Peng Meng >Priority: Minor > Fix For: 2.1.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > It would be handy to add a percentile Param to ChiSqSelector, as in the > scikit-learn one: > http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-16843: -- Target Version/s: (was: 2.0.1) > Select features according to a percentile of the highest scores of > ChiSqSelector > > > Key: SPARK-16843 > URL: https://issues.apache.org/jira/browse/SPARK-16843 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Peng Meng > Fix For: 2.0.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > It would be handy to add a percentile Param to ChiSqSelector, as in the > scikit-learn one: > http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-16843: -- Priority: Minor (was: Major) > Select features according to a percentile of the highest scores of > ChiSqSelector > > > Key: SPARK-16843 > URL: https://issues.apache.org/jira/browse/SPARK-16843 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Peng Meng >Priority: Minor > Fix For: 2.0.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > It would be handy to add a percentile Param to ChiSqSelector, as in the > scikit-learn one: > http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-16843: -- Affects Version/s: (was: 2.0.0) 2.1.0 > Select features according to a percentile of the highest scores of > ChiSqSelector > > > Key: SPARK-16843 > URL: https://issues.apache.org/jira/browse/SPARK-16843 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Peng Meng >Priority: Minor > Fix For: 2.0.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > It would be handy to add a percentile Param to ChiSqSelector, as in the > scikit-learn one: > http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16846) read.csv() option: "inferSchema" doesn't work
hejie created SPARK-16846: - Summary: read.csv() option: "inferSchema" doesn't work Key: SPARK-16846 URL: https://issues.apache.org/jira/browse/SPARK-16846 Project: Spark Issue Type: Bug Reporter: hejie I use the code below to read a file and get a dataframe. When the column number is only 20, the inferSchema parameter works well. But when the number is up to 400, it doesn't work, and I have to specify the schema manually. The code is: val df = spark.read.schema(schema).options(Map("header"->"true","quote"->",","inferSchema"->"true")).csv("/Users/ss/Documents/traindata/traindataAllNumber.csv") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
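A side note on the snippet above: it passes both an explicit schema via {{.schema(schema)}} and {{inferSchema}}; when a schema is supplied, the inference option is typically ignored. As for why inference cost grows with column count, a minimal sketch in plain Python (an illustration of the general technique, not Spark's implementation) is:

```python
# Hypothetical sketch of CSV type inference (not Spark's actual code):
# every row is scanned and each column's type is promoted as needed,
# so cost grows with both row count and column count.
import csv
import io

def infer_type(value, current):
    # Promotion order: int -> float -> string; never demote a column.
    order = {"int": 0, "float": 1, "string": 2}
    try:
        int(value)
        inferred = "int"
    except ValueError:
        try:
            float(value)
            inferred = "float"
        except ValueError:
            inferred = "string"
    return inferred if order[inferred] > order[current] else current

def infer_schema(csv_text):
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    types = ["int"] * len(header)
    for row in reader:
        for i, value in enumerate(row):
            types[i] = infer_type(value, types[i])
    return dict(zip(header, types))
```

With 400 columns, every cell of every row goes through this promotion loop, which is why wide files make inference noticeably heavier than narrow ones.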
[jira] [Updated] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions
[ https://issues.apache.org/jira/browse/SPARK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16818: Fix Version/s: 2.0.1 > Exchange reuse incorrectly reuses scans over different sets of partitions > - > > Key: SPARK-16818 > URL: https://issues.apache.org/jira/browse/SPARK-16818 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > This happens because the file scan operator does not take into account > partition pruning in its implementation of `sameResult()`. As a result, > executions may be incorrect on self-joins over the same base file relation. > Here's a minimal test case to reproduce: > {code} > spark.conf.set("spark.sql.exchange.reuse", true) // defaults to true in > 2.0 > withTempPath { path => > val tempDir = path.getCanonicalPath > spark.range(10) > .selectExpr("id % 2 as a", "id % 3 as b", "id as c") > .write > .partitionBy("a") > .parquet(tempDir) > val df = spark.read.parquet(tempDir) > val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum") > val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum") > checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, > 10, 5) :: Nil) > {code} > When exchange reuse is on, the result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6| 6| > | 1| 4| 4| > | 2|10|10| > +---+--+--+ > {code} > The correct result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6|12| > | 1| 4| 8| > | 2|10| 5| > +---+--+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
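The root cause described above can be modeled without Spark. A toy sketch (hypothetical class and method names, not Spark's internals) of why a scan's identity in `sameResult()` must include the pruned partitions:

```python
# Illustrative model of the bug: if a file scan's equality ignores which
# partitions survived pruning, two scans over different partitions compare
# equal and one exchange is incorrectly reused for both.
from dataclasses import dataclass, field
from typing import FrozenSet

@dataclass(frozen=True)
class FileScan:
    relation: str
    partitions: FrozenSet[str] = field(default_factory=frozenset)

    def same_result_buggy(self, other):
        # Pre-fix behavior: partition pruning is not part of the identity.
        return self.relation == other.relation

    def same_result_fixed(self, other):
        # Post-fix behavior: the selected partitions are compared too.
        return (self.relation == other.relation
                and self.partitions == other.partitions)

scan1 = FileScan("t", frozenset({"a=0"}))  # df.where("a = 0")
scan2 = FileScan("t", frozenset({"a=1"}))  # df.where("a = 1")
```

Under the buggy comparison the two scans are interchangeable, which matches the reported symptom: both join sides return the `a = 0` aggregates.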
[jira] [Commented] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403262#comment-15403262 ] Sylvain Zimmer commented on SPARK-16826: [~srowen] what about this? https://github.com/sylvinus/spark/commit/98119a08368b1cd1faf3f25a32910ad6717c5c02 The tests seem to pass and I don't think it uses the problematic code paths in java.net.URL (except for getFile, but that could probably be fixed easily) > java.util.Hashtable limits the throughput of PARSE_URL() > > > Key: SPARK-16826 > URL: https://issues.apache.org/jira/browse/SPARK-16826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello! > I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of > {{parse_url(url, "host")}} in Spark SQL. > Unfortunately it seems that there is an internal thread-safe cache in there, > and the instances end up being 90% idle. > When I view the thread dump for my executors, most of the executor threads > are "BLOCKED", in that state: > {code} > java.util.Hashtable.get(Hashtable.java:362) > java.net.URL.getURLStreamHandler(URL.java:1135) > java.net.URL.(URL.java:599) > java.net.URL.(URL.java:490) > java.net.URL.(URL.java:439) > org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) > org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) > org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) > 
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > org.apache.spark.scheduler.Task.run(Task.scala:85) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} > However, when I switch from 1 executor with 36 cores to 9 executors with 4 > cores, throughput is almost 10x higher and the CPUs are back at ~100% use. > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
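The thread dump above shows every `java.net.URL` constructor blocking on `Hashtable.get` (all `java.util.Hashtable` methods are synchronized), because the stream-handler cache is shared across threads. For illustration only (the linked commit is the actual candidate fix, in Scala), host extraction with a plain parser involves no globally synchronized state:

```python
# Illustration of the idea behind the fix: extract the host with a plain
# string parser instead of constructing a full URL object, so no shared
# synchronized cache (like java.net.URL's stream-handler table) is touched.
from urllib.parse import urlsplit

def parse_url_host(url):
    # Returns the lowercased host part, or None if there is no network location.
    return urlsplit(url).hostname

host = parse_url_host("https://spark.apache.org/docs/latest/")
```

Each call is independent, so 36 executor cores can run it concurrently without the "BLOCKED" states seen in the dump.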
[jira] [Commented] (SPARK-16579) Add a spark install function
[ https://issues.apache.org/jira/browse/SPARK-16579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403256#comment-15403256 ] Apache Spark commented on SPARK-16579: -- User 'junyangq' has created a pull request for this issue: https://github.com/apache/spark/pull/14448 > Add a spark install function > > > Key: SPARK-16579 > URL: https://issues.apache.org/jira/browse/SPARK-16579 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > As described in the design doc we need to introduce a function to install > Spark in case the user directly downloads SparkR from CRAN. > To do that we can introduce an install_spark function that takes the > following arguments > {code} > hadoop_version > url_to_use # defaults to apache > local_dir # defaults to a cache dir > {code} > Furthermore, I think we can automatically run this from sparkR.init if we > find that the Spark home and JARs are missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403255#comment-15403255 ] Xiao Li commented on SPARK-16842: - When users specify the schema, we do not need to discover the schema, right? Schema discovery could be very time consuming. Thus, IMO, it is still reasonable to let users specify the schema. > Concern about disallowing user-given schema for Parquet and ORC > --- > > Key: SPARK-16842 > URL: https://issues.apache.org/jira/browse/SPARK-16842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > If my understanding is correct, > If the user-given schema is different with the inferred schema, it is handled > differently for each datasource. > - For JSON and CSV > it is kind of permissive generally (for example, compatibility among > numeric types). > - For ORC and Parquet > Generally it is strict to types. So they don't allow the compatibility > (except for very few cases, e.g. for Parquet, > https://github.com/apache/spark/pull/14272 and > https://github.com/apache/spark/pull/14278) > - For Text > it only supports {{StringType}}. > - For JDBC > it does not take user-given schema since it does not implement > {{SchemaRelationProvider}}. > By allowing the user-given schema, we can use some types such as {{DateType}} > and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably > permissive schema. > To cut this short, JSON and CSV do not have the complete schema information > written in the data whereas Orc and Parquet do. > So, we might have to just disallow giving user-given schema for Parquet and > Orc. Actually, we can't give a different schema for Orc and Parquet almost at > all times if my understanding it correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
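The cost asymmetry discussed in these comments can be sketched as a toy model (an illustration, not Spark code): Parquet and ORC carry the schema in file metadata, so inference reads a footer (often from a single sampled file), while JSON and CSV must examine every record.

```python
# Toy cost model (assumption-laden illustration): footer-based formats pay
# a near-constant inference cost, text formats pay a full-scan cost.
def inference_cost(fmt, num_files, rows_per_file):
    if fmt in ("parquet", "orc"):
        # Schema lives in file metadata; one file's footer is enough
        # in most cases, so the cost is roughly constant.
        return 1
    # JSON/CSV: every row of every file must be examined to discover types.
    return num_files * rows_per_file
```

This is why a user-given schema saves much more for JSON/CSV than for Parquet/ORC, and why the comment argues the Parquet/ORC overhead is "almost constant regardless of number of files or size".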
[jira] [Commented] (SPARK-14559) Netty RPC didn't check channel is active before sending message
[ https://issues.apache.org/jira/browse/SPARK-14559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403247#comment-15403247 ] Tao Wang commented on SPARK-14559: -- Hi [~zsxwing], Sadly the application has ended now, so I can't get the thread info :( But I can be sure the AM was OK at that moment (even though the 2 attempts both failed because too many executors failed). Another point is that after AM attempt 1 failed, attempt 2 started at 11:30, but the RegisterClusterManager message was handled by the driver at around 18:30. The dispatch thread which handles the RegisterClusterManager message in the thread pool (40 threads in total) is busy all the time while some other threads are idle. So we wonder whether the message-dispatching logic has some corner case for us to cover. This is all we can get from the log. If you need other information I will try to find it in the logs, which are all we have :( > Netty RPC didn't check channel is active before sending message > --- > > Key: SPARK-14559 > URL: https://issues.apache.org/jira/browse/SPARK-14559 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 > Environment: spark1.6.1 hadoop2.2.0 jdk1.8.0_65 >Reporter: cen yuhai > > I have a long-running service. After running for several hours, it threw > these exceptions. I found that before sending an RPC request by calling the sendRpc > method in TransportClient, there is no check whether the channel is > still open or active. > java.nio.channels.ClosedChannelException > 4865 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 5635696155204230556 to > bigdata-arch-hdp407.bh.diditaxi.com/10.234.23.107:55197: java.nio. > channels.ClosedChannelException > 4866 java.nio.channels.ClosedChannelException > 4867 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 7319486003318455703 to > bigdata-arch-hdp1235.bh.diditaxi.com/10.168.145.239:36439: java.nio. 
> channels.ClosedChannelException > 4868 java.nio.channels.ClosedChannelException > 4869 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 9041854451893215954 to > bigdata-arch-hdp1398.bh.diditaxi.com/10.248.117.216:26801: java.nio. > channels.ClosedChannelException > 4870 java.nio.channels.ClosedChannelException > 4871 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 6046473497871624501 to > bigdata-arch-hdp948.bh.diditaxi.com/10.118.114.81:41903: java.nio. > channels.ClosedChannelException > 4872 java.nio.channels.ClosedChannelException > 4873 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 9085605650438705047 to > bigdata-arch-hdp1126.bh.diditaxi.com/10.168.146.78:27023: java.nio. > channels.ClosedChannelException -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
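The missing guard the reporter describes can be sketched as follows (illustrative Python with hypothetical class names; Spark's real TransportClient is Java on Netty):

```python
# Minimal sketch of the suggested check: verify the channel is still
# active before sending an RPC, failing fast with a clear error instead
# of surfacing a ClosedChannelException from deep inside the send path.
class ClosedChannelError(Exception):
    pass

class Channel:
    def __init__(self):
        self.active = True

    def close(self):
        self.active = False

class TransportClient:
    def __init__(self, channel):
        self.channel = channel

    def send_rpc(self, message):
        if not self.channel.active:
            # Fail fast rather than queueing the message on a dead channel.
            raise ClosedChannelError("channel is closed")
        return f"sent:{message}"
```

Failing fast also lets the caller immediately trigger reconnection logic instead of waiting for the write to time out.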
[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403238#comment-15403238 ] Sean Zhong edited comment on SPARK-16320 at 8/2/16 2:22 AM: [~maver1ck] Can you check whether the PR works for you? was (Author: clockfly): [~loziniak] Can you check whether the PR works for you? > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I did some tests on a parquet file with many nested columns (about 30G in > 400 partitions), and Spark 2.0 is sometimes slower. > I tested the following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar (about 1 sec). > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance? > I don't know how to prepare sample data to show the problem. > Any ideas? Or public data with many nested columns? > *UPDATE* > I created a script to generate data and confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
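As a quick sanity check on the generator above (a pure-Python sketch, no Spark required): with `levels = [2, 10, 200]` the nested schema should bottom out in 2 × 10 × 200 leaf columns, which is the fan-out that makes the nested read path expensive.

```python
from functools import reduce
from operator import mul

def leaf_column_count(levels):
    """Number of leaf columns produced by the nested schema built in
    create_sample_data above: each level multiplies the fan-out."""
    return reduce(mul, levels, 1)

print(leaf_column_count([2, 10, 200]))  # 4000
```

So the benchmark query touches one leaf out of 4000, which is exactly the case where column pruning inside nested structs matters.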
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403238#comment-15403238 ] Sean Zhong commented on SPARK-16320: [~loziniak] Can you check whether the PR works for you? > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403236#comment-15403236 ] Sylvain Zimmer commented on SPARK-16826: Sorry I can't be more helpful on the Java side... But I think there must be some high-quality URL parsing code somewhere in the Apache foundation already :-) > java.util.Hashtable limits the throughput of PARSE_URL() > > > Key: SPARK-16826 > URL: https://issues.apache.org/jira/browse/SPARK-16826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello! > I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of > {{parse_url(url, "host")}} in Spark SQL. > Unfortunately it seems that there is an internal thread-safe cache in there, > and the instances end up being 90% idle. > When I view the thread dump for my executors, most of the executor threads > are "BLOCKED", in that state: > {code} > java.util.Hashtable.get(Hashtable.java:362) > java.net.URL.getURLStreamHandler(URL.java:1135) > java.net.URL.(URL.java:599) > java.net.URL.(URL.java:490) > java.net.URL.(URL.java:439) > org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) > org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) > org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) > 
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > org.apache.spark.scheduler.Task.run(Task.scala:85) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} > However, when I switch from 1 executor with 36 cores to 9 executors with 4 > cores, throughput is almost 10x higher and the CPUs are back at ~100% use. > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
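One user-side workaround (a sketch, not a fix for the JVM-side lock itself): do the host extraction in a Python UDF with `urllib.parse`, which never touches `java.net.URL`'s synchronized stream-handler cache. The UDF registration line is illustrative and assumes a running SparkSession; the pure-Python function is shown in full. Note that `urlparse(...).hostname` strips the port and lowercases, which is close to, but not byte-identical with, `parse_url(url, 'HOST')`.

```python
from urllib.parse import urlparse

def extract_host(url):
    """Pure-Python analogue of parse_url(url, 'HOST'); urlparse does no
    stream-handler lookup, so there is no shared Hashtable to block on."""
    try:
        return urlparse(url).hostname
    except ValueError:  # e.g. malformed port
        return None

print(extract_host("https://spark.apache.org/docs/latest/"))  # spark.apache.org

# Hypothetical usage in PySpark (assumes a SparkSession named `spark`):
# spark.udf.register("py_parse_host", extract_host)
```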
[jira] [Created] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
hejie created SPARK-16845: - Summary: org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB Key: SPARK-16845 URL: https://issues.apache.org/jira/browse/SPARK-16845 Project: Spark Issue Type: Bug Components: Java API, ML, MLlib Affects Versions: 2.0.0 Reporter: hejie I have a wide table (400 columns); when I try fitting the training data on all columns, the fatal error occurs. ... 46 more Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
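The 64 KB limit is Janino's cap on a single generated Java method. A rough pure-Python illustration of why a few hundred columns is enough to approach it: the generated `SpecificOrdering.compare()` emits a comparison fragment per column, so the method body grows linearly with the column count. The fragment below is invented for illustration; the real Catalyst template differs, but is of a similar size per field.

```python
# Illustrative per-column fragment, modeled loosely on the code Catalyst
# emits into SpecificOrdering.compare(); NOT the actual template.
SNIPPET = """
comp = a.isNullAt({i}) ? (b.isNullAt({i}) ? 0 : -1)
     : (b.isNullAt({i}) ? 1 : compareTo(a.getDouble({i}), b.getDouble({i})));
if (comp != 0) return comp;
"""

def estimated_method_size(num_columns):
    # Janino counts bytecode, not source text, but source length
    # tracks it roughly and the growth is linear either way.
    return sum(len(SNIPPET.format(i=i)) for i in range(num_columns))

print(estimated_method_size(400))  # already in the tens of kilobytes
```

With ~400 columns the single method is on the same scale as the 64 KB cap, which is consistent with the JaninoRuntimeException in the report.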
[jira] [Commented] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403222#comment-15403222 ] Sean Owen commented on SPARK-16826: --- URI.toURL just follows the same code path. Does URI itself parse all the same fields? Didn't think so because URIs are a superset of URLs. Definitely open to suggestions. Anything that can parse the same fields respectably is OK. > java.util.Hashtable limits the throughput of PARSE_URL() > > > Key: SPARK-16826 > URL: https://issues.apache.org/jira/browse/SPARK-16826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello! > I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of > {{parse_url(url, "host")}} in Spark SQL. > Unfortunately it seems that there is an internal thread-safe cache in there, > and the instances end up being 90% idle. > When I view the thread dump for my executors, most of the executor threads > are "BLOCKED", in that state: > {code} > java.util.Hashtable.get(Hashtable.java:362) > java.net.URL.getURLStreamHandler(URL.java:1135) > java.net.URL.(URL.java:599) > java.net.URL.(URL.java:490) > java.net.URL.(URL.java:439) > org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) > org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) > org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) > 
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > org.apache.spark.scheduler.Task.run(Task.scala:85) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} > However, when I switch from 1 executor with 36 cores to 9 executors with 4 > cores, throughput is almost 10x higher and the CPUs are back at ~100% use. > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
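On whether a URI-style parser covers the same fields: as a point of comparison (a Python sketch, not the JVM classes under discussion), the parts Spark's parse_url exposes (PROTOCOL, HOST, PATH, QUERY, REF, USERINFO) all have counterparts in a generic RFC 3986 parser:

```python
from urllib.parse import urlparse

u = urlparse("https://user@spark.apache.org:443/docs?q=1#frag")

# parse_url part -> RFC 3986 component
mapping = {
    "PROTOCOL": u.scheme,    # https
    "HOST": u.hostname,      # spark.apache.org
    "PATH": u.path,          # /docs
    "QUERY": u.query,        # q=1
    "REF": u.fragment,       # frag
    "USERINFO": u.username,  # user
}
print(mapping)
```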
[jira] [Commented] (SPARK-16844) Generate code for sort based aggregation
[ https://issues.apache.org/jira/browse/SPARK-16844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403202#comment-15403202 ] yucai commented on SPARK-16844: --- We are working on whole-stage code generation for sort-based aggregation. A PR and test report will be sent soon. > Generate code for sort based aggregation > > > Key: SPARK-16844 > URL: https://issues.apache.org/jira/browse/SPARK-16844 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: yucai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16844) Generate code for sort based aggregation
yucai created SPARK-16844: - Summary: Generate code for sort based aggregation Key: SPARK-16844 URL: https://issues.apache.org/jira/browse/SPARK-16844 Project: Spark Issue Type: New Feature Components: SQL Reporter: yucai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16843) Select features according to a percentile of the highest scores of ChiSqSelector
Peng Meng created SPARK-16843: - Summary: Select features according to a percentile of the highest scores of ChiSqSelector Key: SPARK-16843 URL: https://issues.apache.org/jira/browse/SPARK-16843 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 2.0.0 Reporter: Peng Meng Fix For: 2.0.1 It would be handy to add a percentile Param to ChiSqSelector, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
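A minimal pure-Python sketch of the proposed behavior, mirroring scikit-learn's SelectPercentile (the function name and signature are illustrative, not the eventual MLlib API): keep the indices of the top p% of features by ChiSq score.

```python
def select_percentile(scores, percentile):
    """Return indices of the top-`percentile`% features by score,
    analogous to sklearn.feature_selection.SelectPercentile."""
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    k = max(1, int(len(scores) * percentile / 100))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

print(select_percentile([0.2, 0.9, 0.1, 0.7], 50))  # [1, 3]
```

Compared with the existing fixed-k `numTopFeatures` Param, a percentile adapts automatically when the feature dimension changes.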
[jira] [Comment Edited] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403165#comment-15403165 ] Sylvain Zimmer edited comment on SPARK-16826 at 8/2/16 1:15 AM: [~srowen] thanks for the pointers! I'm parsing every hyperlink found in Common Crawl, so there are billions of unique ones, no way around it. Wouldn't it be possible to switch to another implementation with an API similar to java.net.URL? As I understand it we never need the URLStreamHandler in the first place anyway? I'm not a Java expert but what about {{java.net.URI}} or {{org.apache.catalina.util.URL}} for instance? was (Author: sylvinus): [~srowen] thanks for the pointers! I'm parsing every hyperlink found in Common Crawl, so there are billions of unique ones, no way around it. Wouldn't it be possible to switch to another implementation with an API similar to java.net.URL? As I understand it we never need the URLStreamHandler in the first place anyway? I'm not a Java expert but what about {java.net.URI} or {org.apache.catalina.util.URL} for instance? > java.util.Hashtable limits the throughput of PARSE_URL() > > > Key: SPARK-16826 > URL: https://issues.apache.org/jira/browse/SPARK-16826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello! > I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of > {{parse_url(url, "host")}} in Spark SQL. > Unfortunately it seems that there is an internal thread-safe cache in there, > and the instances end up being 90% idle. 
> When I view the thread dump for my executors, most of the executor threads > are "BLOCKED", in that state: > {code} > java.util.Hashtable.get(Hashtable.java:362) > java.net.URL.getURLStreamHandler(URL.java:1135) > java.net.URL.(URL.java:599) > java.net.URL.(URL.java:490) > java.net.URL.(URL.java:439) > org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) > org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) > org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > org.apache.spark.scheduler.Task.run(Task.scala:85) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} > However, when I switch from 1 executor with 36 cores to 9 executors with 4 > cores, throughput is almost 10x higher and the CPUs are back at ~100% use. > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403165#comment-15403165 ] Sylvain Zimmer commented on SPARK-16826: [~srowen] thanks for the pointers! I'm parsing every hyperlink found in Common Crawl, so there are billions of unique ones, no way around it. Wouldn't it be possible to switch to another implementation with an API similar to java.net.URL? As I understand it we never need the URLStreamHandler in the first place anyway? I'm not a Java expert but what about {java.net.URI} or {org.apache.catalina.util.URL} for instance? > java.util.Hashtable limits the throughput of PARSE_URL() > > > Key: SPARK-16826 > URL: https://issues.apache.org/jira/browse/SPARK-16826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello! > I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of > {{parse_url(url, "host")}} in Spark SQL. > Unfortunately it seems that there is an internal thread-safe cache in there, > and the instances end up being 90% idle. 
> When I view the thread dump for my executors, most of the executor threads > are "BLOCKED", in that state: > {code} > java.util.Hashtable.get(Hashtable.java:362) > java.net.URL.getURLStreamHandler(URL.java:1135) > java.net.URL.(URL.java:599) > java.net.URL.(URL.java:490) > java.net.URL.(URL.java:439) > org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) > org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) > org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > org.apache.spark.scheduler.Task.run(Task.scala:85) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} > However, when I switch from 1 executor with 36 cores to 9 executors with 4 > cores, throughput is almost 10x higher and the CPUs are back at ~100% use. > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-16842: - Description: If my understanding is correct, If the user-given schema is different with the inferred schema, it is handled differently for each datasource. - For JSON and CSV it is kind of permissive generally (for example, compatibility among numeric types). - For ORC and Parquet Generally it is strict to types. So they don't allow the compatibility (except for very few cases, e.g. for Parquet, https://github.com/apache/spark/pull/14272 and https://github.com/apache/spark/pull/14278) - For Text it only supports {{StringType}}. - For JDBC it does not take user-given schema since it does not implement {{SchemaRelationProvider}}. By allowing the user-given schema, we can use some types such as {{DateType}} and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably permissive schema. To cut this short, JSON and CSV do not have the complete schema information written in the data whereas Orc and Parquet do. So, we might have to just disallow giving user-given schema for Parquet and Orc. Actually, we can't give a different schema for Orc and Parquet almost at all times if my understanding it correct. was: If my understanding is correct, If the user-given schema is different with the inferred schema, it is handled differently for each datasource. - For JSON and CSV it is kind of permissive generally (for example, compatibility among numeric types). - For ORC and Parquet Generally it is strict to types. So they don't allow the compatibility (except for very few cases, e.g. for Parquet, https://github.com/apache/spark/pull/14272 and https://github.com/apache/spark/pull/14278) - For Text it only supports {{StringType}}. - For JDBC it does not take user-given schema since it does not implement {{SchemaRelationProvider}}. 
By allowing the user-given schema, we can use some types such as {{DateType}} and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably permissive schema. To cut this short, JSON and CSV do not have the complete schema information written in the data whereas Orc and Parquet do. So, we might have to just disallow giving user-given schema. Actually, we can't give a different schema for Orc and Parquet almost at all times if my understanding it correct. > Concern about disallowing user-given schema for Parquet and ORC > --- > > Key: SPARK-16842 > URL: https://issues.apache.org/jira/browse/SPARK-16842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > If my understanding is correct, > If the user-given schema is different with the inferred schema, it is handled > differently for each datasource. > - For JSON and CSV > it is kind of permissive generally (for example, compatibility among > numeric types). > - For ORC and Parquet > Generally it is strict to types. So they don't allow the compatibility > (except for very few cases, e.g. for Parquet, > https://github.com/apache/spark/pull/14272 and > https://github.com/apache/spark/pull/14278) > - For Text > it only supports {{StringType}}. > - For JDBC > it does not take user-given schema since it does not implement > {{SchemaRelationProvider}}. > By allowing the user-given schema, we can use some types such as {{DateType}} > and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably > permissive schema. > To cut this short, JSON and CSV do not have the complete schema information > written in the data whereas Orc and Parquet do. > So, we might have to just disallow giving user-given schema for Parquet and > Orc. Actually, we can't give a different schema for Orc and Parquet almost at > all times if my understanding it correct. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
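The asymmetry described above can be sketched in a few lines (a pure-Python illustration of the described behavior, not Spark's actual type-coercion code): permissive sources like JSON/CSV accept a safe numeric widening between the inferred and user-given types, while strict sources like Parquet/ORC require an exact match.

```python
# Safe numeric widenings a permissive source would accept;
# an illustrative subset, not Spark's full coercion lattice.
WIDENINGS = {("int", "long"), ("int", "double"),
             ("long", "double"), ("float", "double")}

def accepts_user_schema(inferred, user_given, permissive):
    if inferred == user_given:
        return True
    # Strict sources (Parquet/ORC) reject any mismatch outright.
    return permissive and (inferred, user_given) in WIDENINGS

print(accepts_user_schema("int", "long", permissive=True))   # True  (JSON/CSV)
print(accepts_user_schema("int", "long", permissive=False))  # False (Parquet/ORC)
```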
[jira] [Updated] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-16842: - Description: If my understanding is correct, If the user-given schema is different with the inferred schema, it is handled differently for each datasource. - For JSON and CSV it is kind of permissive generally (for example, compatibility among numeric types). - For ORC and Parquet Generally it is strict to types. So they don't allow the compatibility (except for very few cases, e.g. for Parquet, https://github.com/apache/spark/pull/14272 and https://github.com/apache/spark/pull/14278) - For Text it only supports {{StringType}}. - For JDBC it does not take user-given schema since it does not implement {{SchemaRelationProvider}}. By allowing the user-given schema, we can use some types such as {{DateType}} and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably permissive schema. To cut this short, JSON and CSV do not have the complete schema information written in the data whereas Orc and Parquet do. So, we might have to just disallow giving user-given schema. Actually, we can't give a different schema for Orc and Parquet almost at all times if my understanding it correct. was: If my understanding is correct, If the user-given schema is different with the inferred schema, it is handled differently for each datasource. - For JSON and CSV it is kind of permissive generally (for example, compatibility among numeric types). - For ORC and Parquet Generally it is strict to types. So they don't allow the compatibility (except for very few cases, e.g. for Parquet, https://github.com/apache/spark/pull/14272 and https://github.com/apache/spark/pull/14278) - For Text it only supports `StringType`. - For JDBC it does not take user-given schema since it does not implement `SchemaRelationProvider`. By allowing the user-given schema, we can use some types such as {{DateType}} and {{TimestampType}} for JSON and CSV. 
CSV and JSON allows arguably permissive schema. To cut this short, JSON and CSV do not have the complete schema information written in the data whereas Orc and Parquet do. So, we might have to just disallow giving user-given schema. Actually, we can't give schemas for Orc and Parquet almost at all times if my understanding it correct. > Concern about disallowing user-given schema for Parquet and ORC > --- > > Key: SPARK-16842 > URL: https://issues.apache.org/jira/browse/SPARK-16842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > If my understanding is correct, > If the user-given schema is different with the inferred schema, it is handled > differently for each datasource. > - For JSON and CSV > it is kind of permissive generally (for example, compatibility among > numeric types). > - For ORC and Parquet > Generally it is strict to types. So they don't allow the compatibility > (except for very few cases, e.g. for Parquet, > https://github.com/apache/spark/pull/14272 and > https://github.com/apache/spark/pull/14278) > - For Text > it only supports {{StringType}}. > - For JDBC > it does not take user-given schema since it does not implement > {{SchemaRelationProvider}}. > By allowing the user-given schema, we can use some types such as {{DateType}} > and {{TimestampType}} for JSON and CSV. CSV and JSON allow arguably > permissive schema. > To cut this short, JSON and CSV do not have the complete schema information > written in the data whereas Orc and Parquet do. > So, we might have to just disallow giving user-given schema. Actually, we > can't give a different schema for Orc and Parquet almost at all times if my > understanding it correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403139#comment-15403139 ] Hyukjin Kwon commented on SPARK-16842: -- Let me cc [~liancheng], [~smilegator] [~dongjoon] and [~cloud_fan] who I think are related with this JIRA. > Concern about disallowing user-given schema for Parquet and ORC > --- > > Key: SPARK-16842 > URL: https://issues.apache.org/jira/browse/SPARK-16842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > If my understanding is correct, > If the user-given schema is different with the inferred schema, it is handled > differently for each datasource. > - For JSON and CSV > it is kind of permissive generally (for example, compatibility among > numeric types). > - For ORC and Parquet > Generally it is strict to types. So they don't allow the compatibility > (except for very few cases, e.g. for Parquet, > https://github.com/apache/spark/pull/14272 and > https://github.com/apache/spark/pull/14278) > - For Text > it only supports `StringType`. > - For JDBC > it does not take user-given schema since it does not implement > `SchemaRelationProvider`. > By allowing the user-given schema, we can use some types such as {{DateType}} > and {{TimestampType}} for JSON and CSV. CSV and JSON allows arguably > permissive schema. > To cut this short, JSON and CSV do not have the complete schema information > written in the data whereas Orc and Parquet do. > So, we might have to just disallow giving user-given schema. Actually, we > can't give schemas for Orc and Parquet almost at all times if my > understanding it correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403137#comment-15403137 ] Apache Spark commented on SPARK-16445: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/14447 > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xin Ren > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16828) remove MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-16828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16828. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14434 [https://github.com/apache/spark/pull/14434] > remove MaxOf and MinOf > -- > > Key: SPARK-16828 > URL: https://issues.apache.org/jira/browse/SPARK-16828 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC
Hyukjin Kwon created SPARK-16842: Summary: Concern about disallowing user-given schema for Parquet and ORC Key: SPARK-16842 URL: https://issues.apache.org/jira/browse/SPARK-16842 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Hyukjin Kwon If my understanding is correct, if the user-given schema differs from the inferred schema, it is handled differently by each datasource. - For JSON and CSV it is generally rather permissive (for example, compatibility among numeric types). - For ORC and Parquet they are generally strict about types, so they don't allow such compatibility (except for very few cases, e.g. for Parquet, https://github.com/apache/spark/pull/14272 and https://github.com/apache/spark/pull/14278) - For Text it only supports `StringType`. - For JDBC it does not take a user-given schema since it does not implement `SchemaRelationProvider`. By allowing the user-given schema, we can use some types such as {{DateType}} and {{TimestampType}} for JSON and CSV. CSV and JSON allow an arguably permissive schema. To cut this short, JSON and CSV do not have the complete schema information written in the data, whereas ORC and Parquet do. So, we might have to just disallow user-given schemas. Actually, we can almost never give schemas for ORC and Parquet, if my understanding is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
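[Editorial aside] The permissive-vs-strict distinction in SPARK-16842 can be sketched in a few lines of plain Python. This is a toy model only, not Spark's actual reconciliation logic; the type names and the widening table below are invented for illustration:

```python
# Pairs (inferred, user-given) that a permissive source would accept by
# widening; a strict source accepts only exact matches. Illustrative only.
NUMERIC_WIDENINGS = {("int", "long"), ("int", "double"), ("long", "double")}

def reconcile_field_type(user_type, inferred_type, permissive):
    """Return the type to read with, or raise if the pair is incompatible."""
    if user_type == inferred_type:
        return user_type
    if permissive and (inferred_type, user_type) in NUMERIC_WIDENINGS:
        # JSON/CSV-style: data inferred as int may be read as the user's long
        return user_type
    # Parquet/ORC-style: the same mismatch is rejected
    raise TypeError(
        f"cannot read {inferred_type} data with user-given type {user_type}")

# Permissive source tolerates the widening...
assert reconcile_field_type("long", "int", permissive=True) == "long"
# ...while a strict source rejects it.
try:
    reconcile_field_type("long", "int", permissive=False)
except TypeError:
    pass
```

This also shows why a user-given schema is most useful for JSON/CSV: since those formats carry no authoritative type information, the user's schema is the only way to get types like dates and timestamps.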
[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403116#comment-15403116 ] Charles Allen commented on SPARK-16798: --- Yep, still happens: {code} 16/08/02 00:41:17 INFO HadoopRDD: Input split: REDACTED.gz:0+7389144 16/08/02 00:41:17 INFO TorrentBroadcast: Started reading broadcast variable 0 16/08/02 00:41:17 INFO TransportClientFactory: Successfully created connection to /<> after 1 ms (0 ms spent in bootstraps) 16/08/02 00:41:17 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 18.2 KB, free 3.6 GB) 16/08/02 00:41:17 INFO TorrentBroadcast: Reading broadcast variable 0 took 34 ms 16/08/02 00:41:17 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 209.2 KB, free 3.6 GB) 16/08/02 00:41:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 16/08/02 00:41:18 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 16/08/02 00:41:18 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 16/08/02 00:41:18 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 16/08/02 00:41:18 INFO deprecation: mapred.job.id is deprecated. 
Instead, use mapreduce.job.id 16/08/02 00:41:18 INFO NativeS3FileSystem: Opening 'REDACTED.gz' for reading 16/08/02 00:41:18 INFO CodecPool: Got brand-new decompressor [.gz] 16/08/02 00:41:19 ERROR Executor: Exception in task 11.0 in stage 0.0 (TID 11) java.lang.IllegalArgumentException: bound must be positive at java.util.Random.nextInt(Random.java:388) at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} > java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2 > > > Key: SPARK-16798 > URL: https://issues.apache.org/jira/browse/SPARK-16798 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > Code at https://github.com/metamx/druid-spark-batch which was working under > 1.5.2 has ceased to function under 2.0.0 with the below stacktrace. 
> {code} > java.lang.IllegalArgumentException: bound must be positive > at java.util.Random.nextInt(Random.java:388) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
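[Editorial aside] The failing call in the SPARK-16798 trace is `java.util.Random.nextInt(bound)` inside `RDD.coalesce`, and that method requires a strictly positive bound. A pure-Python sketch of the same contract, with the guard the calling code would need (illustrative; not the actual Spark fix):

```python
import random

def pick_target_partition(num_target_partitions):
    """Pick a random target partition index.

    Mirrors the contract of java.util.Random.nextInt(bound): the bound
    must be strictly positive, otherwise the call fails -- the condition
    behind the "bound must be positive" exception in the trace above.
    """
    if num_target_partitions <= 0:
        raise ValueError("bound must be positive")
    return random.randrange(num_target_partitions)

assert 0 <= pick_target_partition(10) < 10
try:
    pick_target_partition(0)  # a zero partition count reproduces the error
except ValueError as e:
    assert "positive" in str(e)
```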
[jira] [Commented] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403110#comment-15403110 ] Miao Wang commented on SPARK-16802: --- With the latest code, it should have been fixed. I re-ran the test code for 10+ minutes. > joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException > > > Key: SPARK-16802 > URL: https://issues.apache.org/jira/browse/SPARK-16802 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Davies Liu >Priority: Critical > > Hello! > This is a little similar to > [SPARK-16740|https://issues.apache.org/jira/browse/SPARK-16740] (should I > have reopened it?). > I would recommend giving another full review to {{HashedRelation.scala}}, > particularly the new {{LongToUnsafeRowMap}} code. I've had a few other errors > that I haven't managed to reproduce so far, as well as what I suspect could > be memory leaks (I have a query in a loop OOMing after a few iterations > despite not caching its results). 
> Here is the script to reproduce the ArrayIndexOutOfBoundsException on the > current 2.0 branch: > {code} > import os > import random > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > def randlong(): > return random.randint(-9223372036854775808, 9223372036854775807) > while True: > l1, l2 = randlong(), randlong() > # Sample values that crash: > # l1, l2 = 4661454128115150227, -5543241376386463808 > print "Testing with %s, %s" % (l1, l2) > data1 = [(l1, ), (l2, )] > data2 = [(l1, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > sqlc.registerDataFrameAsTable(df1, "df1") > sqlc.registerDataFrameAsTable(df2, "df2") > result_df = sqlc.sql(""" > SELECT > df1.id1 > FROM df1 > LEFT OUTER JOIN df2 ON df1.id1 = df2.id2 > """) > print result_df.collect() > {code} > {code} > java.lang.ArrayIndexOutOfBoundsException: 1728150825 > at > org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463) > at > org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898)
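[Editorial aside] An ArrayIndexOutOfBoundsException like the one reported in SPARK-16802 typically means a hash-derived index escaped the bounds of the backing array. A toy sketch of the usual safeguard in a long-keyed map: keep the capacity a power of two and mask the hash so the slot is always in range (illustrative only; this is not `LongToUnsafeRowMap`'s actual code):

```python
def slot_for_key(key, capacity):
    """Map a 64-bit key to a slot index in [0, capacity).

    capacity must be a power of two so `h & (capacity - 1)` acts as a
    cheap, always-in-range modulo. Forgetting this mask (or computing it
    against the wrong capacity) is the classic way a large hash value
    turns into an out-of-bounds array index.
    """
    assert capacity > 0 and capacity & (capacity - 1) == 0
    # 64-bit multiplicative mix (Fibonacci hashing constant), kept in range
    h = (key * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    return h & (capacity - 1)

# Stays in bounds even for the extreme sample keys from the repro script
for k in (4661454128115150227, -5543241376386463808, 0, 1):
    assert 0 <= slot_for_key(k, 1 << 20) < (1 << 20)
```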
[jira] [Commented] (SPARK-16832) CrossValidator and TrainValidationSplit are not random without seed
[ https://issues.apache.org/jira/browse/SPARK-16832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403067#comment-15403067 ] Bryan Cutler commented on SPARK-16832: -- The default seed value is a constant; this is the trait where it is assigned [here|https://github.com/apache/spark/blob/v2.0.0/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L310]. In order to get the behavior you want, you would need to explicitly set the seed for each run with unique values, as you mentioned. > CrossValidator and TrainValidationSplit are not random without seed > --- > > Key: SPARK-16832 > URL: https://issues.apache.org/jira/browse/SPARK-16832 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Max Moroz >Priority: Minor > > Repeatedly running CrossValidator or TrainValidationSplit without an explicit > seed parameter does not change results. It is supposed to be seeded with a > random seed, but it seems to be instead seeded with some constant. (If seed > is explicitly provided, the two classes behave as expected.) 
> {code} > dataset = spark.createDataFrame( > [(Vectors.dense([0.0]), 0.0), >(Vectors.dense([0.4]), 1.0), >(Vectors.dense([0.5]), 0.0), >(Vectors.dense([0.6]), 1.0), >(Vectors.dense([1.0]), 1.0)] * 1000, > ["features", "label"]).cache() > paramGrid = pyspark.ml.tuning.ParamGridBuilder().build() > tvs = > pyspark.ml.tuning.TrainValidationSplit(estimator=pyspark.ml.regression.LinearRegression(), > >estimatorParamMaps=paramGrid, > > evaluator=pyspark.ml.evaluation.RegressionEvaluator(), >trainRatio=0.8) > model = tvs.fit(train) > print(model.validationMetrics) > for folds in (3, 5, 10): > cv = > pyspark.ml.tuning.CrossValidator(estimator=pyspark.ml.regression.LinearRegression(), > > estimatorParamMaps=paramGrid, > > evaluator=pyspark.ml.evaluation.RegressionEvaluator(), > numFolds=folds > ) > cvModel = cv.fit(dataset) > print(folds, cvModel.avgMetrics) > {code} > This code produces identical results upon repeated calls. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
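[Editorial aside] The behavior described in SPARK-16832 follows directly from a constant default seed: seeding the generator with the same value reproduces the same split on every run. A pure-Python sketch of the effect (the constant 42 is a hypothetical stand-in for the trait's default, not Spark's actual value):

```python
import random

DEFAULT_SEED = 42  # hypothetical stand-in for the shared trait's constant

def train_indices(n_rows, train_ratio, seed=DEFAULT_SEED):
    """Assign each row to the training set with probability train_ratio."""
    rng = random.Random(seed)
    return [i for i in range(n_rows) if rng.random() < train_ratio]

# With the constant default seed, repeated runs produce identical splits...
assert train_indices(1000, 0.8) == train_indices(1000, 0.8)
# ...so getting different splits requires explicitly passing fresh seeds.
assert train_indices(1000, 0.8, seed=1) != train_indices(1000, 0.8, seed=2)
```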
[jira] [Assigned] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16841: Assignee: Apache Spark > Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong >Assignee: Apache Spark > > When reading Parquet table, Spark adds row level metrics like recordsRead, > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient. When parquet vectorized reader is > not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403053#comment-15403053 ] Apache Spark commented on SPARK-16841: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/14446 > Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong > > When reading Parquet table, Spark adds row level metrics like recordsRead, > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient. When parquet vectorized reader is > not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16841: Assignee: (was: Apache Spark) > Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong > > When reading Parquet table, Spark adds row level metrics like recordsRead, > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient. When parquet vectorized reader is > not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-16841: --- Summary: Improves the row level metrics performance when reading Parquet table (was: Improve the row level metrics performance when reading Parquet table) > Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong > > When reading Parquet table, Spark adds row level metrics like recordsRead, > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient. When parquet vectorized reader is > not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16841) Improve the row level metrics performance when reading Parquet table
Sean Zhong created SPARK-16841: -- Summary: Improve the row level metrics performance when reading Parquet table Key: SPARK-16841 URL: https://issues.apache.org/jira/browse/SPARK-16841 Project: Spark Issue Type: Improvement Reporter: Sean Zhong When reading Parquet table, Spark adds row level metrics like recordsRead, bytesRead (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). The implementation is not very efficient. When parquet vectorized reader is not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
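[Editorial aside] A standard way to cut per-row metric overhead like that described in SPARK-16841 is to accumulate into a cheap local counter and flush to the shared metric periodically, rather than updating it on every record. A hedged sketch of that idea in plain Python (illustrative; not the change actually proposed for FileScanRDD):

```python
class BatchedMetric:
    """Accumulates updates locally and flushes to the total in batches."""

    def __init__(self, flush_every=1000):
        self.total = 0            # the "shared" metric value
        self._pending = 0         # cheap thread-local-style counter
        self._flush_every = flush_every

    def inc(self):
        self._pending += 1
        if self._pending >= self._flush_every:
            self.flush()

    def flush(self):
        self.total += self._pending
        self._pending = 0

def read_rows(rows, metric):
    """Iterate rows, bumping only the local counter on the hot path."""
    for row in rows:
        metric.inc()
        yield row
    metric.flush()                # make the tail of the batch visible

m = BatchedMetric(flush_every=1000)
consumed = list(read_rows(range(2500), m))
assert m.total == 2500 and len(consumed) == 2500
```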
[jira] [Assigned] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-16802: -- Assignee: Davies Liu > joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException > > > Key: SPARK-16802 > URL: https://issues.apache.org/jira/browse/SPARK-16802 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Davies Liu >Priority: Critical > > Hello! > This is a little similar to > [SPARK-16740|https://issues.apache.org/jira/browse/SPARK-16740] (should I > have reopened it?). > I would recommend to give another full review to {{HashedRelation.scala}}, > particularly the new {{LongToUnsafeRowMap}} code. I've had a few other errors > that I haven't managed to reproduce so far, as well as what I suspect could > be memory leaks (I have a query in a loop OOMing after a few iterations > despite not caching its results). > Here is the script to reproduce the ArrayIndexOutOfBoundsException on the > current 2.0 branch: > {code} > import os > import random > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > def randlong(): > return random.randint(-9223372036854775808, 9223372036854775807) > while True: > l1, l2 = randlong(), randlong() > # Sample values that crash: > # l1, l2 = 4661454128115150227, -5543241376386463808 > print "Testing with %s, %s" % (l1, l2) > data1 = [(l1, ), (l2, )] > data2 = [(l1, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > 
df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > sqlc.registerDataFrameAsTable(df1, "df1") > sqlc.registerDataFrameAsTable(df2, "df2") > result_df = sqlc.sql(""" > SELECT > df1.id1 > FROM df1 > LEFT OUTER JOIN df2 ON df1.id1 = df2.id2 > """) > print result_df.collect() > {code} > {code} > java.lang.ArrayIndexOutOfBoundsException: 1728150825 > at > org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463) > at > org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402961#comment-15402961 ] Apache Spark commented on SPARK-16320: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/14445 > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I ran some tests on a Parquet file with many nested columns (about 30G in > 400 partitions), and Spark 2.0 is sometimes slower. > I tested the following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar (about 1 sec). > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance? > I don't know how to prepare sample data to show the problem. > Any ideas? Or public data with many nested columns? > *UPDATE* > I created a script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16320: Assignee: (was: Apache Spark) > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16320: Assignee: Apache Spark > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Apache Spark >Priority: Critical > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the same > source. > I attached some VisualVM profiles there. > The most interesting ones are from the queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16840) Please save the aggregate term frequencies as part of the NaiveBayesModel
Barry Becker created SPARK-16840: Summary: Please save the aggregate term frequencies as part of the NaiveBayesModel Key: SPARK-16840 URL: https://issues.apache.org/jira/browse/SPARK-16840 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.0.0, 1.6.2 Reporter: Barry Becker I would like to visualize the structure of the NaiveBayes model in order to get additional insight into the patterns in the data. In order to do that I need the frequencies for each feature value per label. This exact information is computed in the NaiveBayes.run method (see "aggregated" variable), but then discarded when creating the model. Pi and theta are computed based on the aggregated frequency counts, but surprisingly those counts are not needed to apply the model. It would not add much to the model size to add these aggregated counts, but could be very useful for some applications of the model. {code} def run(data: RDD[LabeledPoint]): NaiveBayesModel = { : // Aggregates term frequencies per label. val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)]( createCombiner = (v: Vector) => { : }, : new NaiveBayesModel(labels, pi, theta, modelType) // <- please include "aggregated" here. } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
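The aggregation Barry refers to can be mirrored outside Spark while the counts are not exposed on the model. A minimal pure-Python sketch (illustrative only, not Spark's API; function and variable names are assumptions) of the per-label combine step that `NaiveBayes.run` performs to build its "aggregated" variable:

```python
from collections import defaultdict

def aggregate_term_frequencies(labeled_points):
    """Sum feature vectors per label, mirroring the (count, summed
    feature vector) pairs that NaiveBayes.run aggregates per label."""
    agg = defaultdict(lambda: [0, None])  # label -> [count, summed features]
    for label, features in labeled_points:
        entry = agg[label]
        entry[0] += 1
        if entry[1] is None:
            entry[1] = list(features)
        else:
            entry[1] = [a + b for a, b in zip(entry[1], features)]
    return dict(agg)

data = [(0.0, [1.0, 0.0]), (0.0, [2.0, 1.0]), (1.0, [0.0, 3.0])]
print(aggregate_term_frequencies(data))
# {0.0: [2, [3.0, 1.0]], 1.0: [1, [0.0, 3.0]]}
```

Persisting exactly this mapping alongside pi and theta is what the request amounts to; as the reporter notes, its size is on the order of theta itself.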
[jira] [Commented] (SPARK-16839) CleanupAliases may leave redundant aliases at end of analysis state
[ https://issues.apache.org/jira/browse/SPARK-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402920#comment-15402920 ] Apache Spark commented on SPARK-16839: -- User 'eyalfa' has created a pull request for this issue: https://github.com/apache/spark/pull/1 > CleanupAliases may leave redundant aliases at end of analysis state > --- > > Key: SPARK-16839 > URL: https://issues.apache.org/jira/browse/SPARK-16839 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Eyal Farago >Priority: Minor > Labels: alias, analysis, analyzers, sql, struct > Original Estimate: 72h > Remaining Estimate: 72h > > [SPARK-9634] [SPARK-9323] [SQL] introduced CleanupReferences which removes > unnecessary Aliases while keeping required ones such as top level Projection > and struct attributes. This mechanism is implemented by maintaining a boolean > flag during a top-down expression transformation; I found a case where this > mechanism leaves redundant aliases in the tree (within a right sibling of a > create_struct node). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16839) CleanupAliases may leave redundant aliases at end of analysis state
[ https://issues.apache.org/jira/browse/SPARK-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16839: Assignee: (was: Apache Spark) > CleanupAliases may leave redundant aliases at end of analysis state > --- > > Key: SPARK-16839 > URL: https://issues.apache.org/jira/browse/SPARK-16839 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Eyal Farago >Priority: Minor > Labels: alias, analysis, analyzers, sql, struct > Original Estimate: 72h > Remaining Estimate: 72h > > [SPARK-9634] [SPARK-9323] [SQL] introduced CleanupReferences which removes > unnecessary Aliases while keeping required ones such as top level Projection > and struct attributes. This mechanism is implemented by maintaining a boolean > flag during a top-down expression transformation; I found a case where this > mechanism leaves redundant aliases in the tree (within a right sibling of a > create_struct node). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16839) CleanupAliases may leave redundant aliases at end of analysis state
[ https://issues.apache.org/jira/browse/SPARK-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16839: Assignee: Apache Spark > CleanupAliases may leave redundant aliases at end of analysis state > --- > > Key: SPARK-16839 > URL: https://issues.apache.org/jira/browse/SPARK-16839 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Eyal Farago >Assignee: Apache Spark >Priority: Minor > Labels: alias, analysis, analyzers, sql, struct > Original Estimate: 72h > Remaining Estimate: 72h > > [SPARK-9634] [SPARK-9323] [SQL] introduced CleanupReferences which removes > unnecessary Aliases while keeping required ones such as top level Projection > and struct attributes. This mechanism is implemented by maintaining a boolean > flag during a top-down expression transformation; I found a case where this > mechanism leaves redundant aliases in the tree (within a right sibling of a > create_struct node). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16839) CleanupAliases may leave redundant aliases at end of analysis state
Eyal Farago created SPARK-16839: --- Summary: CleanupAliases may leave redundant aliases at end of analysis state Key: SPARK-16839 URL: https://issues.apache.org/jira/browse/SPARK-16839 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0, 1.6.1 Reporter: Eyal Farago Priority: Minor [SPARK-9634] [SPARK-9323] [SQL] introduced CleanupReferences which removes unnecessary Aliases while keeping required ones such as top level Projection and struct attributes. This mechanism is implemented by maintaining a boolean flag during a top-down expression transformation; I found a case where this mechanism leaves redundant aliases in the tree (within a right sibling of a create_struct node). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15869) HTTP 500 and NPE on streaming batch details page
[ https://issues.apache.org/jira/browse/SPARK-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-15869. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.1.0 2.0.1 > HTTP 500 and NPE on streaming batch details page > > > Key: SPARK-15869 > URL: https://issues.apache.org/jira/browse/SPARK-15869 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Shixiong Zhu > Fix For: 2.0.1, 2.1.0 > > > When I'm trying to show details of streaming batch I'm getting NPE. > Sample link: > http://127.0.0.1:4040/streaming/batch/?id=146555370 > Error: > {code} > HTTP ERROR 500 > Problem accessing /streaming/batch/. Reason: > Server Error > Caused by: > java.lang.NullPointerException > at > scala.collection.convert.Wrappers$JCollectionWrapper.iterator(Wrappers.scala:59) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:320) > at scala.collection.AbstractTraversable.groupBy(Traversable.scala:104) > at > org.apache.spark.streaming.ui.BatchPage.generateJobTable(BatchPage.scala:273) > at org.apache.spark.streaming.ui.BatchPage.render(BatchPage.scala:358) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81) > at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:83) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) > at > org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16834) TrainValildationSplit and direct evaluation produce different scores
[ https://issues.apache.org/jira/browse/SPARK-16834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402847#comment-15402847 ] Sean Owen commented on SPARK-16834: --- Hm, I see. Is it due to the bug you found in https://issues.apache.org/jira/browse/SPARK-16831 causing the metrics to almost always be too large for the CrossValidationModel? Why use different data sets in the two cases? To make this a direct comparison, train both on the same data and evaluate on the same set. > TrainValildationSplit and direct evaluation produce different scores > > > Key: SPARK-16834 > URL: https://issues.apache.org/jira/browse/SPARK-16834 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Max Moroz > > The two segments of code below are supposed to do the same thing: one is > using TrainValidationSplit, the other performs the same evaluation manually. > However, their results are statistically different (in my case, in a loop of > 20, I regularly get ~19 True values). > Unfortunately, I didn't find the bug in the source code. 
> {code} > dataset = spark.createDataFrame( > [(Vectors.dense([0.0]), 0.0), >(Vectors.dense([0.4]), 1.0), >(Vectors.dense([0.5]), 0.0), >(Vectors.dense([0.6]), 1.0), >(Vectors.dense([1.0]), 1.0)] * 1000, > ["features", "label"]).cache() > paramGrid = pyspark.ml.tuning.ParamGridBuilder().build() > # note that test is NEVER used in this code > # I create it only to utilize randomSplit > for i in range(20): > train, test = dataset.randomSplit([0.8, 0.2]) > tvs = > pyspark.ml.tuning.TrainValidationSplit(estimator=pyspark.ml.regression.LinearRegression(), > > estimatorParamMaps=paramGrid, > > evaluator=pyspark.ml.evaluation.RegressionEvaluator(), > trainRatio=0.5) > model = tvs.fit(train) > train, val, test = dataset.randomSplit([0.4, 0.4, 0.2]) > lr=pyspark.ml.regression.LinearRegression() > evaluator=pyspark.ml.evaluation.RegressionEvaluator() > lrModel = lr.fit(train) > predicted = lrModel.transform(val) > print(model.validationMetrics[0] < evaluator.evaluate(predicted)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
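Sean Owen's objection above — that scores computed on different random splits of the data are not a like-for-like comparison — can be illustrated without Spark at all. A small stdlib-only sketch (purely illustrative, no connection to Spark's internals): the same fixed set of per-example model errors, scored on two independently drawn subsets, yields different metric values from split noise alone.

```python
import random

def mse(errors):
    # Mean squared error over a list of per-example residuals.
    return sum(e * e for e in errors) / len(errors)

random.seed(0)
# Fixed residuals of one and the same hypothetical model.
errors = [random.gauss(0, 1) for _ in range(10000)]

# Two different random subsets playing the role of two validation splits.
split_a = random.sample(errors, 2000)
split_b = random.sample(errors, 2000)

print(mse(split_a), mse(split_b))  # close, but not equal: split noise alone
```

So a systematic `<` relationship between the two pipelines, as in the report, points at a real difference (such as the SPARK-16831 bug Sean mentions) only once both scores are computed on the same evaluation set.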
[jira] [Resolved] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data
[ https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16548. --- Resolution: Won't Fix > java.io.CharConversionException: Invalid UTF-32 character prevents me from > querying my data > > > Key: SPARK-16548 > URL: https://issues.apache.org/jira/browse/SPARK-16548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Egor Pahomov >Priority: Minor > > Basically, when I query my json data I get > {code} > java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above > 10) at char #192, byte #771) > at > com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) > at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571) > at > org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142) > {code} > I do not like it. If you can not process one json among 100500 please return > null, do not fail everything. I have dirty one line fix, and I understand how > I can make it more reasonable. What is our position - what behaviour we wanna > get? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16495) Add ADMM optimizer in mllib package
[ https://issues.apache.org/jira/browse/SPARK-16495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16495. --- Resolution: Later > Add ADMM optimizer in mllib package > --- > > Key: SPARK-16495 > URL: https://issues.apache.org/jira/browse/SPARK-16495 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: zunwen you > > Alternating Direction Method of Multipliers (ADMM) is well suited to > distributed convex optimization, and in particular to large-scale problems > arising in statistics, machine learning, and related areas. > Details can be found in the [S. Boyd's > paper](http://www.stanford.edu/~boyd/papers/admm_distr_stats.html). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16465) Add nonnegative flag to mllib ALS
[ https://issues.apache.org/jira/browse/SPARK-16465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16465. --- Resolution: Won't Fix > Add nonnegative flag to mllib ALS > - > > Key: SPARK-16465 > URL: https://issues.apache.org/jira/browse/SPARK-16465 > Project: Spark > Issue Type: New Feature >Reporter: Roberto Pagliari >Priority: Minor > > Currently, this flag is available in ml, not in mllib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16801) clearThreshold does not work for SparseVector
[ https://issues.apache.org/jira/browse/SPARK-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16801. --- Resolution: Not A Problem > clearThreshold does not work for SparseVector > - > > Key: SPARK-16801 > URL: https://issues.apache.org/jira/browse/SPARK-16801 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.2 >Reporter: Rahul Shah >Priority: Minor > > LogisticRegression model of mllib library performs randomly when passed with > an SparseVector instead of DenseVector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16774) Fix use of deprecated TimeStamp constructor (also providing incorrect results)
[ https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16774: -- Assignee: holdenk Priority: Minor (was: Major) > Fix use of deprecated TimeStamp constructor (also providing incorrect results) > -- > > Key: SPARK-16774 > URL: https://issues.apache.org/jira/browse/SPARK-16774 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk >Assignee: holdenk >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > The TimeStamp constructor we use inside of DateTime utils has been deprecated > since JDK 1.1 - while Java does take a long time to remove deprecated > functionality we might as well address this. Additionally it does not handle > DST boundaries correctly all the time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7445) StringIndexer should handle binary labels properly
[ https://issues.apache.org/jira/browse/SPARK-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402786#comment-15402786 ] Ruben Janssen edited comment on SPARK-7445 at 8/1/16 8:57 PM: -- I'd be interested to work on this. Before I start however, just to clarify: 'Another option is to allow users to provide a list or labels and we use the ordering.' I think you mean 'Another option is to allow users to provide a list OF labels and use THAT GIVEN ordering.'? If that is the case, I would advocate the latter because having binary labels does not necessarily imply that we have negatives and positives. We could have "left"/"right" for example. This would also be more flexible for the users and does not have to be limited to just binary labels. was (Author: rubenjanssen): I'd be interested to work on this. Before I start however, just to clarify: 'Another option is to allow users to provide a list or labels and we use the ordering.' I think you mean 'Another option is to allow users to provide a list OF labels and use THAT GIVEN ordering.'? If that is the case, I would advocate that because having binary labels does not necessarily imply that we have negatives and positives. We could have "left"/"right" for example. This would also be more flexible for the users and does not have to be limited to just binary labels. > StringIndexer should handle binary labels properly > -- > > Key: SPARK-7445 > URL: https://issues.apache.org/jira/browse/SPARK-7445 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Priority: Minor > > StringIndexer orders labels by their counts. However, for binary labels, we > should really map negatives to 0 and positive to 1. So can put special rules > for binary labels: > 1. "+1"/"-1", "1"/"-1", "1"/"0" > 2. "yes"/"no" > 3. "true"/"false" > Another option is to allow users to provide a list or labels and we use the > ordering. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16774) Fix use of deprecated TimeStamp constructor (also providing incorrect results)
[ https://issues.apache.org/jira/browse/SPARK-16774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16774. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14398 [https://github.com/apache/spark/pull/14398] > Fix use of deprecated TimeStamp constructor (also providing incorrect results) > -- > > Key: SPARK-16774 > URL: https://issues.apache.org/jira/browse/SPARK-16774 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > Fix For: 2.0.1, 2.1.0 > > > The TimeStamp constructor we use inside of DateTime utils has been deprecated > since JDK 1.1 - while Java does take a long time to remove deprecated > functionality we might as well address this. Additionally it does not handle > DST boundaries correctly all the time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7445) StringIndexer should handle binary labels properly
[ https://issues.apache.org/jira/browse/SPARK-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402786#comment-15402786 ] Ruben Janssen commented on SPARK-7445: -- I'd be interested to work on this. Before I start however, just to clarify: 'Another option is to allow users to provide a list or labels and we use the ordering.' I think you mean 'Another option is to allow users to provide a list OF labels and use THAT GIVEN ordering.'? If that is the case, I would advocate that because having binary labels does not necessarily imply that we have negatives and positives. We could have "left"/"right" for example. This would also be more flexible for the users and does not have to be limited to just binary labels. > StringIndexer should handle binary labels properly > -- > > Key: SPARK-7445 > URL: https://issues.apache.org/jira/browse/SPARK-7445 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Priority: Minor > > StringIndexer orders labels by their counts. However, for binary labels, we > should really map negatives to 0 and positive to 1. So can put special rules > for binary labels: > 1. "+1"/"-1", "1"/"-1", "1"/"0" > 2. "yes"/"no" > 3. "true"/"false" > Another option is to allow users to provide a list or labels and we use the > ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
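For concreteness, the special-casing proposed in the issue could look roughly like this pure-Python sketch (illustrative only — not Spark's `StringIndexer` implementation, and the function name is an assumption). It maps a recognized negative/positive pair to 0/1, lets an explicit user-supplied ordering win, and otherwise falls back to frequency order:

```python
from collections import Counter

# Known binary label pairs, negative first (rules 1-3 from the issue).
BINARY_PAIRS = [("-1", "+1"), ("-1", "1"), ("0", "1"),
                ("no", "yes"), ("false", "true")]

def index_labels(labels, user_order=None):
    """Return a label -> index mapping.

    An explicit user_order wins; binary labels matching a known pair map
    negative -> 0, positive -> 1; otherwise order by descending count.
    """
    if user_order is not None:
        return {lab: i for i, lab in enumerate(user_order)}
    distinct = set(labels)
    for neg, pos in BINARY_PAIRS:
        if distinct == {neg, pos}:
            return {neg: 0, pos: 1}
    counts = Counter(labels)
    ordered = sorted(distinct, key=lambda lab: (-counts[lab], lab))
    return {lab: i for i, lab in enumerate(ordered)}

print(index_labels(["yes", "no", "no"]))                   # {'no': 0, 'yes': 1}
print(index_labels(["left", "right"], ["left", "right"]))  # {'left': 0, 'right': 1}
```

The second call shows Ruben's point: a user-provided ordering handles labels like "left"/"right" that are binary but carry no negative/positive sense, and generalizes beyond the binary case.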
[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402739#comment-15402739 ] Davies Liu commented on SPARK-16700: There are two separate problems here: 1) Spark 2.0 enforces data type checking when creating a DataFrame; it's safer but slower. It makes sense to have a flag for that (on by default). 2) A Row object is similar to a named tuple (not a dict); the columns are ordered. When it's created dict-style, we have no way to know the order of the columns, so they are sorted by name, and then they do not match the schema provided. We should check the schema (order of columns) when creating a DataFrame from an RDD of Rows (we currently assume they match). > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python dict type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
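Davies' second point — dict keys carry no column order, so fields end up sorted by name and can silently mismatch the schema — suggests the usual workaround on the user side: convert each dict record to a tuple in explicit schema order before handing it to `createDataFrame`. A Spark-free sketch of that conversion (names are illustrative):

```python
schema_fields = ["id", "value"]  # the order declared in the StructType

def dict_to_row(record, fields):
    """Convert a dict record to a tuple in explicit schema order,
    instead of relying on any implicit (alphabetical) field sort."""
    return tuple(record[f] for f in fields)

records = [{"value": 10, "id": 0}, {"value": 20, "id": 1}]
rows = [dict_to_row(r, schema_fields) for r in records]
print(rows)  # [(0, 10), (1, 20)] -- matches schema order regardless of dict order
```

In the reporter's reproduction this would correspond to mapping the RDD of dicts through such a conversion before `sqlc.createDataFrame(rdd, struct_schema)`, which sidesteps both the type-check failure and the ordering ambiguity.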
[jira] [Assigned] (SPARK-15869) HTTP 500 and NPE on streaming batch details page
[ https://issues.apache.org/jira/browse/SPARK-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15869: Assignee: Apache Spark > HTTP 500 and NPE on streaming batch details page > > > Key: SPARK-15869 > URL: https://issues.apache.org/jira/browse/SPARK-15869 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Apache Spark > > When I'm trying to show details of streaming batch I'm getting NPE. > Sample link: > http://127.0.0.1:4040/streaming/batch/?id=146555370 > Error: > {code} > HTTP ERROR 500 > Problem accessing /streaming/batch/. Reason: > Server Error > Caused by: > java.lang.NullPointerException > at > scala.collection.convert.Wrappers$JCollectionWrapper.iterator(Wrappers.scala:59) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:320) > at scala.collection.AbstractTraversable.groupBy(Traversable.scala:104) > at > org.apache.spark.streaming.ui.BatchPage.generateJobTable(BatchPage.scala:273) > at org.apache.spark.streaming.ui.BatchPage.render(BatchPage.scala:358) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81) > at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:83) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) > at > org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15869) HTTP 500 and NPE on streaming batch details page
[ https://issues.apache.org/jira/browse/SPARK-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15869: Assignee: (was: Apache Spark) > HTTP 500 and NPE on streaming batch details page > > > Key: SPARK-15869 > URL: https://issues.apache.org/jira/browse/SPARK-15869 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński > > When I'm trying to show details of streaming batch I'm getting NPE. > Sample link: > http://127.0.0.1:4040/streaming/batch/?id=146555370 > Error: > {code} > HTTP ERROR 500 > Problem accessing /streaming/batch/. Reason: > Server Error > Caused by: > java.lang.NullPointerException > at > scala.collection.convert.Wrappers$JCollectionWrapper.iterator(Wrappers.scala:59) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:320) > at scala.collection.AbstractTraversable.groupBy(Traversable.scala:104) > at > org.apache.spark.streaming.ui.BatchPage.generateJobTable(BatchPage.scala:273) > at org.apache.spark.streaming.ui.BatchPage.render(BatchPage.scala:358) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81) > at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:83) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) > at > org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15869) HTTP 500 and NPE on streaming batch details page
[ https://issues.apache.org/jira/browse/SPARK-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402730#comment-15402730 ] Apache Spark commented on SPARK-15869: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/14443 > HTTP 500 and NPE on streaming batch details page > > > Key: SPARK-15869 > URL: https://issues.apache.org/jira/browse/SPARK-15869 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński > > When I'm trying to show details of streaming batch I'm getting NPE. > Sample link: > http://127.0.0.1:4040/streaming/batch/?id=146555370 > Error: > {code} > HTTP ERROR 500 > Problem accessing /streaming/batch/. Reason: > Server Error > Caused by: > java.lang.NullPointerException > at > scala.collection.convert.Wrappers$JCollectionWrapper.iterator(Wrappers.scala:59) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:320) > at scala.collection.AbstractTraversable.groupBy(Traversable.scala:104) > at > org.apache.spark.streaming.ui.BatchPage.generateJobTable(BatchPage.scala:273) > at org.apache.spark.streaming.ui.BatchPage.render(BatchPage.scala:358) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81) > at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:83) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) > at > org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
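The NPE above is raised inside scala.collection.convert.Wrappers$JCollectionWrapper.iterator, which happens when the underlying Java collection being wrapped is null (here, job data for a batch the UI no longer tracks). A minimal Java sketch of the same failure mode and a defensive guard (illustrative names, not Spark code):

```java
import java.util.Collection;
import java.util.Collections;

public class NullCollectionDemo {
    // Iterating a null collection reference throws NullPointerException,
    // the same way JCollectionWrapper.iterator() does when the wrapped
    // Java collection is null.
    static int count(Collection<String> jobs) {
        int n = 0;
        for (String j : jobs) {  // NPE here if jobs == null
            n++;
        }
        return n;
    }

    // Defensive variant: substitute an empty collection for null before
    // iterating, so the page can render "no jobs" instead of a 500.
    static int countSafe(Collection<String> jobs) {
        Collection<String> safe = (jobs == null) ? Collections.<String>emptyList() : jobs;
        int n = 0;
        for (String j : safe) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        boolean threw = false;
        try {
            count(null);
        } catch (NullPointerException e) {
            threw = true;
        }
        System.out.println(threw);            // true
        System.out.println(countSafe(null));  // 0
    }
}
```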
[jira] [Updated] (SPARK-16792) Dataset containing a Case Class with a List type causes a CompileException (converting sequence to list)
[ https://issues.apache.org/jira/browse/SPARK-16792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-16792: - Component/s: (was: Spark Core) SQL > Dataset containing a Case Class with a List type causes a CompileException > (converting sequence to list) > > > Key: SPARK-16792 > URL: https://issues.apache.org/jira/browse/SPARK-16792 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jamie Hutton >Priority: Critical > > The issue occurs when we run a .map over a dataset containing a Case Class with > a List in it. A self-contained test case is below: > case class TestCC(key: Int, letters: List[String]) //List causes the issue - > a Seq/Array works fine > /*simple test data*/ > val ds1 = sc.makeRDD(Seq( > (List("D")), > (List("S","H")), > (List("F","H")), > (List("D","L","L")) > )).map(x=>(x.length,x)).toDF("key","letters").as[TestCC] > //This will fail > val test1=ds1.map{_.key} > test1.show > Error: > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 72, Column 70: No applicable constructor/method found > for actual parameters "int, scala.collection.Seq"; candidates are: > "TestCC(int, scala.collection.immutable.List)" > It seems to be internally converting the List to a sequence, and then it can't > convert it back. > If you change the List[String] to Seq[String] or Array[String], the issue > doesn't appear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
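The "No applicable constructor/method found for actual parameters "int, scala.collection.Seq"; candidates are: "TestCC(int, scala.collection.immutable.List)"" error above is a declared-type mismatch: the generated code holds a value typed as the supertype (Seq) but the constructor demands the concrete subtype (List). A minimal Java analogy of the same mismatch and of the reporter's workaround of widening the declared element type (hypothetical class names, not Spark's codegen):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Analogy for the error above, not Spark code: if the constructor
// parameter were declared as the concrete ArrayList, a caller holding a
// plain List could not call it -- roughly "No applicable constructor
// found for (int, List); candidate: (int, ArrayList)". Declaring the
// wider interface type, like changing List[String] to Seq[String] in
// the case class, avoids the mismatch.
public class TestCCDemo {
    final int key;
    final List<String> letters;  // wide type: any List value is acceptable

    TestCCDemo(int key, List<String> letters) {
        this.key = key;
        this.letters = letters;
    }

    public static void main(String[] args) {
        // Caller only knows the value as List<String>, the wider type.
        List<String> letters = Arrays.asList("D", "L", "L");
        TestCCDemo cc = new TestCCDemo(letters.size(), letters);
        System.out.println(cc.key);  // 3
    }
}
```

This is only an illustration of the typing issue; the actual Spark fix has to happen in the deserializer codegen, which rebuilds the case class from its serialized form.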
[jira] [Commented] (SPARK-14559) Netty RPC didn't check channel is active before sending message
[ https://issues.apache.org/jira/browse/SPARK-14559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402712#comment-15402712 ] Shixiong Zhu commented on SPARK-14559: -- [~WangTao] Could you check the AM process? Looks like it's down. If it's still alive, could you provide the thread dump, please? > Netty RPC didn't check channel is active before sending message > --- > > Key: SPARK-14559 > URL: https://issues.apache.org/jira/browse/SPARK-14559 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 > Environment: spark1.6.1 hadoop2.2.0 jdk1.8.0_65 >Reporter: cen yuhai > > I have a long-running service. After running for several hours, it threw > these exceptions. I found that before sending an RPC request by calling the sendRpc > method in TransportClient, there is no check of whether the channel is > still open or active. > java.nio.channels.ClosedChannelException > 4865 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 5635696155204230556 to > bigdata-arch-hdp407.bh.diditaxi.com/10.234.23.107:55197: java.nio. > channels.ClosedChannelException > 4866 java.nio.channels.ClosedChannelException > 4867 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 7319486003318455703 to > bigdata-arch-hdp1235.bh.diditaxi.com/10.168.145.239:36439: java.nio. > channels.ClosedChannelException > 4868 java.nio.channels.ClosedChannelException > 4869 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 9041854451893215954 to > bigdata-arch-hdp1398.bh.diditaxi.com/10.248.117.216:26801: java.nio. > channels.ClosedChannelException > 4870 java.nio.channels.ClosedChannelException > 4871 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 6046473497871624501 to > bigdata-arch-hdp948.bh.diditaxi.com/10.118.114.81:41903: java.nio. 
> channels.ClosedChannelException > 4872 java.nio.channels.ClosedChannelException > 4873 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC > 9085605650438705047 to > bigdata-arch-hdp1126.bh.diditaxi.com/10.168.146.78:27023: java.nio. > channels.ClosedChannelException -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
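The guard the report asks for can be sketched as follows. This is a hypothetical minimal stand-in, not actual Netty or Spark code (real Netty does expose Channel.isActive()): check liveness before attempting the write instead of letting the I/O layer fail later with ClosedChannelException.

```java
// Hypothetical sketch of "check channel is active before sending".
public class SendGuardDemo {
    // Stand-in for a Netty-style channel; Netty's io.netty.channel.Channel
    // has an isActive() method with this meaning.
    interface Channel {
        boolean isActive();
        void write(String message);
    }

    // Returns false instead of sending when the channel is gone, so the
    // caller gets an immediate, explicit signal rather than a late
    // ClosedChannelException from the transport.
    static boolean trySend(Channel channel, String message) {
        if (channel == null || !channel.isActive()) {
            return false;
        }
        channel.write(message);
        return true;
    }

    public static void main(String[] args) {
        Channel closed = new Channel() {
            public boolean isActive() { return false; }
            public void write(String m) { throw new IllegalStateException("closed"); }
        };
        System.out.println(trySend(closed, "rpc"));  // false: nothing was sent
    }
}
```

Note the check-then-write sequence is still racy (the channel can close between the check and the write), so a real fix also needs to handle the write failure; the check just catches the common case early.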
[jira] [Updated] (SPARK-16836) Hive date/time function error
[ https://issues.apache.org/jira/browse/SPARK-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16836: Description: Previously available hive functions for date/time are not available in Spark 2.0 (e.g. current_date, current_timestamp). These functions work in Spark 1.6.2 with HiveContext. Example (from spark-shell): {noformat} scala> spark.sql("select current_date") org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given input columns: []; line 1 pos 7 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) ... 48 elided {noformat} was: Previously available hive functions for date/time are not available in Spark 2.0 (e.g. current_date, current_timestamp). These functions work in Spark 1.6.2 with HiveContext. 
Example (from spark-shell): scala> spark.sql("select current_date") org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given input columns: []; line 1 pos 7 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at
[jira] [Comment Edited] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402676#comment-15402676 ] Charles Allen edited comment on SPARK-16798 at 8/1/16 7:30 PM: --- Minor update. Due to library collisions I have to change around how some of the tagging works internally. I'm cutting an internal-only (MMX) release of https://github.com/metamx/spark/commit/13650fc58e1fcf2cf2a26ba11c819185ae1acc1f with a new tag/version to prevent potential version conflicts in our infrastructure. Didn't want to mess with it over the weekend so new build is making its way through now. was (Author: drcrallen): Minor update. Due to library collisions I have to change around how some of the tagging works internally. I'm cutting an internal-only release of https://github.com/metamx/spark/commit/13650fc58e1fcf2cf2a26ba11c819185ae1acc1f with a new tag/version to prevent potential version conflicts in our infrastructure. Didn't want to mess with it over the weekend so new build is making its way through now. > java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2 > > > Key: SPARK-16798 > URL: https://issues.apache.org/jira/browse/SPARK-16798 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > Code at https://github.com/metamx/druid-spark-batch which was working under > 1.5.2 has ceased to function under 2.0.0 with the below stacktrace. 
> {code} > java.lang.IllegalArgumentException: bound must be positive > at java.util.Random.nextInt(Random.java:388) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402676#comment-15402676 ] Charles Allen commented on SPARK-16798: --- Minor update. Due to library collisions I have to change around how some of the tagging works internally. I'm cutting an internal-only release of https://github.com/metamx/spark/commit/13650fc58e1fcf2cf2a26ba11c819185ae1acc1f with a new tag/version to prevent potential version conflicts in our infrastructure. Didn't want to mess with it over the weekend so new build is making its way through now. > java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2 > > > Key: SPARK-16798 > URL: https://issues.apache.org/jira/browse/SPARK-16798 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > Code at https://github.com/metamx/druid-spark-batch which was working under > 1.5.2 has ceased to function under 2.0.0 with the below stacktrace. 
> {code} > java.lang.IllegalArgumentException: bound must be positive > at java.util.Random.nextInt(Random.java:388) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
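The top of the trace pinpoints the failure mode: java.util.Random.nextInt(bound) requires bound > 0, so drawing a random partition index inside RDD.coalesce blows up when the partition count is zero. A minimal reproduction and a clamping guard (method and variable names here are illustrative, not Spark's):

```java
import java.util.Random;

public class BoundDemo {
    // Guarded draw: clamp to at least one partition before calling
    // nextInt, so an empty partition count cannot trigger
    // "bound must be positive".
    static int pickPartition(Random rng, int numPartitions) {
        return rng.nextInt(Math.max(1, numPartitions));
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        boolean threw = false;
        try {
            rng.nextInt(0);  // IllegalArgumentException: bound must be positive
        } catch (IllegalArgumentException e) {
            threw = true;
        }
        System.out.println(threw);                  // true
        System.out.println(pickPartition(rng, 0));  // 0: clamped to a single partition
    }
}
```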
[jira] [Assigned] (SPARK-16836) Hive date/time function error
[ https://issues.apache.org/jira/browse/SPARK-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16836: Assignee: Apache Spark > Hive date/time function error > - > > Key: SPARK-16836 > URL: https://issues.apache.org/jira/browse/SPARK-16836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jesse Lord >Assignee: Apache Spark >Priority: Minor > > Previously available hive functions for date/time are not available in Spark > 2.0 (e.g. current_date, current_timestamp). These functions work in Spark > 1.6.2 with HiveContext. > Example (from spark-shell): > scala> spark.sql("select current_date") > org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given > input columns: []; line 1 pos 7 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) > ... 48 elided -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16836) Hive date/time function error
[ https://issues.apache.org/jira/browse/SPARK-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402635#comment-15402635 ] Apache Spark commented on SPARK-16836: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/14442 > Hive date/time function error > - > > Key: SPARK-16836 > URL: https://issues.apache.org/jira/browse/SPARK-16836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jesse Lord >Priority: Minor > > Previously available hive functions for date/time are not available in Spark > 2.0 (e.g. current_date, current_timestamp). These functions work in Spark > 1.6.2 with HiveContext. > Example (from spark-shell): > scala> spark.sql("select current_date") > org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given > input columns: []; line 1 pos 7 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) > ... 48 elided -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16836) Hive date/time function error
[ https://issues.apache.org/jira/browse/SPARK-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16836: Assignee: (was: Apache Spark) > Hive date/time function error > - > > Key: SPARK-16836 > URL: https://issues.apache.org/jira/browse/SPARK-16836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jesse Lord >Priority: Minor > > Previously available hive functions for date/time are not available in Spark > 2.0 (e.g. current_date, current_timestamp). These functions work in Spark > 1.6.2 with HiveContext. > Example (from spark-shell): > scala> spark.sql("select current_date") > org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given > input columns: []; line 1 pos 7 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) > ... 48 elided -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors
[ https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom Magrino updated SPARK-16837: Description: Right now, the constructors for the TimeWindow expression in Catalyst incorrectly uses the windowDuration in place of the slideDuration. This will cause incorrect windowing semantics after time window expressions are analyzed by Catalyst. Relevant code is here: https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54 was: Right now, the constructors for the TimeWindow expression in Catalyst incorrectly uses the windowDuration in place of the slideDuration. This will cause incorrect windowing semantics the after time window expressions are analyzed by Catalyst. Relevant code is here: https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54 > TimeWindow incorrectly drops slideDuration in constructors > -- > > Key: SPARK-16837 > URL: https://issues.apache.org/jira/browse/SPARK-16837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Tom Magrino > > Right now, the constructors for the TimeWindow expression in Catalyst > incorrectly uses the windowDuration in place of the slideDuration. This will > cause incorrect windowing semantics after time window expressions are > analyzed by Catalyst. > Relevant code is here: > https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors
[ https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16837: Assignee: (was: Apache Spark) > TimeWindow incorrectly drops slideDuration in constructors > -- > > Key: SPARK-16837 > URL: https://issues.apache.org/jira/browse/SPARK-16837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Tom Magrino > > Right now, the constructors for the TimeWindow expression in Catalyst > incorrectly use the windowDuration in place of the slideDuration. This will > cause incorrect windowing semantics after time window expressions are > analyzed by Catalyst. > Relevant code is here: > https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors
[ https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402600#comment-15402600 ] Apache Spark commented on SPARK-16837: -- User 'tmagrino' has created a pull request for this issue: https://github.com/apache/spark/pull/14441 > TimeWindow incorrectly drops slideDuration in constructors > -- > > Key: SPARK-16837 > URL: https://issues.apache.org/jira/browse/SPARK-16837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Tom Magrino > > Right now, the constructors for the TimeWindow expression in Catalyst > incorrectly use the windowDuration in place of the slideDuration. This will > cause incorrect windowing semantics after time window expressions are > analyzed by Catalyst. > Relevant code is here: > https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54
[jira] [Assigned] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors
[ https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16837: Assignee: Apache Spark > TimeWindow incorrectly drops slideDuration in constructors > -- > > Key: SPARK-16837 > URL: https://issues.apache.org/jira/browse/SPARK-16837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Tom Magrino >Assignee: Apache Spark > > Right now, the constructors for the TimeWindow expression in Catalyst > incorrectly use the windowDuration in place of the slideDuration. This will > cause incorrect windowing semantics after time window expressions are > analyzed by Catalyst. > Relevant code is here: > https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54
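The bug described in SPARK-16837 is that a convenience constructor silently substitutes windowDuration for slideDuration, turning every sliding window into a tumbling window. The real fix is in Catalyst's Scala code (TimeWindow.scala); the following is only a minimal pure-Python sketch of the bug pattern and its observable effect, with hypothetical names (`TimeWindow`, `window_starts`) that do not correspond to Spark's actual API:

```python
from dataclasses import dataclass

@dataclass
class TimeWindow:
    """Illustrative stand-in for a time-window expression (not Spark's class)."""
    window_duration: int  # length of each window
    slide_duration: int   # interval between consecutive window starts

    @classmethod
    def buggy(cls, window_duration, slide_duration):
        # Bug pattern from SPARK-16837: the constructor passes window_duration
        # where slide_duration belongs, silently dropping the caller's slide.
        return cls(window_duration, window_duration)

    @classmethod
    def fixed(cls, window_duration, slide_duration):
        return cls(window_duration, slide_duration)

    def window_starts(self, t):
        """Start times of every window [start, start + window_duration) containing t."""
        starts = []
        start = t - t % self.slide_duration
        while start > t - self.window_duration:
            starts.append(start)
            start -= self.slide_duration
        return sorted(starts)

# A 10-unit window sliding every 5 units should assign t=12 to two
# overlapping windows; the buggy constructor collapses it to one.
print(TimeWindow.buggy(10, 5).window_starts(12))   # tumbling: [10]
print(TimeWindow.fixed(10, 5).window_starts(12))   # sliding: [5, 10]
```

Because the dropped parameter only changes which windows an event lands in, the bug is invisible to type checking and only shows up as wrong aggregation results downstream.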
[jira] [Commented] (SPARK-16768) pyspark calls incorrect version of logistic regression
[ https://issues.apache.org/jira/browse/SPARK-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402599#comment-15402599 ] Colin Beckingham commented on SPARK-16768: -- Sean said "If you mean the calling stack trace..." - well, the information came from the PySpark Spark Jobs browser page, from the "Description" column under "Completed Jobs". In 1.6.2 the page reassuringly informs me that it is using L-BFGS, while in 2.1 it appears to be doing something else. It may in fact be doing exactly what is required, in which case I stand corrected and now know that the 2.1 page should be interpreted differently. No problem. > pyspark calls incorrect version of logistic regression > -- > > Key: SPARK-16768 > URL: https://issues.apache.org/jira/browse/SPARK-16768 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark > Environment: Linux openSUSE Leap 42.1 Gnome >Reporter: Colin Beckingham > > PySpark call with Spark 1.6.2 "LogisticRegressionWithLBFGS.train()" runs > "treeAggregate at LBFGS.scala:218" but the same command in pyspark with Spark > 2.1 runs "treeAggregate at LogisticRegression.scala:1092". This non-optimized > version is much slower and produces a different answer from LBFGS. > -- > This message was sent by Atlassian JIRA (v6.3.4#6332)
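Part of what makes the "produces a different answer" claim above surprising is that the (L2-regularized) logistic loss is strictly convex, so any correct optimizer, L-BFGS or otherwise, should reach the same solution. This pure-Python sketch (it does not use or reproduce Spark's implementations) shows plain gradient descent on a tiny regularized logistic loss converging to the same weight from two very different starting points:

```python
import math

# Tiny 1-D dataset with labels in {-1, +1}.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1.0, -1.0, 1.0, 1.0]
lam = 0.1  # L2 regularization makes the minimizer unique and finite

def grad(w):
    """Gradient of mean logistic loss + (lam/2) * w^2 at weight w."""
    g = sum(-y * x / (1.0 + math.exp(y * w * x)) for x, y in zip(xs, ys))
    return g / len(xs) + lam * w

def minimize(w, lr=0.5, steps=2000):
    """Plain gradient descent; any sound convex optimizer would land here too."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w_from_left = minimize(-5.0)
w_from_right = minimize(5.0)
print(abs(w_from_left - w_from_right))  # both starts reach the same minimizer
```

If two Spark versions genuinely return different coefficients on the same regularized problem, that points at differing defaults (regularization, standardization, convergence tolerance) rather than the optimizer choice itself.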
[jira] [Commented] (SPARK-16775) Reduce internal warnings from deprecated accumulator API
[ https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402592#comment-15402592 ] holdenk commented on SPARK-16775: - Yes, so my plan is to replace it with the new API in all of the places where I can - but in the places where that isn't reasonable (like places where the old accumulator API depends on itself), do some sleight of hand with private internal methods to make the warnings go away. > Reduce internal warnings from deprecated accumulator API > > > Key: SPARK-16775 > URL: https://issues.apache.org/jira/browse/SPARK-16775 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Spark Core, SQL >Reporter: holdenk > > Deprecating the old accumulator API added a large number of warnings - many > of these could be fixed with a bit of refactoring to offer a non-deprecated > internal class while still preserving the external deprecation warnings.