[jira] [Assigned] (SPARK-8187) date/time function: date_sub
[ https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8187:
-----------------------------------

    Assignee: (was: Apache Spark)

> date/time function: date_sub
> ----------------------------
>
>          Key: SPARK-8187
>          URL: https://issues.apache.org/jira/browse/SPARK-8187
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>
> date_sub(string startdate, int days): string
> date_sub(date startdate, int days): date
> Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = '2008-12-30'.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
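The Hive-style semantics quoted above can be sketched in plain Python — purely illustrative, not Spark's implementation; only the function name mirrors the SQL one:

```python
from datetime import date, timedelta

def date_sub(startdate: str, days: int) -> str:
    """Subtract `days` days from a 'yyyy-MM-dd' date string, Hive-style."""
    return (date.fromisoformat(startdate) - timedelta(days=days)).isoformat()

print(date_sub("2008-12-31", 1))  # -> 2008-12-30
```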
[jira] [Assigned] (SPARK-8185) date/time function: datediff
[ https://issues.apache.org/jira/browse/SPARK-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8185:
-----------------------------------

    Assignee: (was: Apache Spark)

> date/time function: datediff
> ----------------------------
>
>          Key: SPARK-8185
>          URL: https://issues.apache.org/jira/browse/SPARK-8185
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>
> datediff(date enddate, date startdate): int
> Returns the number of days from startdate to enddate: datediff('2009-03-01', '2009-02-27') = 2.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
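A minimal Python sketch of the datediff semantics described above (illustrative only; note the argument order is end date first, matching the Hive signature):

```python
from datetime import date

def datediff(enddate: str, startdate: str) -> int:
    """Number of days from startdate to enddate, as in datediff(end, start)."""
    return (date.fromisoformat(enddate) - date.fromisoformat(startdate)).days

print(datediff("2009-03-01", "2009-02-27"))  # -> 2
```

Swapping the arguments yields a negative count, which follows directly from the subtraction.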
[jira] [Commented] (SPARK-8185) date/time function: datediff
[ https://issues.apache.org/jira/browse/SPARK-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583058#comment-14583058 ]

Apache Spark commented on SPARK-8185:
-------------------------------------

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6782

> date/time function: datediff
> ----------------------------
>
>          Key: SPARK-8185
>          URL: https://issues.apache.org/jira/browse/SPARK-8185
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>
> datediff(date enddate, date startdate): int
> Returns the number of days from startdate to enddate: datediff('2009-03-01', '2009-02-27') = 2.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Assigned] (SPARK-8186) date/time function: date_add
[ https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8186:
-----------------------------------

    Assignee: Apache Spark

> date/time function: date_add
> ----------------------------
>
>          Key: SPARK-8186
>          URL: https://issues.apache.org/jira/browse/SPARK-8186
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>     Assignee: Apache Spark
>
> date_add(string startdate, int days): string
> date_add(date startdate, int days): date
> Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.
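The date_add semantics quoted above, sketched in Python for illustration (not Spark's code):

```python
from datetime import date, timedelta

def date_add(startdate: str, days: int) -> str:
    """Add `days` days to a 'yyyy-MM-dd' date string, Hive-style."""
    return (date.fromisoformat(startdate) + timedelta(days=days)).isoformat()

print(date_add("2008-12-31", 1))  # -> 2009-01-01
```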
[jira] [Commented] (SPARK-8187) date/time function: date_sub
[ https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583060#comment-14583060 ]

Apache Spark commented on SPARK-8187:
-------------------------------------

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6782

> date/time function: date_sub
> ----------------------------
>
>          Key: SPARK-8187
>          URL: https://issues.apache.org/jira/browse/SPARK-8187
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>
> date_sub(string startdate, int days): string
> date_sub(date startdate, int days): date
> Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = '2008-12-30'.
[jira] [Commented] (SPARK-8186) date/time function: date_add
[ https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583059#comment-14583059 ]

Apache Spark commented on SPARK-8186:
-------------------------------------

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6782

> date/time function: date_add
> ----------------------------
>
>          Key: SPARK-8186
>          URL: https://issues.apache.org/jira/browse/SPARK-8186
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>
> date_add(string startdate, int days): string
> date_add(date startdate, int days): date
> Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.
[jira] [Assigned] (SPARK-8186) date/time function: date_add
[ https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8186:
-----------------------------------

    Assignee: (was: Apache Spark)

> date/time function: date_add
> ----------------------------
>
>          Key: SPARK-8186
>          URL: https://issues.apache.org/jira/browse/SPARK-8186
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>
> date_add(string startdate, int days): string
> date_add(date startdate, int days): date
> Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.
[jira] [Assigned] (SPARK-8187) date/time function: date_sub
[ https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8187:
-----------------------------------

    Assignee: Apache Spark

> date/time function: date_sub
> ----------------------------
>
>          Key: SPARK-8187
>          URL: https://issues.apache.org/jira/browse/SPARK-8187
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>     Assignee: Apache Spark
>
> date_sub(string startdate, int days): string
> date_sub(date startdate, int days): date
> Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = '2008-12-30'.
[jira] [Assigned] (SPARK-8185) date/time function: datediff
[ https://issues.apache.org/jira/browse/SPARK-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8185:
-----------------------------------

    Assignee: Apache Spark

> date/time function: datediff
> ----------------------------
>
>          Key: SPARK-8185
>          URL: https://issues.apache.org/jira/browse/SPARK-8185
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>     Assignee: Apache Spark
>
> datediff(date enddate, date startdate): int
> Returns the number of days from startdate to enddate: datediff('2009-03-01', '2009-02-27') = 2.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Assigned] (SPARK-7284) Update streaming documentation for Spark 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7284:
-----------------------------------

    Assignee: Apache Spark  (was: Tathagata Das)

> Update streaming documentation for Spark 1.4.0 release
> ------------------------------------------------------
>
>          Key: SPARK-7284
>          URL: https://issues.apache.org/jira/browse/SPARK-7284
>      Project: Spark
>   Issue Type: Improvement
>   Components: Documentation, Streaming
>     Reporter: Tathagata Das
>     Assignee: Apache Spark
>     Priority: Critical
>
> Things to update (continuously updated list):
> - Python API for Kafka Direct
> - Pointers to the new Streaming UI
> - Update Kafka version to 0.8.2.1
> - Add ref to RDD.foreachPartitionWithIndex (if merged)
[jira] [Commented] (SPARK-7284) Update streaming documentation for Spark 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583054#comment-14583054 ]

Apache Spark commented on SPARK-7284:
-------------------------------------

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/6781

> Update streaming documentation for Spark 1.4.0 release
> ------------------------------------------------------
>
>          Key: SPARK-7284
>          URL: https://issues.apache.org/jira/browse/SPARK-7284
>      Project: Spark
>   Issue Type: Improvement
>   Components: Documentation, Streaming
>     Reporter: Tathagata Das
>     Assignee: Tathagata Das
>     Priority: Critical
>
> Things to update (continuously updated list):
> - Python API for Kafka Direct
> - Pointers to the new Streaming UI
> - Update Kafka version to 0.8.2.1
> - Add ref to RDD.foreachPartitionWithIndex (if merged)
[jira] [Assigned] (SPARK-7284) Update streaming documentation for Spark 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7284:
-----------------------------------

    Assignee: Tathagata Das  (was: Apache Spark)

> Update streaming documentation for Spark 1.4.0 release
> ------------------------------------------------------
>
>          Key: SPARK-7284
>          URL: https://issues.apache.org/jira/browse/SPARK-7284
>      Project: Spark
>   Issue Type: Improvement
>   Components: Documentation, Streaming
>     Reporter: Tathagata Das
>     Assignee: Tathagata Das
>     Priority: Critical
>
> Things to update (continuously updated list):
> - Python API for Kafka Direct
> - Pointers to the new Streaming UI
> - Update Kafka version to 0.8.2.1
> - Add ref to RDD.foreachPartitionWithIndex (if merged)
[jira] [Commented] (SPARK-7289) Combine Limit and Sort to avoid total ordering
[ https://issues.apache.org/jira/browse/SPARK-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583042#comment-14583042 ]

Apache Spark commented on SPARK-7289:
-------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/6780

> Combine Limit and Sort to avoid total ordering
> ----------------------------------------------
>
>          Key: SPARK-7289
>          URL: https://issues.apache.org/jira/browse/SPARK-7289
>      Project: Spark
>   Issue Type: Improvement
>   Components: SQL
> Affects Versions: 1.3.1
>     Reporter: Fei Wang
>
> Optimize the following SQL:
>
>   select key from (select * from testData order by key) t limit 5
>
> from
>
>   == Parsed Logical Plan ==
>   'Limit 5
>    'Project ['key]
>     'Subquery t
>      'Sort ['key ASC], true
>       'Project [*]
>        'UnresolvedRelation [testData], None
>
>   == Analyzed Logical Plan ==
>   Limit 5
>    Project [key#0]
>     Subquery t
>      Sort [key#0 ASC], true
>       Project [key#0,value#1]
>        Subquery testData
>         LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
>
>   == Optimized Logical Plan ==
>   Limit 5
>    Project [key#0]
>     Sort [key#0 ASC], true
>      LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
>
>   == Physical Plan ==
>   Limit 5
>    Project [key#0]
>     Sort [key#0 ASC], true
>      Exchange (RangePartitioning [key#0 ASC], 5), []
>       PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
>
> to
>
>   == Parsed Logical Plan ==
>   'Limit 5
>    'Project ['key]
>     'Subquery t
>      'Sort ['key ASC], true
>       'Project [*]
>        'UnresolvedRelation [testData], None
>
>   == Analyzed Logical Plan ==
>   Limit 5
>    Project [key#0]
>     Subquery t
>      Sort [key#0 ASC], true
>       Project [key#0,value#1]
>        Subquery testData
>         LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
>
>   == Optimized Logical Plan ==
>   Project [key#0]
>    Limit 5
>     Sort [key#0 ASC], true
>      LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
>
>   == Physical Plan ==
>   Project [key#0]
>    TakeOrdered 5, [key#0 ASC]
>     PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
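The win from the TakeOrdered-style plan can be illustrated in plain Python: a limit-after-sort forces a total ordering, while a bounded heap keeps only the k smallest rows. This sketch is illustrative and unrelated to Spark's actual operators:

```python
import heapq
import random

random.seed(0)
# 100 rows with distinct, unordered keys.
rows = [(k, f"v{k}") for k in random.sample(range(10000), 100)]

# Sort-then-limit: total ordering of all rows, O(n log n).
full_sort_top5 = sorted(rows)[:5]

# TakeOrdered-style: bounded heap of the 5 smallest, O(n log k), no total sort.
take_ordered_top5 = heapq.nsmallest(5, rows)

assert full_sort_top5 == take_ordered_top5
```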
[jira] [Commented] (SPARK-7267) Push down Project when its child is Limit
[ https://issues.apache.org/jira/browse/SPARK-7267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583041#comment-14583041 ]

Apache Spark commented on SPARK-7267:
-------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/6780

> Push down Project when its child is Limit
> -----------------------------------------
>
>          Key: SPARK-7267
>          URL: https://issues.apache.org/jira/browse/SPARK-7267
>      Project: Spark
>   Issue Type: Improvement
>   Components: SQL
> Affects Versions: 1.3.1
>     Reporter: Zhongshuai Pei
>     Assignee: Zhongshuai Pei
>     Priority: Critical
>      Fix For: 1.4.0
>
> SQL:
> {quote}
> select key from (select key,value from t1 limit 100) t2 limit 10
> {quote}
> Optimized Logical Plan before modifying:
> {quote}
> == Optimized Logical Plan ==
> Limit 10
>  Project [key#228]
>   Limit 100
>    MetastoreRelation default, t1, None
> {quote}
> Optimized Logical Plan after modifying:
> {quote}
> == Optimized Logical Plan ==
> Limit 10
>  Limit 100
>   Project [key#228]
>    MetastoreRelation default, t1, None
> {quote}
> After this, we can combine the limits.
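Why pushing the Project below the Limit helps can be shown with toy list operators — the names here are illustrative, not Catalyst's. Once the Project sits below the inner Limit, the two Limits are adjacent and collapse into one:

```python
rows = [{"key": i, "value": i * 10} for i in range(1000)]

def project(rs, cols):
    # Keep only the named columns of each row.
    return [{c: r[c] for c in cols} for r in rs]

def limit(rs, n):
    return rs[:n]

# Before: Limit 10 over (Project [key] over Limit 100).
before = limit(project(limit(rows, 100), ["key"]), 10)

# After pushdown the Limits touch and combine into Limit min(10, 100):
after = project(limit(rows, min(10, 100)), ["key"])

assert before == after
```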
[jira] [Commented] (SPARK-8234) misc function: md5
[ https://issues.apache.org/jira/browse/SPARK-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583034#comment-14583034 ]

Apache Spark commented on SPARK-8234:
-------------------------------------

User 'qiansl127' has created a pull request for this issue:
https://github.com/apache/spark/pull/6779

> misc function: md5
> ------------------
>
>          Key: SPARK-8234
>          URL: https://issues.apache.org/jira/browse/SPARK-8234
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>
> md5(string/binary): string
> Calculates an MD5 128-bit checksum for the string or binary (as of Hive 1.3.0). The value is
> returned as a string of 32 hex digits, or NULL if the argument was NULL.
> Example: md5('ABC') = '902fbdd2b1df0c4f70b4a5d23525e932'.
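The described behavior maps directly onto Python's standard library, which is handy for checking the example value from the issue (illustrative sketch, not Spark's code; the NULL case is modeled with `None`):

```python
import hashlib

def md5_str(s):
    """MD5 of a UTF-8 string as 32 hex digits; None in, None out (NULL-style)."""
    if s is None:
        return None
    return hashlib.md5(s.encode("utf-8")).hexdigest()

print(md5_str("ABC"))  # -> 902fbdd2b1df0c4f70b4a5d23525e932
```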
[jira] [Assigned] (SPARK-8234) misc function: md5
[ https://issues.apache.org/jira/browse/SPARK-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8234:
-----------------------------------

    Assignee: (was: Apache Spark)

> misc function: md5
> ------------------
>
>          Key: SPARK-8234
>          URL: https://issues.apache.org/jira/browse/SPARK-8234
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>
> md5(string/binary): string
> Calculates an MD5 128-bit checksum for the string or binary (as of Hive 1.3.0). The value is
> returned as a string of 32 hex digits, or NULL if the argument was NULL.
> Example: md5('ABC') = '902fbdd2b1df0c4f70b4a5d23525e932'.
[jira] [Assigned] (SPARK-8234) misc function: md5
[ https://issues.apache.org/jira/browse/SPARK-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8234:
-----------------------------------

    Assignee: Apache Spark

> misc function: md5
> ------------------
>
>          Key: SPARK-8234
>          URL: https://issues.apache.org/jira/browse/SPARK-8234
>      Project: Spark
>   Issue Type: Sub-task
>   Components: SQL
>     Reporter: Reynold Xin
>     Assignee: Apache Spark
>
> md5(string/binary): string
> Calculates an MD5 128-bit checksum for the string or binary (as of Hive 1.3.0). The value is
> returned as a string of 32 hex digits, or NULL if the argument was NULL.
> Example: md5('ABC') = '902fbdd2b1df0c4f70b4a5d23525e932'.
[jira] [Assigned] (SPARK-8323) Remove mapOutputTracker field in TaskSchedulerImpl
[ https://issues.apache.org/jira/browse/SPARK-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8323:
-----------------------------------

    Assignee: (was: Apache Spark)

> Remove mapOutputTracker field in TaskSchedulerImpl
> --------------------------------------------------
>
>          Key: SPARK-8323
>          URL: https://issues.apache.org/jira/browse/SPARK-8323
>      Project: Spark
>   Issue Type: Improvement
>   Components: Scheduler, Spark Core
>     Reporter: patrickliu
>
> TaskSchedulerImpl's mapOutputTracker field is only referenced once, in TaskSetManager, so I
> think we could remove the field from the TaskSchedulerImpl class. Instead, TaskSetManager
> could reference the mapOutputTracker from SparkEnv directly.
[jira] [Commented] (SPARK-8323) Remove mapOutputTracker field in TaskSchedulerImpl
[ https://issues.apache.org/jira/browse/SPARK-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583033#comment-14583033 ]

Apache Spark commented on SPARK-8323:
-------------------------------------

User 'yufan-liu' has created a pull request for this issue:
https://github.com/apache/spark/pull/6778

> Remove mapOutputTracker field in TaskSchedulerImpl
> --------------------------------------------------
>
>          Key: SPARK-8323
>          URL: https://issues.apache.org/jira/browse/SPARK-8323
>      Project: Spark
>   Issue Type: Improvement
>   Components: Scheduler, Spark Core
>     Reporter: patrickliu
>
> TaskSchedulerImpl's mapOutputTracker field is only referenced once, in TaskSetManager, so I
> think we could remove the field from the TaskSchedulerImpl class. Instead, TaskSetManager
> could reference the mapOutputTracker from SparkEnv directly.
[jira] [Assigned] (SPARK-8323) Remove mapOutputTracker field in TaskSchedulerImpl
[ https://issues.apache.org/jira/browse/SPARK-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8323:
-----------------------------------

    Assignee: Apache Spark

> Remove mapOutputTracker field in TaskSchedulerImpl
> --------------------------------------------------
>
>          Key: SPARK-8323
>          URL: https://issues.apache.org/jira/browse/SPARK-8323
>      Project: Spark
>   Issue Type: Improvement
>   Components: Scheduler, Spark Core
>     Reporter: patrickliu
>     Assignee: Apache Spark
>
> TaskSchedulerImpl's mapOutputTracker field is only referenced once, in TaskSetManager, so I
> think we could remove the field from the TaskSchedulerImpl class. Instead, TaskSetManager
> could reference the mapOutputTracker from SparkEnv directly.
[jira] [Created] (SPARK-8323) Remove mapOutputTracker field in TaskSchedulerImpl
patrickliu created SPARK-8323:
------------------------------

         Summary: Remove mapOutputTracker field in TaskSchedulerImpl
             Key: SPARK-8323
             URL: https://issues.apache.org/jira/browse/SPARK-8323
         Project: Spark
      Issue Type: Improvement
      Components: Scheduler, Spark Core
        Reporter: patrickliu

TaskSchedulerImpl's mapOutputTracker field is only referenced once, in TaskSetManager, so I
think we could remove the field from the TaskSchedulerImpl class. Instead, TaskSetManager
could reference the mapOutputTracker from SparkEnv directly.
[jira] [Resolved] (SPARK-6566) Update Spark to use the latest version of Parquet libraries
[ https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian resolved SPARK-6566.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 5889
[https://github.com/apache/spark/pull/5889]

> Update Spark to use the latest version of Parquet libraries
> -----------------------------------------------------------
>
>          Key: SPARK-6566
>          URL: https://issues.apache.org/jira/browse/SPARK-6566
>      Project: Spark
>   Issue Type: Improvement
>   Components: SQL
> Affects Versions: 1.3.0
>     Reporter: Konstantin Shaposhnikov
>      Fix For: 1.5.0
>
> There are a lot of bug fixes in the latest version of parquet (1.6.0rc7), e.g. PARQUET-136.
> It would be good to update Spark to use the latest parquet version.
> The following changes are required:
> {code}
> diff --git a/pom.xml b/pom.xml
> index 5ad39a9..095b519 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -132,7 +132,7 @@
>  0.13.1
>  10.10.1.1
> -1.6.0rc3
> +1.6.0rc7
>  1.2.3
>  8.1.14.v20131031
>  3.0.0.v201112011016
> {code}
> and
> {code}
> --- a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
> +++ b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
> @@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
>      globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
>        mergedMetadata, globalMetaData.getCreatedBy)
>
> -    val readContext = getReadSupport(configuration).init(
> +    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
>        new InitContext(configuration,
>          globalMetaData.getKeyValueMetaData,
>          globalMetaData.getSchema))
> {code}
> I am happy to prepare a pull request if necessary.
[jira] [Commented] (SPARK-8311) saveAsTextFile with Hadoop1 could lead to errors
[ https://issues.apache.org/jira/browse/SPARK-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582977#comment-14582977 ]

Shivaram Venkataraman commented on SPARK-8311:
----------------------------------------------

Yeah, it looks very similar. I'll close this and follow 8057.

> saveAsTextFile with Hadoop1 could lead to errors
> ------------------------------------------------
>
>          Key: SPARK-8311
>          URL: https://issues.apache.org/jira/browse/SPARK-8311
>      Project: Spark
>   Issue Type: Bug
>   Components: Spark Core
> Affects Versions: 1.3.1
>     Reporter: Shivaram Venkataraman
>
> I've run into this bug a couple of times and wanted to document things I have found so far in
> a JIRA. From what I see, if an application is linked against Hadoop1 and running on a Spark
> 1.3.1 + Hadoop1 cluster, then the saveAsTextFile call consistently fails with errors of the
> form:
> {code}
> 15/06/11 19:47:10 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 3.0 (TID 13, ip-10-212-141-222.us-west-2.compute.internal): java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
>         at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
>         at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
> {code}
> This does not happen in 1.2.1.
> I think the bug is caused by the following commit:
> https://github.com/apache/spark/commit/fde6945417355ae57500b67d034c9cad4f20d240
> where the function `commitTask` assumes that the mrTaskContext is always a
> `mapreduce.TaskContext` while it is a `mapred.TaskContext` in Hadoop1. But this is just a
> hypothesis, as I haven't tried reverting it to see if the problem goes away.
> cc [~liancheng]
[jira] [Commented] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582971#comment-14582971 ]

Mark Smith commented on SPARK-8322:
-----------------------------------

This is the backport to branch-1.4.

> EC2 script not fully updated for 1.4.0 release
> ----------------------------------------------
>
>          Key: SPARK-8322
>          URL: https://issues.apache.org/jira/browse/SPARK-8322
>      Project: Spark
>   Issue Type: Bug
>   Components: EC2
> Affects Versions: 1.4.0
>     Reporter: Mark Smith
>       Labels: easyfix
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to the
> VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to break for the
> latest release.
[jira] [Commented] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582970#comment-14582970 ]

Apache Spark commented on SPARK-8322:
-------------------------------------

User 'markmsmith' has created a pull request for this issue:
https://github.com/apache/spark/pull/6777

> EC2 script not fully updated for 1.4.0 release
> ----------------------------------------------
>
>          Key: SPARK-8322
>          URL: https://issues.apache.org/jira/browse/SPARK-8322
>      Project: Spark
>   Issue Type: Bug
>   Components: EC2
> Affects Versions: 1.4.0
>     Reporter: Mark Smith
>       Labels: easyfix
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to the
> VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to break for the
> latest release.
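The shape of the fix can be sketched like this. This is a hypothetical excerpt, not the real spark_ec2.py: the actual sets hold many more entries, and the Tachyon version paired with 1.4.0 here is an assumption — check the merged pull request for the real value:

```python
# Versions the launch script will accept; "1.4.0" is the missing entry.
VALID_SPARK_VERSIONS = {"1.2.1", "1.3.0", "1.3.1", "1.4.0"}

# Spark version -> bundled Tachyon version; the 1.4.0 pairing is assumed.
SPARK_TACHYON_MAP = {
    "1.3.0": "0.5.0",
    "1.3.1": "0.5.0",
    "1.4.0": "0.6.4",
}

def validate_spark_version(version):
    # Fail fast with a clear message instead of breaking later in the launch.
    if version not in VALID_SPARK_VERSIONS:
        raise SystemExit(f"Spark version {version!r} is not recognized by this script")
    return version
```

Without the new entries, requesting 1.4.0 would fall through both lookups, which is exactly the breakage the issue describes.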
[jira] [Updated] (SPARK-7862) Query would hang when the using script has error output in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-7862:
------------------------------------
    Assignee: zhichao-li

> Query would hang when the using script has error output in SparkSQL
> -------------------------------------------------------------------
>
>          Key: SPARK-7862
>          URL: https://issues.apache.org/jira/browse/SPARK-7862
>      Project: Spark
>   Issue Type: Bug
>   Components: SQL
>     Reporter: zhichao-li
>     Assignee: zhichao-li
>      Fix For: 1.5.0
>
> Steps to reproduce:
> {code}
> val data = (1 to 10).map { i => (i, i, i) }
> data.toDF("d1", "d2", "d3").registerTempTable("script_trans")
> sql("SELECT TRANSFORM (d1, d2, d3) USING 'cat 1>&2' AS (a,b,c) FROM script_trans")
> {code}
[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582965#comment-14582965 ]

Peng Cheng commented on SPARK-7442:
-----------------------------------

Still not fixed in 1.4.0 ... reverting to hadoop 2.4 until this is resolved.

> Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
> -------------------------------------------------------------------------
>
>          Key: SPARK-7442
>          URL: https://issues.apache.org/jira/browse/SPARK-7442
>      Project: Spark
>   Issue Type: Bug
>   Components: Build
> Affects Versions: 1.3.1
>  Environment: OS X
>     Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}.
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 that doesn't work.
[jira] [Resolved] (SPARK-7862) Query would hang when the using script has error output in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-7862.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 6404
[https://github.com/apache/spark/pull/6404]

> Query would hang when the using script has error output in SparkSQL
> -------------------------------------------------------------------
>
>          Key: SPARK-7862
>          URL: https://issues.apache.org/jira/browse/SPARK-7862
>      Project: Spark
>   Issue Type: Bug
>   Components: SQL
>     Reporter: zhichao-li
>      Fix For: 1.5.0
>
> Steps to reproduce:
> {code}
> val data = (1 to 10).map { i => (i, i, i) }
> data.toDF("d1", "d2", "d3").registerTempTable("script_trans")
> sql("SELECT TRANSFORM (d1, d2, d3) USING 'cat 1>&2' AS (a,b,c) FROM script_trans")
> {code}
[jira] [Issue Comment Deleted] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Smith updated SPARK-8322:
------------------------------
    Comment: was deleted

(was: This should probably also be back-ported from master to the 1.4 branch, but I haven't
made a pull request for that.)

> EC2 script not fully updated for 1.4.0 release
> ----------------------------------------------
>
>          Key: SPARK-8322
>          URL: https://issues.apache.org/jira/browse/SPARK-8322
>      Project: Spark
>   Issue Type: Bug
>   Components: EC2
> Affects Versions: 1.4.0
>     Reporter: Mark Smith
>       Labels: easyfix
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to the
> VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to break for the
> latest release.
[jira] [Resolved] (SPARK-8317) Do not push sort into shuffle in Exchange operator
[ https://issues.apache.org/jira/browse/SPARK-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-8317.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 6772
[https://github.com/apache/spark/pull/6772]

> Do not push sort into shuffle in Exchange operator
> --------------------------------------------------
>
>          Key: SPARK-8317
>          URL: https://issues.apache.org/jira/browse/SPARK-8317
>      Project: Spark
>   Issue Type: Improvement
>   Components: SQL
>     Reporter: Josh Rosen
>     Assignee: Josh Rosen
>      Fix For: 1.5.0
>
> In some cases, Spark SQL pushes sorting operations into the shuffle layer by specifying a key
> ordering as part of the shuffle dependency. I think that we should not do this:
> - Since we do not delegate aggregation to Spark's shuffle, specifying the keyOrdering as part
>   of the shuffle has no effect on the shuffle map side.
> - By performing the sort ourselves (by inserting a sort operator after the shuffle instead),
>   we can use the Exchange planner to choose specialized sorting implementations based on the
>   types of rows being sorted.
> - We can remove some complexity from SqlSerializer2 by not requiring it to know about sort
>   orderings, since SQL's own sort operators will already perform the necessary defensive
>   copying.
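The approach the issue advocates — an order-agnostic shuffle followed by a per-partition sort — can be sketched in Python. Everything here is illustrative (made-up rows and boundary values, none of Spark's operators): the point is that range partitions that are each sorted after the exchange concatenate into a total order.

```python
import bisect

# Rows to "shuffle"; keys are deliberately unordered.
rows = [(k, f"v{k}") for k in [42, 7, 99, 3, 56, 18, 77, 64]]

# Range-partition boundaries, as an Exchange using range partitioning might pick.
bounds = [30, 60]

# Shuffle write: route each row to its range partition, with NO key ordering.
partitions = [[] for _ in range(len(bounds) + 1)]
for row in rows:
    partitions[bisect.bisect_left(bounds, row[0])].append(row)

# Sort operator inserted *after* the exchange, run once per output partition.
sorted_partitions = [sorted(p) for p in partitions]

# Because the partitions cover disjoint, increasing key ranges,
# concatenating them yields the totally ordered result.
merged = [r for p in sorted_partitions for r in p]
assert merged == sorted(rows)
```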
[jira] [Commented] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582948#comment-14582948 ] Mark Smith commented on SPARK-8322: --- This should probably also be back-ported from master to the 1.4 branch, but I haven't made a pull request for that. > EC2 script not fully updated for 1.4.0 release > -- > > Key: SPARK-8322 > URL: https://issues.apache.org/jira/browse/SPARK-8322 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.4.0 >Reporter: Mark Smith > Labels: easyfix > > In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to > the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to > break for the latest release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8322: --- Assignee: Apache Spark > EC2 script not fully updated for 1.4.0 release > -- > > Key: SPARK-8322 > URL: https://issues.apache.org/jira/browse/SPARK-8322 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.4.0 >Reporter: Mark Smith >Assignee: Apache Spark > Labels: easyfix > > In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to > the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to > break for the latest release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Smith updated SPARK-8322: -- Target Version/s: (was: 1.4.0) Fix Version/s: (was: 1.4.0) > EC2 script not fully updated for 1.4.0 release > -- > > Key: SPARK-8322 > URL: https://issues.apache.org/jira/browse/SPARK-8322 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.4.0 >Reporter: Mark Smith > Labels: easyfix > > In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to > the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to > break for the latest release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582942#comment-14582942 ] Apache Spark commented on SPARK-8322: - User 'markmsmith' has created a pull request for this issue: https://github.com/apache/spark/pull/6776 > EC2 script not fully updated for 1.4.0 release > -- > > Key: SPARK-8322 > URL: https://issues.apache.org/jira/browse/SPARK-8322 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.4.0 >Reporter: Mark Smith > Labels: easyfix > > In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to > the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to > break for the latest release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8322: --- Assignee: (was: Apache Spark) > EC2 script not fully updated for 1.4.0 release > -- > > Key: SPARK-8322 > URL: https://issues.apache.org/jira/browse/SPARK-8322 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.4.0 >Reporter: Mark Smith > Labels: easyfix > > In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to > the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to > break for the latest release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582937#comment-14582937 ] Sean Owen commented on SPARK-8322: -- Related to SPARK-8310. You'll probably want a PR for both master and 1.4 here. CC [~shivaram] > EC2 script not fully updated for 1.4.0 release > -- > > Key: SPARK-8322 > URL: https://issues.apache.org/jira/browse/SPARK-8322 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.4.0 >Reporter: Mark Smith > Labels: easyfix > Fix For: 1.4.0 > > > In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to > the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to > break for the latest release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
Mark Smith created SPARK-8322: - Summary: EC2 script not fully updated for 1.4.0 release Key: SPARK-8322 URL: https://issues.apache.org/jira/browse/SPARK-8322 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Mark Smith Fix For: 1.4.0 In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to break for the latest release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
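For readers unfamiliar with the script, the fix the report asks for amounts to one new entry in each lookup table. A minimal sketch of the shape of the change, with made-up table contents and a hypothetical Tachyon version (the real spark_ec2.py entries differ):

```python
# Illustrative stand-ins for the two lookup tables in spark_ec2.py;
# the real entries differ, and the Tachyon version below is hypothetical.
VALID_SPARK_VERSIONS = {"1.2.1", "1.3.0", "1.3.1"}
SPARK_TACHYON_MAP = {"1.2.1": "0.5.0", "1.3.0": "0.5.0", "1.3.1": "0.5.0"}

# The reported fix: register the new release in both tables so the
# version check and the Tachyon lookup stop failing for 1.4.0.
VALID_SPARK_VERSIONS.add("1.4.0")
SPARK_TACHYON_MAP["1.4.0"] = "0.6.4"  # hypothetical Tachyon version

def get_tachyon_version(spark_version):
    # Mirrors the kind of guard that made the script "break for the
    # latest release" while the tables were stale.
    if spark_version not in VALID_SPARK_VERSIONS:
        raise ValueError("Unknown Spark version: %s" % spark_version)
    return SPARK_TACHYON_MAP[spark_version]
```

With the stale tables, `get_tachyon_version("1.4.0")` would raise; after the one-line additions it resolves normally.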
[jira] [Resolved] (SPARK-8311) saveAsTextFile with Hadoop1 could lead to errors
[ https://issues.apache.org/jira/browse/SPARK-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8311. -- Resolution: Duplicate Yes, 95% sure that's a duplicate > saveAsTextFile with Hadoop1 could lead to errors > > > Key: SPARK-8311 > URL: https://issues.apache.org/jira/browse/SPARK-8311 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1 >Reporter: Shivaram Venkataraman > > I've run into this bug a couple of times and wanted to document things I have > found so far in a JIRA. From what I see, if an application is linked to > Hadoop1 and running on a Spark 1.3.1 + Hadoop1 cluster, then the > saveAsTextFile call consistently fails with errors of the form > {code} > 15/06/11 19:47:10 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 3.0 > (TID 13, ip-10-212-141-222.us-west-2.compute.internal): > java.lang.IncompatibleClassChangeError: Found class > org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected > at > org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95) > at > org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059) > {code} > This does not happen in 1.2.1. > I think the bug is caused by the following commit > https://github.com/apache/spark/commit/fde6945417355ae57500b67d034c9cad4f20d240 > where the function `commitTask` assumes that the mrTaskContext is always > a `mapreduce.TaskContext` while it is a `mapred.TaskContext` in Hadoop1. But > this is just a hypothesis, as I haven't tried reverting this to see if the > problem goes away. > cc [~liancheng] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8318) Spark Streaming Starter JIRAs
[ https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582929#comment-14582929 ] Sean Owen commented on SPARK-8318: -- Minor, but doesn't Component + label = starter already capture that, instead of having to maintain and eventually resolve (?) another JIRA? > Spark Streaming Starter JIRAs > - > > Key: SPARK-8318 > URL: https://issues.apache.org/jira/browse/SPARK-8318 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Tathagata Das >Priority: Minor > Labels: starter > > This is a master JIRA to collect together all starter tasks related to Spark > Streaming. These are simple tasks that contributors can do to get familiar > with the process of contributing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8321) Authorization Support (on all operations, not only DDL) in Spark SQL
Sunil created SPARK-8321: Summary: Authorization Support (on all operations, not only DDL) in Spark SQL Key: SPARK-8321 URL: https://issues.apache.org/jira/browse/SPARK-8321 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 1.3.0 Reporter: Sunil Currently, if you run Spark SQL with the Thrift server, it only supports authentication and limited authorization (DDL only). We want to extend it to provide full authorization, or a pluggable authorization framework like Apache Sentry, so that users with the proper roles can access data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582871#comment-14582871 ] Patrick Grandjean commented on SPARK-7768: -- Registering UDTs for existing classes would be the perfect solution for SPARK-6875 (https://issues.apache.org/jira/browse/SPARK-6875) > Make user-defined type (UDT) API public > --- > > Key: SPARK-7768 > URL: https://issues.apache.org/jira/browse/SPARK-7768 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Priority: Critical > > As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it > would be nice to make the UDT API public in 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8120) Typos in warning message in sql/types.py
[ https://issues.apache.org/jira/browse/SPARK-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582857#comment-14582857 ] Joseph K. Bradley commented on SPARK-8120: -- Hm, I must have not looked carefully. Sorry about the trouble! I'll close the JIRA. > Typos in warning message in sql/types.py > > > Key: SPARK-8120 > URL: https://issues.apache.org/jira/browse/SPARK-8120 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > See > [https://github.com/apache/spark/blob/3ba6fc515d6ea45c281bb81f648a38523be06383/python/pyspark/sql/types.py#L1093] > Need to fix string concat + use of % -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8120) Typos in warning message in sql/types.py
[ https://issues.apache.org/jira/browse/SPARK-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-8120. Resolution: Not A Problem > Typos in warning message in sql/types.py > > > Key: SPARK-8120 > URL: https://issues.apache.org/jira/browse/SPARK-8120 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > See > [https://github.com/apache/spark/blob/3ba6fc515d6ea45c281bb81f648a38523be06383/python/pyspark/sql/types.py#L1093] > Need to fix string concat + use of % -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8240) string function: concat
[ https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8240: --- Assignee: Cheng Hao (was: Apache Spark) > string function: concat > --- > > Key: SPARK-8240 > URL: https://issues.apache.org/jira/browse/SPARK-8240 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Hao > > concat(string|binary A, string|binary B...): string / binary > Returns the string or bytes resulting from concatenating the strings or bytes > passed in as parameters in order. For example, concat('foo', 'bar') results > in 'foobar'. Note that this function can take any number of input strings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8240) string function: concat
[ https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8240: --- Assignee: Apache Spark (was: Cheng Hao) > string function: concat > --- > > Key: SPARK-8240 > URL: https://issues.apache.org/jira/browse/SPARK-8240 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > concat(string|binary A, string|binary B...): string / binary > Returns the string or bytes resulting from concatenating the strings or bytes > passed in as parameters in order. For example, concat('foo', 'bar') results > in 'foobar'. Note that this function can take any number of input strings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8240) string function: concat
[ https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582849#comment-14582849 ] Apache Spark commented on SPARK-8240: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6775 > string function: concat > --- > > Key: SPARK-8240 > URL: https://issues.apache.org/jira/browse/SPARK-8240 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Hao > > concat(string|binary A, string|binary B...): string / binary > Returns the string or bytes resulting from concatenating the strings or bytes > passed in as parameters in order. For example, concat('foo', 'bar') results > in 'foobar'. Note that this function can take any number of input strings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8241) string function: concat_ws
[ https://issues.apache.org/jira/browse/SPARK-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8241: --- Assignee: Apache Spark (was: Cheng Hao) > string function: concat_ws > -- > > Key: SPARK-8241 > URL: https://issues.apache.org/jira/browse/SPARK-8241 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > concat_ws(string SEP, string A, string B...): string > concat_ws(string SEP, array): string -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8241) string function: concat_ws
[ https://issues.apache.org/jira/browse/SPARK-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582850#comment-14582850 ] Apache Spark commented on SPARK-8241: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6775 > string function: concat_ws > -- > > Key: SPARK-8241 > URL: https://issues.apache.org/jira/browse/SPARK-8241 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Hao > > concat_ws(string SEP, string A, string B...): string > concat_ws(string SEP, array): string -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8241) string function: concat_ws
[ https://issues.apache.org/jira/browse/SPARK-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8241: --- Assignee: Cheng Hao (was: Apache Spark) > string function: concat_ws > -- > > Key: SPARK-8241 > URL: https://issues.apache.org/jira/browse/SPARK-8241 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Hao > > concat_ws(string SEP, string A, string B...): string > concat_ws(string SEP, array): string -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8129) Securely pass auth secrets to executors in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582832#comment-14582832 ] Apache Spark commented on SPARK-8129: - User 'kanzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/6774 > Securely pass auth secrets to executors in standalone cluster mode > -- > > Key: SPARK-8129 > URL: https://issues.apache.org/jira/browse/SPARK-8129 > Project: Spark > Issue Type: New Feature > Components: Deploy, Spark Core >Reporter: Kan Zhang >Priority: Critical > > Currently, when authentication is turned on, the standalone cluster manager > passes auth secrets to executors (also drivers in cluster mode) as java > options on the command line, which isn't secure. The passed secret can be > seen by anyone running 'ps' command, e.g., > bq. 501 94787 94734 0 2:32PM ?? 0:00.78 > /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java > -cp > /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar > -Xms512M -Xmx512M > *-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* > -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler > --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id > app-20150605143259- --worker-url > akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
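To make the `ps` exposure concrete: anything placed in a process's argv is world-readable, while on typical systems another unprivileged user cannot read a process's environment. A hedged sketch of the environment-variable alternative — the variable name `_SPARK_AUTH_SECRET` is invented for illustration and is not necessarily what the eventual patch used:

```python
import os
import subprocess
import sys

# Command-line approach (what the issue complains about): a secret in
#   java ... -Dspark.authenticate.secret=090A03...
# shows up in the argv column of `ps` for every user on the machine.

# Environment approach: the secret never appears in argv, only in the
# child's environment.
env = dict(os.environ)
env["_SPARK_AUTH_SECRET"] = "090A030E0F0A0501090A0C0E0C0B03050D05"

# The child reads the secret from its environment, not its arguments.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['_SPARK_AUTH_SECRET'])"],
    env=env, capture_output=True, text=True,
)
secret_seen_by_child = child.stdout.strip()
```

The design trade-off is the usual one: the environment is inherited by further children, so a real implementation would also need to scrub the variable after reading it.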
[jira] [Commented] (SPARK-6892) Recovery from checkpoint will also reuse the application id when writing eventLog in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582829#comment-14582829 ] Tathagata Das commented on SPARK-6892: -- [~hshreedharan] Could you take a look at this? I think the event logging directory already exists, and that is causing this issue. > Recovery from checkpoint will also reuse the application id when writing > eventLog in yarn-cluster mode > > > Key: SPARK-6892 > URL: https://issues.apache.org/jira/browse/SPARK-6892 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 >Reporter: yangping wu >Priority: Critical > > When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, > I found it will reuse the old application id (in my case > application_1428664056212_0016) when writing the Spark eventLog. But > now my application id is application_1428664056212_0017, so writing the > eventLog fails with the following stacktrace: > {code} > 15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' > failed, java.io.IOException: Target log file already exists > (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016) > java.io.IOException: Target log file already exists > (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016) > at > org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201) > at > org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388) > at > org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388) > at scala.Option.foreach(Option.scala:236) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1388) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > {code} > This exception causes the job to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
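The collision above is a recovered driver reusing the previous attempt's application id as its event-log file name. One way to sidestep it — a hypothetical helper for illustration, not the actual Spark fix — is to make the log path attempt-specific:

```python
import os

def event_log_path(log_dir, app_id, attempt):
    # Hypothetical: suffix the attempt number so a driver recovered from
    # checkpoint never targets the file the previous attempt already wrote.
    return os.path.join(log_dir, "%s_attempt-%d" % (app_id, attempt))

first = event_log_path("/spark-logs/eventLog",
                       "application_1428664056212_0016", 1)
recovered = event_log_path("/spark-logs/eventLog",
                           "application_1428664056212_0016", 2)
```

Distinct paths per attempt avoid the `Target log file already exists` failure in the shutdown hook, at the cost of the history server needing to understand the suffix.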
[jira] [Commented] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN
[ https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582796#comment-14582796 ] Saisai Shao commented on SPARK-8297: OK, thanks [~mridulm80], I will give it a try. From my understanding, Akka will also get notified if the connection is abruptly lost, but I haven't tested it yet. > Scheduler backend is not notified in case node fails in YARN > > > Key: SPARK-8297 > URL: https://issues.apache.org/jira/browse/SPARK-8297 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0 > Environment: Spark on yarn - both client and cluster mode. >Reporter: Mridul Muralidharan >Priority: Critical > > When a node crashes, yarn detects the failure and notifies spark - but this > information is not propagated to scheduler backend (unlike in mesos mode, for > example). > It results in repeated re-execution of stages (due to FetchFailedException on > shuffle side), resulting finally in application failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8120) Typos in warning message in sql/types.py
[ https://issues.apache.org/jira/browse/SPARK-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582797#comment-14582797 ] Jihun Kang commented on SPARK-8120: --- I think it works as expected. I got the following output, and there are no errors. {noformat} field name __c__ can not be accessed in Python,use position to access it instead "use position to access it instead" % name) {noformat} > Typos in warning message in sql/types.py > > > Key: SPARK-8120 > URL: https://issues.apache.org/jira/browse/SPARK-8120 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > See > [https://github.com/apache/spark/blob/3ba6fc515d6ea45c281bb81f648a38523be06383/python/pyspark/sql/types.py#L1093] > Need to fix string concat + use of % -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
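The "works as expected" outcome follows from Python's parsing rules: adjacent string literals are joined at compile time, before the `%` operator applies, so splitting the warning message across two literals with `%` on the last one still formats the whole string. A quick check reproducing the quoted output:

```python
name = "__c__"
# Adjacent literals concatenate first, then % formats the combined
# string, so this is equivalent to ("field name %s ... instead" % name).
warning = ("field name %s can not be accessed in Python,"
           "use position to access it instead" % name)
```

This is why the JIRA was closed as Not A Problem rather than fixed.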
[jira] [Assigned] (SPARK-8319) Update logic related to key ordering in shuffle dependencies
[ https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8319: --- Assignee: Apache Spark (was: Josh Rosen) > Update logic related to key ordering in shuffle dependencies > > > Key: SPARK-8319 > URL: https://issues.apache.org/jira/browse/SPARK-8319 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Reporter: Josh Rosen >Assignee: Apache Spark > > The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever > the shuffle dependency specifies a key ordering, but technically we only need > to fall back when an aggregator is also specified. We should update the > fallback logic to handle this case so that the Tungsten optimizations can > apply to more workloads. > I also noticed that the SQL Exchange operator performs defensive copying of > shuffle inputs when a key ordering is specified, but this is unnecessary: the > only shuffle manager that performs sorting on the map side is > SortShuffleManager, and it only performs sorting if an aggregator is > specified. SQL never uses Spark's shuffle for performing aggregation, so > this copying is unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8319) Update logic related to key ordering in shuffle dependencies
[ https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582788#comment-14582788 ] Apache Spark commented on SPARK-8319: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/6773 > Update logic related to key ordering in shuffle dependencies > > > Key: SPARK-8319 > URL: https://issues.apache.org/jira/browse/SPARK-8319 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever > the shuffle dependency specifies a key ordering, but technically we only need > to fall back when an aggregator is also specified. We should update the > fallback logic to handle this case so that the Tungsten optimizations can > apply to more workloads. > I also noticed that the SQL Exchange operator performs defensive copying of > shuffle inputs when a key ordering is specified, but this is unnecessary: the > only shuffle manager that performs sorting on the map side is > SortShuffleManager, and it only performs sorting if an aggregator is > specified. SQL never uses Spark's shuffle for performing aggregation, so > this copying is unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8319) Update logic related to key ordering in shuffle dependencies
[ https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8319: --- Assignee: Josh Rosen (was: Apache Spark) > Update logic related to key ordering in shuffle dependencies > > > Key: SPARK-8319 > URL: https://issues.apache.org/jira/browse/SPARK-8319 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever > the shuffle dependency specifies a key ordering, but technically we only need > to fall back when an aggregator is also specified. We should update the > fallback logic to handle this case so that the Tungsten optimizations can > apply to more workloads. > I also noticed that the SQL Exchange operator performs defensive copying of > shuffle inputs when a key ordering is specified, but this is unnecessary: the > only shuffle manager that performs sorting on the map side is > SortShuffleManager, and it only performs sorting if an aggregator is > specified. SQL never uses Spark's shuffle for performing aggregation, so > this copying is unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7158) collect and take return different results
[ https://issues.apache.org/jira/browse/SPARK-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7158: Assignee: Cheng Hao > collect and take return different results > - > > Key: SPARK-7158 > URL: https://issues.apache.org/jira/browse/SPARK-7158 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Hao >Priority: Blocker > Fix For: 1.5.0 > > > Reported by [~rams] > {code} > import java.util.UUID > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val rdd = sc.parallelize(List(1,2,3), 2) > val schema = StructType(List(StructField("index",IntegerType,true))) > val df = sqlContext.createDataFrame(rdd.map(p => Row(p)), schema) > def id:() => String = () => {UUID.randomUUID().toString()} > def square:Int => Int = (x: Int) => {x * x} > val dfWithId = df.withColumn("id",callUDF(id, StringType)).cache() //expect > the ID to have materialized at this point > dfWithId.collect() > //res0: Array[org.apache.spark.sql.Row] = > Array([1,43c7b8e2-b4a3-43ee-beff-0bb4b7d6c1b1], > [2,efd061be-e8cc-43fa-956e-cfd6e7355982], > [3,79b0baab-627c-4761-af0d-8995b8c5a125]) > val dfWithIdAndSquare = dfWithId.withColumn("square",callUDF(square, > IntegerType, col("index"))) > dfWithIdAndSquare.collect() > //res1: Array[org.apache.spark.sql.Row] = > Array([1,a3b2e744-a0a1-40fe-8133-87a67660b4ab,1], > [2,0a7052a0-6071-4ef5-a25a-2670248ea5cd,4], > [3,209f269e-207a-4dfd-a186-738be5db2eff,9]) > //why are the IDs in lines 11 and 15 different? > {code} > The randomly generated IDs are the same if show (which uses take under the > hood) is used instead of collect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8319) Update logic related to key ordering in shuffle dependencies
[ https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8319: -- Summary: Update logic related to key ordering in shuffle dependencies (was: Update several pieces of shuffle logic related to key orderings) > Update logic related to key ordering in shuffle dependencies > > > Key: SPARK-8319 > URL: https://issues.apache.org/jira/browse/SPARK-8319 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever > the shuffle dependency specifies a key ordering, but technically we only need > to fall back when an aggregator is also specified. We should update the > fallback logic to handle this case so that the Tungsten optimizations can > apply to more workloads. > I also noticed that the SQL Exchange operator performs defensive copying of > shuffle inputs when a key ordering is specified, but this is unnecessary: the > only shuffle manager that performs sorting on the map side is > SortShuffleManager, and it only performs sorting if an aggregator is > specified. SQL never uses Spark's shuffle for performing aggregation, so > this copying is unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7158) collect and take return different results
[ https://issues.apache.org/jira/browse/SPARK-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7158. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 5714 [https://github.com/apache/spark/pull/5714] > collect and take return different results > - > > Key: SPARK-7158 > URL: https://issues.apache.org/jira/browse/SPARK-7158 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Priority: Blocker > Fix For: 1.5.0 > > > Reported by [~rams] > {code} > import java.util.UUID > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val rdd = sc.parallelize(List(1,2,3), 2) > val schema = StructType(List(StructField("index",IntegerType,true))) > val df = sqlContext.createDataFrame(rdd.map(p => Row(p)), schema) > def id:() => String = () => {UUID.randomUUID().toString()} > def square:Int => Int = (x: Int) => {x * x} > val dfWithId = df.withColumn("id",callUDF(id, StringType)).cache() //expect > the ID to have materialized at this point > dfWithId.collect() > //res0: Array[org.apache.spark.sql.Row] = > Array([1,43c7b8e2-b4a3-43ee-beff-0bb4b7d6c1b1], > [2,efd061be-e8cc-43fa-956e-cfd6e7355982], > [3,79b0baab-627c-4761-af0d-8995b8c5a125]) > val dfWithIdAndSquare = dfWithId.withColumn("square",callUDF(square, > IntegerType, col("index"))) > dfWithIdAndSquare.collect() > //res1: Array[org.apache.spark.sql.Row] = > Array([1,a3b2e744-a0a1-40fe-8133-87a67660b4ab,1], > [2,0a7052a0-6071-4ef5-a25a-2670248ea5cd,4], > [3,209f269e-207a-4dfd-a186-738be5db2eff,9]) > //why are the IDs in lines 11 and 15 different? > {code} > The randomly generated IDs are the same if show (which uses take under the > hood) is used instead of collect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
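The surprising behavior in SPARK-7158 comes down to a non-deterministic UDF being re-evaluated: if the cached column is not actually materialized, each action recomputes the plan and generates fresh UUIDs. A plain-Python analogy (not Spark code) of lazy recomputation versus a materialized result:

```python
import uuid

# Plain-Python analogy (not Spark code) for SPARK-7158: a lazily evaluated,
# non-deterministic column is recomputed on each action, so two actions over
# the same logical plan can observe different UUIDs.
def lazy_with_id(rows):
    # Generates the id column on demand; nothing is cached.
    return [(r, str(uuid.uuid4())) for r in rows]

rows = [1, 2, 3]
first = lazy_with_id(rows)
second = lazy_with_id(rows)  # recomputation: fresh UUIDs every time
print([a[1] for a in first] != [b[1] for b in second])  # True

# Materializing once and reusing the result (what .cache() is expected to
# guarantee) keeps the ids stable across repeated reads.
cached = lazy_with_id(rows)
print([c[1] for c in cached] == [c[1] for c in cached])  # True
```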
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582768#comment-14582768 ] Alexander Ulanov commented on SPARK-5575: - Hi Janani, There is already an implementation of DBN (and RBM) by [~gq]. You can find it here: https://github.com/witgo/spark/tree/ann-interface-gemm-dbn > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > Boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constructs, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8319) Update several pieces of shuffle logic related to key orderings
[ https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8319: -- Summary: Update several pieces of shuffle logic related to key orderings (was: Enable Tungsten shuffle manager for some shuffles that specify key orderings) > Update several pieces of shuffle logic related to key orderings > --- > > Key: SPARK-8319 > URL: https://issues.apache.org/jira/browse/SPARK-8319 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever > the shuffle dependency specifies a key ordering, but technically we only need > to fall back when an aggregator is also specified. We should update the > fallback logic to handle this case so that the Tungsten optimizations can > apply to more workloads. > I also noticed that the SQL Exchange operator performs defensive copying of > shuffle inputs when a key ordering is specified, but this is unnecessary: the > only shuffle manager that performs sorting on the map side is > SortShuffleManager, and it only performs sorting if an aggregator is > specified. SQL never uses Spark's shuffle for performing aggregation, so > this copying is unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8319) Enable Tungsten shuffle manager for some shuffles that specify key orderings
[ https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8319: -- Description: The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever the shuffle dependency specifies a key ordering, but technically we only need to fall back when an aggregator is also specified. We should update the fallback logic to handle this case so that the Tungsten optimizations can apply to more workloads. I also noticed that the SQL Exchange operator performs defensive copying of shuffle inputs when a key ordering is specified, but this is unnecessary: the only shuffle manager that performs sorting on the map side is SortShuffleManager, and it only performs sorting if an aggregator is specified. SQL never uses Spark's shuffle for performing aggregation, so this copying is unnecessary. was: The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever the shuffle dependency specifies a key ordering, but technically we only need to fall back when an aggregator is also specified. We should update the fallback logic to handle this case so that the Tungsten optimizations can apply to more workloads. I also noticed that the SQL Exchange operator performs defensive copying of shuffle inputs when a key ordering is specified, but this is unnecessary: > Enable Tungsten shuffle manager for some shuffles that specify key orderings > > > Key: SPARK-8319 > URL: https://issues.apache.org/jira/browse/SPARK-8319 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever > the shuffle dependency specifies a key ordering, but technically we only need > to fall back when an aggregator is also specified. We should update the > fallback logic to handle this case so that the Tungsten optimizations can > apply to more workloads. 
> I also noticed that the SQL Exchange operator performs defensive copying of > shuffle inputs when a key ordering is specified, but this is unnecessary: the > only shuffle manager that performs sorting on the map side is > SortShuffleManager, and it only performs sorting if an aggregator is > specified. SQL never uses Spark's shuffle for performing aggregation, so > this copying is unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8319) Enable Tungsten shuffle manager for some shuffles that specify key orderings
[ https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8319: -- Description: The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever the shuffle dependency specifies a key ordering, but technically we only need to fall back when an aggregator is also specified. We should update the fallback logic to handle this case so that the Tungsten optimizations can apply to more workloads. I also noticed that the SQL Exchange operator performs defensive copying of shuffle inputs when a key ordering is specified, but this is unnecessary: was:The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever the shuffle dependency specifies a key ordering, but technically we only need to fall back when an aggregator is also specified. We should update the fallback logic to handle this case so that the Tungsten optimizations can apply to more workloads. > Enable Tungsten shuffle manager for some shuffles that specify key orderings > > > Key: SPARK-8319 > URL: https://issues.apache.org/jira/browse/SPARK-8319 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever > the shuffle dependency specifies a key ordering, but technically we only need > to fall back when an aggregator is also specified. We should update the > fallback logic to handle this case so that the Tungsten optimizations can > apply to more workloads. > I also noticed that the SQL Exchange operator performs defensive copying of > shuffle inputs when a key ordering is specified, but this is unnecessary: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8318) Spark Streaming Starter JIRAs
[ https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-8318: - Description: This is a master JIRA to collect together all starter tasks related to Spark Streaming. These are simple tasks that contributors can do to get familiar with the process of contributing. (was: This is a master JIRA to collect together all starter tasks related to Spark Streaming) > Spark Streaming Starter JIRAs > - > > Key: SPARK-8318 > URL: https://issues.apache.org/jira/browse/SPARK-8318 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Tathagata Das >Priority: Minor > Labels: starter > > This is a master JIRA to collect together all starter tasks related to Spark > Streaming. These are simple tasks that contributors can do to get familiar > with the process of contributing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8319) Enable Tungsten shuffle manager for some shuffles that specify key orderings
[ https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8319: -- Component/s: SQL > Enable Tungsten shuffle manager for some shuffles that specify key orderings > > > Key: SPARK-8319 > URL: https://issues.apache.org/jira/browse/SPARK-8319 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever > the shuffle dependency specifies a key ordering, but technically we only need > to fall back when an aggregator is also specified. We should update the > fallback logic to handle this case so that the Tungsten optimizations can > apply to more workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8318) Spark Streaming Starter JIRAs
[ https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-8318: - Labels: starter (was: ) > Spark Streaming Starter JIRAs > - > > Key: SPARK-8318 > URL: https://issues.apache.org/jira/browse/SPARK-8318 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Tathagata Das >Priority: Minor > Labels: starter > > This is a master JIRA to collect together all starter tasks related to Spark > Streaming -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8320) Add example in streaming programming guide that shows union of multiple input streams
Tathagata Das created SPARK-8320: Summary: Add example in streaming programming guide that shows union of multiple input streams Key: SPARK-8320 URL: https://issues.apache.org/jira/browse/SPARK-8320 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Priority: Minor The section on "Level of Parallelism in Data Receiving" has a Scala and a Java example for union of multiple input streams. A Python example should be added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
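The pattern the guide's Scala and Java examples show — create several receivers and union them into one stream — can be illustrated in plain Python, with lists standing in for the input streams (the record values are made up for illustration):

```python
from itertools import chain

# Plain-Python stand-in for the "union of multiple input streams" pattern:
# several receivers each produce a stream of records, and the union merges
# them into a single stream for downstream processing.
receivers = [
    ["a1", "a2"],       # records from receiver 1
    ["b1"],             # records from receiver 2
    ["c1", "c2", "c3"]  # records from receiver 3
]
unified = list(chain.from_iterable(receivers))
print(unified)  # ['a1', 'a2', 'b1', 'c1', 'c2', 'c3']
```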
[jira] [Created] (SPARK-8319) Enable Tungsten shuffle manager for some shuffles that specify key orderings
Josh Rosen created SPARK-8319: - Summary: Enable Tungsten shuffle manager for some shuffles that specify key orderings Key: SPARK-8319 URL: https://issues.apache.org/jira/browse/SPARK-8319 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Josh Rosen Assignee: Josh Rosen The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever the shuffle dependency specifies a key ordering, but technically we only need to fall back when an aggregator is also specified. We should update the fallback logic to handle this case so that the Tungsten optimizations can apply to more workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8318) Spark Streaming Starter JIRAs
Tathagata Das created SPARK-8318: Summary: Spark Streaming Starter JIRAs Key: SPARK-8318 URL: https://issues.apache.org/jira/browse/SPARK-8318 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Priority: Minor This is a master JIRA to collect together all starter tasks related to Spark Streaming -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8311) saveAsTextFile with Hadoop1 could lead to errors
[ https://issues.apache.org/jira/browse/SPARK-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582741#comment-14582741 ] Patrick Wendell commented on SPARK-8311: Is this related to or the same as SPARK-8057? > saveAsTextFile with Hadoop1 could lead to errors > > > Key: SPARK-8311 > URL: https://issues.apache.org/jira/browse/SPARK-8311 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1 >Reporter: Shivaram Venkataraman > > I've run into this bug a couple of times and wanted to document things I have > found so far in a JIRA. From what I see, if an application is linked to > Hadoop1 and running on a Spark 1.3.1 + Hadoop1 cluster, then the > saveAsTextFile call consistently fails with errors of the form > {code} > 15/06/11 19:47:10 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 3.0 > (TID 13, ip-10-212-141-222.us-west-2.compute.internal): > java.lang.IncompatibleClassChangeError: Found class > org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected > at > org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95) > at > org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059) > {code} > This does not happen in 1.2.1. > I think the bug is caused by the following commit > https://github.com/apache/spark/commit/fde6945417355ae57500b67d034c9cad4f20d240 > where the function `commitTask` assumes that the mrTaskContext is always > a `mapreduce.TaskContext` while it is a `mapred.TaskContext` in Hadoop1. 
But > this is just a hypothesis, as I haven't tried reverting this to see if the > problem goes away. > cc [~liancheng] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582720#comment-14582720 ] Jesika Haria edited comment on SPARK-5493 at 6/12/15 12:20 AM: --- Trying to support impersonation with pyspark. It works with the proxy-user flag set on the command line: {code} pyspark --master yarn-client --proxy-user foo {code} However, I actually need to set up the Spark Context programmatically via the Python API, but could find no documentation for this. Is pyspark impersonation via proxy-user even supported at this time? In the absence of this functionality, what is the recommended way of supporting impersonation (especially if setting the HADOOP_PROXY_USER env variable is discouraged in production)? Or if there is a spark config property that corresponds to the proxy-user flag, that would be great too (cannot see one at https://spark.apache.org/docs/latest/configuration.html) was (Author: jesika): Trying to support impersonation with pyspark. It works with the proxy-user flag set on the command line: {code} pyspark --master yarn-client --proxy-user foo {code} However, I actually need to set up the Spark Context programmatically via the Python API, but could find no documentation for this. Is pyspark impersonation via proxy-user even supported at this time? In the absence of this functionality, what is the recommended way of supporting impersonation (especially if setting the HADOOP_PROXY_USER env variable is discouraged in production)? > Support proxy users under kerberos > -- > > Key: SPARK-5493 > URL: https://issues.apache.org/jira/browse/SPARK-5493 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Brock Noland >Assignee: Marcelo Vanzin > Fix For: 1.3.0 > > > When using kerberos, services may want to use spark-submit to submit jobs as > a separate user. For example a service like hive might want to submit jobs as > a client user. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8317) Do not push sort into shuffle in Exchange operator
[ https://issues.apache.org/jira/browse/SPARK-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582724#comment-14582724 ] Apache Spark commented on SPARK-8317: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/6772 > Do not push sort into shuffle in Exchange operator > -- > > Key: SPARK-8317 > URL: https://issues.apache.org/jira/browse/SPARK-8317 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > In some cases, Spark SQL pushes sorting operations into the shuffle layer by > specifying a key ordering as part of the shuffle dependency. I think that we > should not do this: > - Since we do not delegate aggregation to Spark's shuffle, specifying the > keyOrdering as part of the shuffle has no effect on the shuffle map side. > - By performing the shuffle ourselves (by inserting a sort operator after the > shuffle instead), we can use the Exchange planner to choose specialized > sorting implementations based on the types of rows being sorted. > - We can remove some complexity from SqlSerializer2 by not requiring it to > know about sort orderings, since SQL's own sort operators will already > perform the necessary defensive copying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
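The change SPARK-8317 proposes — shuffle first, then sort in a separate operator — can be sketched in miniature. All names here are illustrative, not Spark's Exchange implementation:

```python
# Miniature sketch of "do not push sort into the shuffle": partition the data
# first (the shuffle step), then sort each partition in a separate step,
# instead of asking the shuffle layer to maintain a key ordering.
def shuffle_by_key(records, num_partitions):
    # Hash-partition (key, value) pairs, as a shuffle would.
    partitions = {i: [] for i in range(num_partitions)}
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def sort_partitions(partitions):
    # The post-shuffle sort step; a specialized sort implementation could be
    # chosen here per row type, as the ticket suggests the planner would do.
    return {i: sorted(rows) for i, rows in partitions.items()}

records = [(3, "c"), (1, "a"), (2, "b"), (1, "x")]
result = sort_partitions(shuffle_by_key(records, 2))
print(all(rows == sorted(rows) for rows in result.values()))  # True
```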
[jira] [Assigned] (SPARK-8317) Do not push sort into shuffle in Exchange operator
[ https://issues.apache.org/jira/browse/SPARK-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8317: --- Assignee: Josh Rosen (was: Apache Spark) > Do not push sort into shuffle in Exchange operator > -- > > Key: SPARK-8317 > URL: https://issues.apache.org/jira/browse/SPARK-8317 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > In some cases, Spark SQL pushes sorting operations into the shuffle layer by > specifying a key ordering as part of the shuffle dependency. I think that we > should not do this: > - Since we do not delegate aggregation to Spark's shuffle, specifying the > keyOrdering as part of the shuffle has no effect on the shuffle map side. > - By performing the shuffle ourselves (by inserting a sort operator after the > shuffle instead), we can use the Exchange planner to choose specialized > sorting implementations based on the types of rows being sorted. > - We can remove some complexity from SqlSerializer2 by not requiring it to > know about sort orderings, since SQL's own sort operators will already > perform the necessary defensive copying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8317) Do not push sort into shuffle in Exchange operator
[ https://issues.apache.org/jira/browse/SPARK-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8317: --- Assignee: Apache Spark (was: Josh Rosen) > Do not push sort into shuffle in Exchange operator > -- > > Key: SPARK-8317 > URL: https://issues.apache.org/jira/browse/SPARK-8317 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen >Assignee: Apache Spark > > In some cases, Spark SQL pushes sorting operations into the shuffle layer by > specifying a key ordering as part of the shuffle dependency. I think that we > should not do this: > - Since we do not delegate aggregation to Spark's shuffle, specifying the > keyOrdering as part of the shuffle has no effect on the shuffle map side. > - By performing the shuffle ourselves (by inserting a sort operator after the > shuffle instead), we can use the Exchange planner to choose specialized > sorting implementations based on the types of rows being sorted. > - We can remove some complexity from SqlSerializer2 by not requiring it to > know about sort orderings, since SQL's own sort operators will already > perform the necessary defensive copying. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582720#comment-14582720 ] Jesika Haria commented on SPARK-5493: - Trying to support impersonation with pyspark. It works with the proxy-user flag set on the command line: {code} pyspark --master yarn-client --proxy-user foo {code} However, I actually need to set up the Spark Context programmatically via the Python API, but could find no documentation for this. Is pyspark impersonation via proxy-user even supported at this time? In the absence of this functionality, what is the recommended way of supporting impersonation (especially if setting the HADOOP_PROXY_USER env variable is discouraged in production)? > Support proxy users under kerberos > -- > > Key: SPARK-5493 > URL: https://issues.apache.org/jira/browse/SPARK-5493 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Brock Noland >Assignee: Marcelo Vanzin > Fix For: 1.3.0 > > > When using kerberos, services may want to use spark-submit to submit jobs as > a separate user. For example a service like hive might want to submit jobs as > a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8208) math function: ceiling
[ https://issues.apache.org/jira/browse/SPARK-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8208. Resolution: Fixed Fix Version/s: 1.5.0 > math function: ceiling > -- > > Key: SPARK-8208 > URL: https://issues.apache.org/jira/browse/SPARK-8208 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > We already have ceil -- just need to create an alias for it in > FunctionRegistry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
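Several of the resolved sub-tasks below follow the same pattern as this one: the expression already exists, and only an alias needs to be registered. The pattern can be modeled in miniature with a plain dictionary (this mirrors the idea of ceiling -> ceil, not Spark's actual FunctionRegistry API):

```python
import math

# Miniature model of registering an alias for an existing function: both
# names resolve to the same implementation, and no new code is written.
registry = {"ceil": math.ceil}
registry["ceiling"] = registry["ceil"]  # alias, no new implementation

print(registry["ceiling"](2.1))  # 3
print(registry["ceiling"] is registry["ceil"])  # True: same function object
```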
[jira] [Resolved] (SPARK-8211) math function: radians
[ https://issues.apache.org/jira/browse/SPARK-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8211. Resolution: Fixed Fix Version/s: 1.5.0 > math function: radians > -- > > Key: SPARK-8211 > URL: https://issues.apache.org/jira/browse/SPARK-8211 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Alias toRadians -> radians in FunctionRegistry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8251) string function: alias upper / ucase
[ https://issues.apache.org/jira/browse/SPARK-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8251. Resolution: Fixed Fix Version/s: 1.5.0 > string function: alias upper / ucase > > > Key: SPARK-8251 > URL: https://issues.apache.org/jira/browse/SPARK-8251 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Alias upper / ucase in FunctionRegistry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8229) conditional function: isnotnull
[ https://issues.apache.org/jira/browse/SPARK-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8229. Resolution: Fixed Fix Version/s: 1.5.0 > conditional function: isnotnull > --- > > Key: SPARK-8229 > URL: https://issues.apache.org/jira/browse/SPARK-8229 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Just need to register it in the FunctionRegistry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8216) math function: rename log -> ln
[ https://issues.apache.org/jira/browse/SPARK-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8216. Resolution: Fixed Fix Version/s: 1.5.0 > math function: rename log -> ln > --- > > Key: SPARK-8216 > URL: https://issues.apache.org/jira/browse/SPARK-8216 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Rename expression Log -> Ln. > Also create aliased data frame functions, and update FunctionRegistry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8225) math function: alias sign / signum
[ https://issues.apache.org/jira/browse/SPARK-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8225. Resolution: Fixed Fix Version/s: 1.5.0 > math function: alias sign / signum > -- > > Key: SPARK-8225 > URL: https://issues.apache.org/jira/browse/SPARK-8225 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Alias them in FunctionRegistry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8228) conditional function: isnull
[ https://issues.apache.org/jira/browse/SPARK-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8228. Resolution: Fixed Fix Version/s: 1.5.0 > conditional function: isnull > > > Key: SPARK-8228 > URL: https://issues.apache.org/jira/browse/SPARK-8228 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Just need to register it in FunctionRegistry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8222) math function: alias power / pow
[ https://issues.apache.org/jira/browse/SPARK-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8222. Resolution: Fixed Fix Version/s: 1.5.0 > math function: alias power / pow > > > Key: SPARK-8222 > URL: https://issues.apache.org/jira/browse/SPARK-8222 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Add to FunctionRegistry power. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8219) math function: negative
[ https://issues.apache.org/jira/browse/SPARK-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8219. Resolution: Fixed Fix Version/s: 1.5.0 > math function: negative > --- > > Key: SPARK-8219 > URL: https://issues.apache.org/jira/browse/SPARK-8219 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > This is just an alias for UnaryMinus. Only add it to FunctionRegistry, and > not DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8210) math function: degrees
[ https://issues.apache.org/jira/browse/SPARK-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8210. Resolution: Fixed Fix Version/s: 1.5.0 > math function: degrees > -- > > Key: SPARK-8210 > URL: https://issues.apache.org/jira/browse/SPARK-8210 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Alias todegrees -> degrees. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8250) string function: alias lower/lcase
[ https://issues.apache.org/jira/browse/SPARK-8250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-8250.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.5.0

> string function: alias lower/lcase
> ----------------------------------
>
> Key: SPARK-8250
> URL: https://issues.apache.org/jira/browse/SPARK-8250
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Fix For: 1.5.0
>
> Alias lower/lcase in FunctionRegistry.
[jira] [Resolved] (SPARK-8205) conditional function: nvl
[ https://issues.apache.org/jira/browse/SPARK-8205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-8205.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.5.0

> conditional function: nvl
> -------------------------
>
> Key: SPARK-8205
> URL: https://issues.apache.org/jira/browse/SPARK-8205
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Fix For: 1.5.0
>
> nvl(T value, T default_value): T
> Returns default_value if value is null, else returns value (as of Hive 0.11).
> We already have this (called Coalesce). Just need to register an alias for it in FunctionRegistry.
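Since nvl is just a two-argument coalesce, its semantics can be sketched in a few lines. This is a hypothetical Python illustration of the behavior described above, not Spark's implementation; the function name simply mirrors the SQL one:

```python
def nvl(value, default_value):
    # Two-argument coalesce: fall back to the default only when the
    # value is null (None in Python); any other value passes through.
    return default_value if value is None else value

nvl(None, 0)   # -> 0
nvl(42, 0)     # -> 42
```

Note that only null triggers the fallback; falsy-but-non-null values such as an empty string are returned unchanged, matching coalesce semantics.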
[jira] [Resolved] (SPARK-8201) conditional function: if
[ https://issues.apache.org/jira/browse/SPARK-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-8201.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.5.0

> conditional function: if
> ------------------------
>
> Key: SPARK-8201
> URL: https://issues.apache.org/jira/browse/SPARK-8201
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Fix For: 1.5.0
>
> We already have an If expression. Just need to register it in FunctionRegistry.
[jira] [Resolved] (SPARK-7824) Collapsing operator reordering and constant folding into a single batch to push down the single side.
[ https://issues.apache.org/jira/browse/SPARK-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-7824.
-------------------------------------
    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 6351
[https://github.com/apache/spark/pull/6351]

> Collapsing operator reordering and constant folding into a single batch to push down the single side.
> -----------------------------------------------------------------------------------------------------
>
> Key: SPARK-7824
> URL: https://issues.apache.org/jira/browse/SPARK-7824
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.1
> Reporter: Zhongshuai Pei
> Fix For: 1.5.0
>
> SQL:
> {noformat}
> select * from tableA join tableB on (a > 3 and b = d) or (a > 3 and b = e)
> {noformat}
> Plan before the change:
> {noformat}
> == Optimized Logical Plan ==
> Project [a#293,b#294,c#295,d#296,e#297]
>  Join Inner, Some(((a#293 > 3) && ((b#294 = d#296) || (b#294 = e#297))))
>   MetastoreRelation default, tablea, None
>   MetastoreRelation default, tableb, None
> {noformat}
> Plan after the change:
> {noformat}
> == Optimized Logical Plan ==
> Project [a#293,b#294,c#295,d#296,e#297]
>  Join Inner, Some(((b#294 = d#296) || (b#294 = e#297)))
>   Filter (a#293 > 3)
>    MetastoreRelation default, tablea, None
>   MetastoreRelation default, tableb, None
> {noformat}
[jira] [Updated] (SPARK-7824) Collapsing operator reordering and constant folding into a single batch to push down the single side.
[ https://issues.apache.org/jira/browse/SPARK-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-7824:
------------------------------------
    Assignee: Zhongshuai Pei

> Collapsing operator reordering and constant folding into a single batch to push down the single side.
> -----------------------------------------------------------------------------------------------------
>
> Key: SPARK-7824
> URL: https://issues.apache.org/jira/browse/SPARK-7824
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.1
> Reporter: Zhongshuai Pei
> Assignee: Zhongshuai Pei
> Fix For: 1.5.0
>
> SQL:
> {noformat}
> select * from tableA join tableB on (a > 3 and b = d) or (a > 3 and b = e)
> {noformat}
> Plan before the change:
> {noformat}
> == Optimized Logical Plan ==
> Project [a#293,b#294,c#295,d#296,e#297]
>  Join Inner, Some(((a#293 > 3) && ((b#294 = d#296) || (b#294 = e#297))))
>   MetastoreRelation default, tablea, None
>   MetastoreRelation default, tableb, None
> {noformat}
> Plan after the change:
> {noformat}
> == Optimized Logical Plan ==
> Project [a#293,b#294,c#295,d#296,e#297]
>  Join Inner, Some(((b#294 = d#296) || (b#294 = e#297)))
>   Filter (a#293 > 3)
>    MetastoreRelation default, tablea, None
>   MetastoreRelation default, tableb, None
> {noformat}
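The rewrite behind the two plans above factors the conjunct common to both branches of the OR out of the join condition, so it can be pushed below the join as a plain Filter. The following is a hypothetical Python sketch of just that factoring step, modeling conjuncts as sets of strings; it is not Catalyst code, and the function name is invented for illustration:

```python
def extract_common_conjuncts(left, right):
    # left/right: conjunct sets of the two OR branches, e.g.
    # {"a > 3", "b = d"} and {"a > 3", "b = e"}.
    # Conjuncts appearing in both branches can be pulled out of the OR:
    # (X and P) or (X and Q)  ==  X and (P or Q)
    common = left & right
    return common, (left - common, right - common)

common, rest = extract_common_conjuncts({"a > 3", "b = d"},
                                        {"a > 3", "b = e"})
# common -> {"a > 3"}: becomes a Filter pushed below the join
# rest   -> ({"b = d"}, {"b = e"}): stays in the join condition as an OR
```

In the optimized plan this corresponds to `Filter (a#293 > 3)` appearing under the join while `(b#294 = d#296) || (b#294 = e#297)` remains as the join condition.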
[jira] [Commented] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized
[ https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582707#comment-14582707 ]

Apache Spark commented on SPARK-7780:
-------------------------------------

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/6771

> The intercept in LogisticRegressionWithLBFGS should not be regularized
> ----------------------------------------------------------------------
>
> Key: SPARK-7780
> URL: https://issues.apache.org/jira/browse/SPARK-7780
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Reporter: DB Tsai
>
> The intercept in logistic regression represents a prior on the categories and should not be regularized. In MLlib, regularization is handled through `Updater`, and the `Updater` penalizes all components without excluding the intercept, which results in poor training accuracy when regularization is used. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib, since the majority of users are still using the MLlib API.
> Note that both implementations do feature scaling to improve convergence; the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they converge to the same solution.
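The fix described above amounts to computing the penalty over the feature weights only. A minimal sketch of that idea, assuming the intercept is stored as the last coefficient (pure illustration, not MLlib's `Updater` or the ML implementation):

```python
def l2_penalty_excluding_intercept(coefficients, lam):
    # coefficients: feature weights followed by the intercept (last entry).
    # Only the weights are penalized; the intercept, which acts as a prior
    # on the class distribution, is excluded from the L2 penalty.
    weights = coefficients[:-1]
    return 0.5 * lam * sum(w * w for w in weights)
```

With this convention the penalty is identical for any intercept value, so the intercept is free to fit the class prior while the weights are shrunk as usual.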
[jira] [Assigned] (SPARK-8316) Upgrade Maven to 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8316:
-----------------------------------
    Assignee: Apache Spark

> Upgrade Maven to 3.3.3
> ----------------------
>
> Key: SPARK-8316
> URL: https://issues.apache.org/jira/browse/SPARK-8316
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Reporter: Nicholas Chammas
> Assignee: Apache Spark
> Priority: Minor
>
> Maven versions prior to 3.3 apparently have some bugs.
> See: https://github.com/apache/spark/pull/6492#issuecomment-111001101
[jira] [Commented] (SPARK-8316) Upgrade Maven to 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582701#comment-14582701 ]

Apache Spark commented on SPARK-8316:
-------------------------------------

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/6770

> Upgrade Maven to 3.3.3
> ----------------------
>
> Key: SPARK-8316
> URL: https://issues.apache.org/jira/browse/SPARK-8316
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Reporter: Nicholas Chammas
> Priority: Minor
>
> Maven versions prior to 3.3 apparently have some bugs.
> See: https://github.com/apache/spark/pull/6492#issuecomment-111001101
[jira] [Assigned] (SPARK-8316) Upgrade Maven to 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8316:
-----------------------------------
    Assignee: (was: Apache Spark)

> Upgrade Maven to 3.3.3
> ----------------------
>
> Key: SPARK-8316
> URL: https://issues.apache.org/jira/browse/SPARK-8316
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Reporter: Nicholas Chammas
> Priority: Minor
>
> Maven versions prior to 3.3 apparently have some bugs.
> See: https://github.com/apache/spark/pull/6492#issuecomment-111001101
[jira] [Created] (SPARK-8317) Do not push sort into shuffle in Exchange operator
Josh Rosen created SPARK-8317:
---------------------------------

    Summary: Do not push sort into shuffle in Exchange operator
        Key: SPARK-8317
        URL: https://issues.apache.org/jira/browse/SPARK-8317
    Project: Spark
 Issue Type: Improvement
 Components: SQL
   Reporter: Josh Rosen
   Assignee: Josh Rosen

In some cases, Spark SQL pushes sorting operations into the shuffle layer by specifying a key ordering as part of the shuffle dependency. I think that we should not do this:

- Since we do not delegate aggregation to Spark's shuffle, specifying the keyOrdering as part of the shuffle has no effect on the shuffle map side.
- By performing the sort ourselves (by inserting a sort operator after the shuffle instead), we can use the Exchange planner to choose specialized sorting implementations based on the types of rows being sorted.
- We can remove some complexity from SqlSerializer2 by not requiring it to know about sort orderings, since SQL's own sort operators will already perform the necessary defensive copying.
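The proposed split can be illustrated with a toy two-stage pipeline: the shuffle stage only hash-partitions rows and requests no key ordering, and a separate sort runs on each partition afterwards. This is a hypothetical sketch of the dataflow, not Spark's Exchange operator:

```python
def shuffle_then_sort(rows, key, num_partitions):
    # Stage 1 ("shuffle"): hash-partition rows by key; the shuffle layer
    # itself is not asked to order anything.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(key(row)) % num_partitions].append(row)
    # Stage 2: an explicit sort operator runs per partition after the
    # shuffle, where the planner could pick a specialized sort for the
    # row type being sorted.
    return [sorted(part, key=key) for part in partitions]
```

Keeping the sort as its own operator is what lets the planner swap in specialized implementations, per the second bullet above.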
[jira] [Comment Edited] (SPARK-8301) Improve UTF8String substring/startsWith/endsWith/contains performance
[ https://issues.apache.org/jira/browse/SPARK-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582691#comment-14582691 ]

Tarek Auel edited comment on SPARK-8301 at 6/11/15 11:45 PM:
-------------------------------------------------------------

Another approach could be:

(0 until b.length).forall((i) => b(i) == bytes(i))

In theory this could be parallelised.

was (Author: tarekauel):

Another approach could be:

(0 until b.length).forall((i) => b(i) == bytes(i))

In theory this could be parallelised.

> Improve UTF8String substring/startsWith/endsWith/contains performance
> ---------------------------------------------------------------------
>
> Key: SPARK-8301
> URL: https://issues.apache.org/jira/browse/SPARK-8301
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
> Priority: Critical
>
> Many functions in UTF8String are unnecessarily expensive.
[jira] [Commented] (SPARK-8301) Improve UTF8String substring/startsWith/endsWith/contains performance
[ https://issues.apache.org/jira/browse/SPARK-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582691#comment-14582691 ]

Tarek Auel commented on SPARK-8301:
-----------------------------------

Another approach could be:

(0 until b.length).forall((i) => b(i) == bytes(i))

In theory this could be parallelised.

> Improve UTF8String substring/startsWith/endsWith/contains performance
> ---------------------------------------------------------------------
>
> Key: SPARK-8301
> URL: https://issues.apache.org/jira/browse/SPARK-8301
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
> Priority: Critical
>
> Many functions in UTF8String are unnecessarily expensive.
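The one-liner in the comment compares the two byte sequences position by position. A Python sketch of the same idea for a prefix check (illustrative only, not UTF8String's code), with an explicit bounds check added so a prefix longer than the data never matches:

```python
def starts_with(data, prefix):
    # Byte-wise comparison in the spirit of the Scala snippet
    # (0 until b.length).forall(i => b(i) == bytes(i)):
    # compare each position; short-circuit on the first mismatch.
    if len(prefix) > len(data):
        return False
    return all(data[i] == prefix[i] for i in range(len(prefix)))

starts_with(b"hello", b"he")   # -> True
starts_with(b"he", b"hello")   # -> False
```

Because `all` short-circuits, the cost is bounded by the first mismatching byte, which is the kind of cheap early exit the performance issue is after.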
[jira] [Created] (SPARK-8316) Upgrade Maven to 3.3.3
Nicholas Chammas created SPARK-8316:
---------------------------------------

    Summary: Upgrade Maven to 3.3.3
        Key: SPARK-8316
        URL: https://issues.apache.org/jira/browse/SPARK-8316
    Project: Spark
 Issue Type: Improvement
 Components: Build
   Reporter: Nicholas Chammas
   Priority: Minor

Maven versions prior to 3.3 apparently have some bugs.

See: https://github.com/apache/spark/pull/6492#issuecomment-111001101
[jira] [Commented] (SPARK-7157) Add approximate stratified sampling to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582684#comment-14582684 ]

Apache Spark commented on SPARK-7157:
-------------------------------------

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/6769

> Add approximate stratified sampling to DataFrame
> ------------------------------------------------
>
> Key: SPARK-7157
> URL: https://issues.apache.org/jira/browse/SPARK-7157
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Joseph K. Bradley
> Assignee: Xiangrui Meng
> Priority: Minor
>
> def sampleBy(c
[jira] [Assigned] (SPARK-7157) Add approximate stratified sampling to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7157:
-----------------------------------
    Assignee: Xiangrui Meng (was: Apache Spark)

> Add approximate stratified sampling to DataFrame
> ------------------------------------------------
>
> Key: SPARK-7157
> URL: https://issues.apache.org/jira/browse/SPARK-7157
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Joseph K. Bradley
> Assignee: Xiangrui Meng
> Priority: Minor
>
> def sampleBy(c
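"Approximate" stratified sampling generally means each row is kept independently with the fraction configured for its stratum, so per-stratum sample sizes are only approximately fraction * count rather than exact. A hypothetical Python sketch of that idea (the truncated `def sampleBy(c` signature above is left as-is; the names below are invented for illustration, not the DataFrame API):

```python
import random

def sample_by(rows, key, fractions, seed=None):
    # key(row) identifies the row's stratum; fractions maps each
    # stratum to its sampling probability. Rows in strata without a
    # configured fraction are dropped (probability 0.0).
    rng = random.Random(seed)
    return [row for row in rows
            if rng.random() < fractions.get(key(row), 0.0)]
```

Each row needs only one random draw and a dictionary lookup, which is why this scheme scales without first counting the strata the way exact stratified sampling would.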