[jira] [Assigned] (SPARK-8187) date/time function: date_sub

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8187:
---

Assignee: (was: Apache Spark)

> date/time function: date_sub
> 
>
> Key: SPARK-8187
> URL: https://issues.apache.org/jira/browse/SPARK-8187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> date_sub(string startdate, int days): string
> date_sub(date startdate, int days): date
> Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = 
> '2008-12-30'.
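The proposed semantics can be sketched in plain Python with the standard datetime module (an illustration of the behavior described above, not Spark's implementation; the 'YYYY-MM-DD' string format is assumed from the example):

```python
from datetime import date, datetime, timedelta

def date_sub(startdate, days):
    # Subtract `days` from `startdate`. A 'YYYY-MM-DD' string returns a string,
    # a date returns a date, mirroring the two overloads described above.
    if isinstance(startdate, str):
        d = datetime.strptime(startdate, "%Y-%m-%d").date()
        return (d - timedelta(days=days)).isoformat()
    return startdate - timedelta(days=days)

print(date_sub('2008-12-31', 1))  # 2008-12-30
```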



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8185) date/time function: datediff

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8185:
---

Assignee: (was: Apache Spark)

> date/time function: datediff
> 
>
> Key: SPARK-8185
> URL: https://issues.apache.org/jira/browse/SPARK-8185
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> datediff(date enddate, date startdate): int
> Returns the number of days from startdate to enddate: datediff('2009-03-01', 
> '2009-02-27') = 2.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
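As a rough sketch of the documented semantics (plain Python, not the Spark implementation; the 'YYYY-MM-DD' string handling is an assumption):

```python
from datetime import date, datetime

def datediff(enddate, startdate):
    # Number of days from startdate to enddate (end minus start), as in the
    # Hive UDF; accepts date objects or 'YYYY-MM-DD' strings.
    def to_date(d):
        return datetime.strptime(d, "%Y-%m-%d").date() if isinstance(d, str) else d
    return (to_date(enddate) - to_date(startdate)).days

print(datediff('2009-03-01', '2009-02-27'))  # 2
```

Note that with these semantics the result is negative when enddate precedes startdate.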






[jira] [Commented] (SPARK-8185) date/time function: datediff

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583058#comment-14583058
 ] 

Apache Spark commented on SPARK-8185:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6782

> date/time function: datediff
> 
>
> Key: SPARK-8185
> URL: https://issues.apache.org/jira/browse/SPARK-8185
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> datediff(date enddate, date startdate): int
> Returns the number of days from startdate to enddate: datediff('2009-03-01', 
> '2009-02-27') = 2.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF






[jira] [Assigned] (SPARK-8186) date/time function: date_add

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8186:
---

Assignee: Apache Spark

> date/time function: date_add
> 
>
> Key: SPARK-8186
> URL: https://issues.apache.org/jira/browse/SPARK-8186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> date_add(string startdate, int days): string
> date_add(date startdate, int days): date
> Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.
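The mirror image of date_sub can be sketched the same way in plain Python (an illustration of the described overloads, not Spark's implementation):

```python
from datetime import date, datetime, timedelta

def date_add(startdate, days):
    # Add `days` to `startdate`; string in / string out, date in / date out,
    # mirroring the two overloads described above.
    if isinstance(startdate, str):
        d = datetime.strptime(startdate, "%Y-%m-%d").date()
        return (d + timedelta(days=days)).isoformat()
    return startdate + timedelta(days=days)

print(date_add('2008-12-31', 1))  # 2009-01-01
```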






[jira] [Commented] (SPARK-8187) date/time function: date_sub

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583060#comment-14583060
 ] 

Apache Spark commented on SPARK-8187:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6782

> date/time function: date_sub
> 
>
> Key: SPARK-8187
> URL: https://issues.apache.org/jira/browse/SPARK-8187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> date_sub(string startdate, int days): string
> date_sub(date startdate, int days): date
> Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = 
> '2008-12-30'.






[jira] [Commented] (SPARK-8186) date/time function: date_add

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583059#comment-14583059
 ] 

Apache Spark commented on SPARK-8186:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6782

> date/time function: date_add
> 
>
> Key: SPARK-8186
> URL: https://issues.apache.org/jira/browse/SPARK-8186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> date_add(string startdate, int days): string
> date_add(date startdate, int days): date
> Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.






[jira] [Assigned] (SPARK-8186) date/time function: date_add

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8186:
---

Assignee: (was: Apache Spark)

> date/time function: date_add
> 
>
> Key: SPARK-8186
> URL: https://issues.apache.org/jira/browse/SPARK-8186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> date_add(string startdate, int days): string
> date_add(date startdate, int days): date
> Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.






[jira] [Assigned] (SPARK-8187) date/time function: date_sub

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8187:
---

Assignee: Apache Spark

> date/time function: date_sub
> 
>
> Key: SPARK-8187
> URL: https://issues.apache.org/jira/browse/SPARK-8187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> date_sub(string startdate, int days): string
> date_sub(date startdate, int days): date
> Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = 
> '2008-12-30'.






[jira] [Assigned] (SPARK-8185) date/time function: datediff

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8185:
---

Assignee: Apache Spark

> date/time function: datediff
> 
>
> Key: SPARK-8185
> URL: https://issues.apache.org/jira/browse/SPARK-8185
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> datediff(date enddate, date startdate): int
> Returns the number of days from startdate to enddate: datediff('2009-03-01', 
> '2009-02-27') = 2.
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF






[jira] [Assigned] (SPARK-7284) Update streaming documentation for Spark 1.4.0 release

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7284:
---

Assignee: Apache Spark  (was: Tathagata Das)

> Update streaming documentation for Spark 1.4.0 release
> --
>
> Key: SPARK-7284
> URL: https://issues.apache.org/jira/browse/SPARK-7284
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Critical
>
> Things to update (continuously updated list)
> - Python API for Kafka Direct
> - Pointers to the new Streaming UI
> - Update Kafka version to 0.8.2.1
> - Add ref to RDD.foreachPartitionWithIndex (if merged)






[jira] [Commented] (SPARK-7284) Update streaming documentation for Spark 1.4.0 release

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583054#comment-14583054
 ] 

Apache Spark commented on SPARK-7284:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/6781

> Update streaming documentation for Spark 1.4.0 release
> --
>
> Key: SPARK-7284
> URL: https://issues.apache.org/jira/browse/SPARK-7284
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Things to update (continuously updated list)
> - Python API for Kafka Direct
> - Pointers to the new Streaming UI
> - Update Kafka version to 0.8.2.1
> - Add ref to RDD.foreachPartitionWithIndex (if merged)






[jira] [Assigned] (SPARK-7284) Update streaming documentation for Spark 1.4.0 release

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7284:
---

Assignee: Tathagata Das  (was: Apache Spark)

> Update streaming documentation for Spark 1.4.0 release
> --
>
> Key: SPARK-7284
> URL: https://issues.apache.org/jira/browse/SPARK-7284
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Things to update (continuously updated list)
> - Python API for Kafka Direct
> - Pointers to the new Streaming UI
> - Update Kafka version to 0.8.2.1
> - Add ref to RDD.foreachPartitionWithIndex (if merged)






[jira] [Commented] (SPARK-7289) Combine Limit and Sort to avoid total ordering

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583042#comment-14583042
 ] 

Apache Spark commented on SPARK-7289:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/6780

> Combine Limit and Sort to avoid total ordering
> --
>
> Key: SPARK-7289
> URL: https://issues.apache.org/jira/browse/SPARK-7289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Fei Wang
>
> Optimize following sql
> select key from (select * from testData order by key) t limit 5
> from 
> == Parsed Logical Plan ==
> 'Limit 5
>  'Project ['key]
>   'Subquery t
>'Sort ['key ASC], true
> 'Project [*]
>  'UnresolvedRelation [testData], None
> == Analyzed Logical Plan ==
> Limit 5
>  Project [key#0]
>   Subquery t
>Sort [key#0 ASC], true
> Project [key#0,value#1]
>  Subquery testData
>   LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 
> == Optimized Logical Plan ==
> Limit 5
>  Project [key#0]
>   Sort [key#0 ASC], true
>LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 
> == Physical Plan ==
> Limit 5
>  Project [key#0]
>   Sort [key#0 ASC], true
>Exchange (RangePartitioning [key#0 ASC], 5), []
> PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] 
> to
> == Parsed Logical Plan ==
> 'Limit 5
>  'Project ['key]
>   'Subquery t
>'Sort ['key ASC], true
> 'Project [*]
>  'UnresolvedRelation [testData], None
> == Analyzed Logical Plan ==
> Limit 5
>  Project [key#0]
>   Subquery t
>Sort [key#0 ASC], true
> Project [key#0,value#1]
>  Subquery testData
>   LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
> == Optimized Logical Plan ==
> Project [key#0]
>  Limit 5
>   Sort [key#0 ASC], true
>LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 
> == Physical Plan ==
> Project [key#0]
>  TakeOrdered 5, [key#0 ASC]
>   PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
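The gain in the rewritten plan is that TakeOrdered needs only the top k rows, which a bounded heap can produce without totally ordering (or range-partitioning) the whole input. A standalone sketch of the idea:

```python
import heapq

data = [9, 1, 7, 3, 5, 8, 2, 6, 4, 0]

# Limit-over-Sort: fully sort every row, then keep the first 5.
total_order = sorted(data)[:5]

# TakeOrdered: track only the 5 smallest with a bounded heap -- O(n log k),
# no full sort and no exchange of the entire input.
top_k = heapq.nsmallest(5, data)

print(top_k == total_order)  # True
```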






[jira] [Commented] (SPARK-7267) Push down Project when its child is Limit

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583041#comment-14583041
 ] 

Apache Spark commented on SPARK-7267:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/6780

> Push down Project when its child is Limit 
> ---
>
> Key: SPARK-7267
> URL: https://issues.apache.org/jira/browse/SPARK-7267
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Zhongshuai Pei
>Assignee: Zhongshuai Pei
>Priority: Critical
> Fix For: 1.4.0
>
>
> SQL
> {quote}
> select key from (select key,value from t1 limit 100) t2 limit 10
> {quote}
> Optimized Logical Plan before modifying
> {quote}
> == Optimized Logical Plan ==
> Limit 10
>  Project [key#228]
>   Limit 100
>MetastoreRelation default, t1, None
> {quote}
> Optimized Logical Plan after modifying
> {quote}
> == Optimized Logical Plan ==
> Limit 10
>  Limit 100
>Project [key#228]
> MetastoreRelation default, t1, None
> {quote}
> After this, we can combine the limits.
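The point of pushing the Project below the Limit is that the two now-adjacent limits collapse into one: taking 10 rows of the first 100 is the same as taking min(10, 100) rows. A toy illustration (not the optimizer code):

```python
def combine_limits(outer, inner):
    # Limit(outer) applied after Limit(inner) keeps min(outer, inner) rows.
    return min(outer, inner)

rows = list(range(1000))
# Applying both limits in sequence equals applying the combined limit once.
assert rows[:100][:10] == rows[:combine_limits(10, 100)]
print(combine_limits(10, 100))  # 10
```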






[jira] [Commented] (SPARK-8234) misc function: md5

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583034#comment-14583034
 ] 

Apache Spark commented on SPARK-8234:
-

User 'qiansl127' has created a pull request for this issue:
https://github.com/apache/spark/pull/6779

> misc function: md5
> --
>
> Key: SPARK-8234
> URL: https://issues.apache.org/jira/browse/SPARK-8234
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> md5(string/binary): string
> Calculates an MD5 128-bit checksum for the string or binary (as of Hive 
> 1.3.0). The value is returned as a string of 32 hex digits, or NULL if the 
> argument was NULL. Example: md5('ABC') = '902fbdd2b1df0c4f70b4a5d23525e932'.
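The described behavior maps directly onto Python's hashlib (a standalone sketch of the semantics above, with None standing in for SQL NULL):

```python
import hashlib

def md5_hex(value):
    # MD5 128-bit checksum rendered as 32 hex digits; NULL in, NULL out.
    if value is None:
        return None
    if isinstance(value, str):        # accept string or binary input
        value = value.encode("utf-8")
    return hashlib.md5(value).hexdigest()

print(md5_hex('ABC'))  # 902fbdd2b1df0c4f70b4a5d23525e932
```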






[jira] [Assigned] (SPARK-8234) misc function: md5

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8234:
---

Assignee: (was: Apache Spark)

> misc function: md5
> --
>
> Key: SPARK-8234
> URL: https://issues.apache.org/jira/browse/SPARK-8234
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> md5(string/binary): string
> Calculates an MD5 128-bit checksum for the string or binary (as of Hive 
> 1.3.0). The value is returned as a string of 32 hex digits, or NULL if the 
> argument was NULL. Example: md5('ABC') = '902fbdd2b1df0c4f70b4a5d23525e932'.






[jira] [Assigned] (SPARK-8234) misc function: md5

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8234:
---

Assignee: Apache Spark

> misc function: md5
> --
>
> Key: SPARK-8234
> URL: https://issues.apache.org/jira/browse/SPARK-8234
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> md5(string/binary): string
> Calculates an MD5 128-bit checksum for the string or binary (as of Hive 
> 1.3.0). The value is returned as a string of 32 hex digits, or NULL if the 
> argument was NULL. Example: md5('ABC') = '902fbdd2b1df0c4f70b4a5d23525e932'.






[jira] [Assigned] (SPARK-8323) Remove mapOutputTracker field in TaskSchedulerImpl

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8323:
---

Assignee: (was: Apache Spark)

> Remove mapOutputTracker field in TaskSchedulerImpl
> --
>
> Key: SPARK-8323
> URL: https://issues.apache.org/jira/browse/SPARK-8323
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: patrickliu
>
> TaskSchedulerImpl's mapOutputTracker field is only referenced once, in 
> TaskSetManager, so I think we could remove the field from the 
> TaskSchedulerImpl class. Instead, TaskSetManager could reference the 
> mapOutputTracker from SparkEnv directly.






[jira] [Commented] (SPARK-8323) Remove mapOutputTracker field in TaskSchedulerImpl

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583033#comment-14583033
 ] 

Apache Spark commented on SPARK-8323:
-

User 'yufan-liu' has created a pull request for this issue:
https://github.com/apache/spark/pull/6778

> Remove mapOutputTracker field in TaskSchedulerImpl
> --
>
> Key: SPARK-8323
> URL: https://issues.apache.org/jira/browse/SPARK-8323
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: patrickliu
>
> TaskSchedulerImpl's mapOutputTracker field is only referenced once, in 
> TaskSetManager, so I think we could remove the field from the 
> TaskSchedulerImpl class. Instead, TaskSetManager could reference the 
> mapOutputTracker from SparkEnv directly.






[jira] [Assigned] (SPARK-8323) Remove mapOutputTracker field in TaskSchedulerImpl

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8323:
---

Assignee: Apache Spark

> Remove mapOutputTracker field in TaskSchedulerImpl
> --
>
> Key: SPARK-8323
> URL: https://issues.apache.org/jira/browse/SPARK-8323
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: patrickliu
>Assignee: Apache Spark
>
> TaskSchedulerImpl's mapOutputTracker field is only referenced once, in 
> TaskSetManager, so I think we could remove the field from the 
> TaskSchedulerImpl class. Instead, TaskSetManager could reference the 
> mapOutputTracker from SparkEnv directly.






[jira] [Created] (SPARK-8323) Remove mapOutputTracker field in TaskSchedulerImpl

2015-06-11 Thread patrickliu (JIRA)
patrickliu created SPARK-8323:
-

 Summary: Remove mapOutputTracker field in TaskSchedulerImpl
 Key: SPARK-8323
 URL: https://issues.apache.org/jira/browse/SPARK-8323
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Spark Core
Reporter: patrickliu


TaskSchedulerImpl's mapOutputTracker field is only referenced once, in 
TaskSetManager, so I think we could remove the field from the TaskSchedulerImpl 
class. Instead, TaskSetManager could reference the mapOutputTracker from 
SparkEnv directly.






[jira] [Resolved] (SPARK-6566) Update Spark to use the latest version of Parquet libraries

2015-06-11 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6566.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 5889
[https://github.com/apache/spark/pull/5889]

> Update Spark to use the latest version of Parquet libraries
> ---
>
> Key: SPARK-6566
> URL: https://issues.apache.org/jira/browse/SPARK-6566
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Konstantin Shaposhnikov
> Fix For: 1.5.0
>
>
> There are a lot of bug fixes in the latest version of Parquet (1.6.0rc7), 
> e.g. PARQUET-136.
> It would be good to update Spark to use the latest Parquet version.
> The following changes are required:
> {code}
> diff --git a/pom.xml b/pom.xml
> index 5ad39a9..095b519 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -132,7 +132,7 @@
>  
>  0.13.1
>  10.10.1.1
> -1.6.0rc3
> +1.6.0rc7
>  1.2.3
>  8.1.14.v20131031
>  3.0.0.v201112011016
> {code}
> and
> {code}
> --- 
> a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
> +++ 
> b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
> @@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
>  globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
>mergedMetadata, globalMetaData.getCreatedBy)
>  
> -val readContext = getReadSupport(configuration).init(
> +val readContext = 
> ParquetInputFormat.getReadSupportInstance(configuration).init(
>new InitContext(configuration,
>  globalMetaData.getKeyValueMetaData,
>  globalMetaData.getSchema))
> {code}
> I am happy to prepare a pull request if necessary.






[jira] [Commented] (SPARK-8311) saveAsTextFile with Hadoop1 could lead to errors

2015-06-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582977#comment-14582977
 ] 

Shivaram Venkataraman commented on SPARK-8311:
--

Yeah, it looks very similar. I'll close this and follow SPARK-8057.

> saveAsTextFile with Hadoop1 could lead to errors
> 
>
> Key: SPARK-8311
> URL: https://issues.apache.org/jira/browse/SPARK-8311
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Shivaram Venkataraman
>
> I've run into this bug a couple of times and wanted to document things I have 
> found so far in a JIRA. From what I see, if an application is linked to 
> Hadoop1 and running on a Spark 1.3.1 + Hadoop1 cluster, then the 
> saveAsTextFile call consistently fails with errors of the form:
> {code}
> 15/06/11 19:47:10 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 3.0 
> (TID 13, ip-10-212-141-222.us-west-2.compute.internal): 
> java.lang.IncompatibleClassChangeError: Found class 
> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
> at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
> {code}
> This does not happen in 1.2.1.
> I think the bug is caused by the following commit:
> https://github.com/apache/spark/commit/fde6945417355ae57500b67d034c9cad4f20d240
>  where the function `commitTask` assumes that the mrTaskContext is always 
> a `mapreduce.TaskContext` while it is a `mapred.TaskContext` in Hadoop1. But 
> this is just a hypothesis, as I haven't tried reverting this to see if the 
> problem goes away.
> cc [~liancheng]






[jira] [Commented] (SPARK-8322) EC2 script not fully updated for 1.4.0 release

2015-06-11 Thread Mark Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582971#comment-14582971
 ] 

Mark Smith commented on SPARK-8322:
---

This is the backport to branch-1.4

> EC2 script not fully updated for 1.4.0 release
> --
>
> Key: SPARK-8322
> URL: https://issues.apache.org/jira/browse/SPARK-8322
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mark Smith
>  Labels: easyfix
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to 
> the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to 
> break for the latest release.






[jira] [Commented] (SPARK-8322) EC2 script not fully updated for 1.4.0 release

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582970#comment-14582970
 ] 

Apache Spark commented on SPARK-8322:
-

User 'markmsmith' has created a pull request for this issue:
https://github.com/apache/spark/pull/6777

> EC2 script not fully updated for 1.4.0 release
> --
>
> Key: SPARK-8322
> URL: https://issues.apache.org/jira/browse/SPARK-8322
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mark Smith
>  Labels: easyfix
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to 
> the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to 
> break for the latest release.






[jira] [Updated] (SPARK-7862) Query would hang when the using script has error output in SparkSQL

2015-06-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7862:

Assignee: zhichao-li

> Query would hang when the using script has error output in SparkSQL
> ---
>
> Key: SPARK-7862
> URL: https://issues.apache.org/jira/browse/SPARK-7862
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: zhichao-li
>Assignee: zhichao-li
> Fix For: 1.5.0
>
>
> Steps to reproduce:
> val data = (1 to 10).map { i => (i, i, i) }
> data.toDF("d1", "d2", "d3").registerTempTable("script_trans")
>  sql("SELECT TRANSFORM (d1, d2, d3) USING 'cat 1>&2' AS (a,b,c) FROM 
> script_trans")






[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access

2015-06-11 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582965#comment-14582965
 ] 

Peng Cheng commented on SPARK-7442:
---

Still not fixed in 1.4.0; reverting to Hadoop 2.4 until this is resolved.

> Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
> -
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
> Environment: OS X
>Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads 
> page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 
> works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 
> that doesn't work.






[jira] [Resolved] (SPARK-7862) Query would hang when the using script has error output in SparkSQL

2015-06-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7862.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6404
[https://github.com/apache/spark/pull/6404]

> Query would hang when the using script has error output in SparkSQL
> ---
>
> Key: SPARK-7862
> URL: https://issues.apache.org/jira/browse/SPARK-7862
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: zhichao-li
> Fix For: 1.5.0
>
>
> Steps to reproduce:
> val data = (1 to 10).map { i => (i, i, i) }
> data.toDF("d1", "d2", "d3").registerTempTable("script_trans")
>  sql("SELECT TRANSFORM (d1, d2, d3) USING 'cat 1>&2' AS (a,b,c) FROM 
> script_trans")






[jira] [Issue Comment Deleted] (SPARK-8322) EC2 script not fully updated for 1.4.0 release

2015-06-11 Thread Mark Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Smith updated SPARK-8322:
--
Comment: was deleted

(was: This should probably also be back-ported from master to the 1.4 branch, 
but I haven't made a pull request for that.)

> EC2 script not fully updated for 1.4.0 release
> --
>
> Key: SPARK-8322
> URL: https://issues.apache.org/jira/browse/SPARK-8322
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mark Smith
>  Labels: easyfix
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to 
> the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to 
> break for the latest release.






[jira] [Resolved] (SPARK-8317) Do not push sort into shuffle in Exchange operator

2015-06-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-8317.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6772
[https://github.com/apache/spark/pull/6772]

> Do not push sort into shuffle in Exchange operator
> --
>
> Key: SPARK-8317
> URL: https://issues.apache.org/jira/browse/SPARK-8317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.5.0
>
>
> In some cases, Spark SQL pushes sorting operations into the shuffle layer by 
> specifying a key ordering as part of the shuffle dependency. I think that we 
> should not do this:
> - Since we do not delegate aggregation to Spark's shuffle, specifying the 
> keyOrdering as part of the shuffle has no effect on the shuffle map side.
> - By performing the shuffle ourselves (by inserting a sort operator after the 
> shuffle instead), we can use the Exchange planner to choose specialized 
> sorting implementations based on the types of rows being sorted.
> - We can remove some complexity from SqlSerializer2 by not requiring it to 
> know about sort orderings, since SQL's own sort operators will already 
> perform the necessary defensive copying.






[jira] [Commented] (SPARK-8322) EC2 script not fully updated for 1.4.0 release

2015-06-11 Thread Mark Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582948#comment-14582948
 ] 

Mark Smith commented on SPARK-8322:
---

This should probably also be back-ported from master to the 1.4 branch, but I 
haven't made a pull request for that.

> EC2 script not fully updated for 1.4.0 release
> --
>
> Key: SPARK-8322
> URL: https://issues.apache.org/jira/browse/SPARK-8322
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mark Smith
>  Labels: easyfix
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to 
> the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to 
> break for the latest release.






[jira] [Assigned] (SPARK-8322) EC2 script not fully updated for 1.4.0 release

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8322:
---

Assignee: Apache Spark

> EC2 script not fully updated for 1.4.0 release
> --
>
> Key: SPARK-8322
> URL: https://issues.apache.org/jira/browse/SPARK-8322
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mark Smith
>Assignee: Apache Spark
>  Labels: easyfix
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to 
> the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to 
> break for the latest release.






[jira] [Updated] (SPARK-8322) EC2 script not fully updated for 1.4.0 release

2015-06-11 Thread Mark Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Smith updated SPARK-8322:
--
Target Version/s:   (was: 1.4.0)
   Fix Version/s: (was: 1.4.0)

> EC2 script not fully updated for 1.4.0 release
> --
>
> Key: SPARK-8322
> URL: https://issues.apache.org/jira/browse/SPARK-8322
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mark Smith
>  Labels: easyfix
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to 
> the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to 
> break for the latest release.






[jira] [Commented] (SPARK-8322) EC2 script not fully updated for 1.4.0 release

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582942#comment-14582942
 ] 

Apache Spark commented on SPARK-8322:
-

User 'markmsmith' has created a pull request for this issue:
https://github.com/apache/spark/pull/6776

> EC2 script not fully updated for 1.4.0 release
> --
>
> Key: SPARK-8322
> URL: https://issues.apache.org/jira/browse/SPARK-8322
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mark Smith
>  Labels: easyfix
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to 
> the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to 
> break for the latest release.






[jira] [Assigned] (SPARK-8322) EC2 script not fully updated for 1.4.0 release

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8322:
---

Assignee: (was: Apache Spark)

> EC2 script not fully updated for 1.4.0 release
> --
>
> Key: SPARK-8322
> URL: https://issues.apache.org/jira/browse/SPARK-8322
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mark Smith
>  Labels: easyfix
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to 
> the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to 
> break for the latest release.






[jira] [Commented] (SPARK-8322) EC2 script not fully updated for 1.4.0 release

2015-06-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582937#comment-14582937
 ] 

Sean Owen commented on SPARK-8322:
--

Related to SPARK-8310. You'll probably want a PR for both master and 1.4 here. 
CC [~shivaram]

> EC2 script not fully updated for 1.4.0 release
> --
>
> Key: SPARK-8322
> URL: https://issues.apache.org/jira/browse/SPARK-8322
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mark Smith
>  Labels: easyfix
> Fix For: 1.4.0
>
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to 
> the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to 
> break for the latest release.






[jira] [Created] (SPARK-8322) EC2 script not fully updated for 1.4.0 release

2015-06-11 Thread Mark Smith (JIRA)
Mark Smith created SPARK-8322:
-

 Summary: EC2 script not fully updated for 1.4.0 release
 Key: SPARK-8322
 URL: https://issues.apache.org/jira/browse/SPARK-8322
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Mark Smith
 Fix For: 1.4.0


In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to the 
VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to break 
for the latest release.
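The described fix amounts to adding "1.4.0" to the version structures the launch script checks. A minimal sketch of that validation logic, assuming names from the report; the real spark_ec2.py differs in detail, and the Tachyon version values below are placeholders, not the actual mapping:

```python
# Hypothetical sketch of spark_ec2.py's version check; "1.4.0" has been added
# to both structures. The Tachyon versions are illustrative placeholders.
VALID_SPARK_VERSIONS = {"1.2.1", "1.3.0", "1.3.1", "1.4.0"}
SPARK_TACHYON_MAP = {"1.3.0": "0.5.0", "1.3.1": "0.5.0", "1.4.0": "0.6.4"}

def validate_spark_version(version):
    """Fail fast on Spark versions the script does not know how to launch."""
    if version not in VALID_SPARK_VERSIONS:
        raise ValueError("Unknown Spark version: %s" % version)
    return version
```

With "1.4.0" present in the set, validate_spark_version("1.4.0") succeeds instead of aborting the launch.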






[jira] [Resolved] (SPARK-8311) saveAsTextFile with Hadoop1 could lead to errors

2015-06-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8311.
--
Resolution: Duplicate

Yes, 95% sure that's a duplicate.

> saveAsTextFile with Hadoop1 could lead to errors
> 
>
> Key: SPARK-8311
> URL: https://issues.apache.org/jira/browse/SPARK-8311
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Shivaram Venkataraman
>
> I've run into this bug a couple of times and wanted to document things I have 
> found so far in a JIRA. From what I see if an application is linked to 
> Hadoop1 and running on a Spark 1.3.1 + Hadoop1 cluster then the 
> saveAsTextFile call consistently fails with errors of the form
> {code}
> 15/06/11 19:47:10 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 3.0 
> (TID 13, ip-10-212-141-222.us-west-2.compute.internal): 
> java.lang.IncompatibleClassChangeError: Found class 
> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
> at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
> {code}
> This does not happen in 1.2.1
> I think the bug is caused by the following commit
> https://github.com/apache/spark/commit/fde6945417355ae57500b67d034c9cad4f20d240
>  where the function `commitTask` assumes that the mrTaskContext is always 
> a `mapreduce.TaskContext`, while it is a `mapred.TaskContext` in Hadoop1. But 
> this is just a hypothesis, as I haven't tried reverting the commit to see if 
> the problem goes away
> cc [~liancheng]






[jira] [Commented] (SPARK-8318) Spark Streaming Starter JIRAs

2015-06-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582929#comment-14582929
 ] 

Sean Owen commented on SPARK-8318:
--

Minor, but doesn't Component + label = starter already capture that, instead of 
having to maintain and eventually resolve (?) another JIRA?

> Spark Streaming Starter JIRAs
> -
>
> Key: SPARK-8318
> URL: https://issues.apache.org/jira/browse/SPARK-8318
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> This is a master JIRA to collect together all starter tasks related to Spark 
> Streaming. These are simple tasks that contributors can do to get familiar 
> with the process of contributing.






[jira] [Created] (SPARK-8321) Authorization Support(on all operations not only DDL) in Spark Sql

2015-06-11 Thread Sunil (JIRA)
Sunil created SPARK-8321:


 Summary: Authorization Support(on all operations not only DDL) in 
Spark Sql
 Key: SPARK-8321
 URL: https://issues.apache.org/jira/browse/SPARK-8321
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 1.3.0
Reporter: Sunil


Currently, if you run Spark SQL with the Thrift server, it only supports 
authentication and limited authorization (DDL only). We want to extend it to 
provide full authorization, or a pluggable authorization framework like Apache 
Sentry, so that users with the proper roles can access data.






[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2015-06-11 Thread Patrick Grandjean (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582871#comment-14582871
 ] 

Patrick Grandjean commented on SPARK-7768:
--

Registering UDTs for existing classes would be the perfect solution for 
SPARK-6875 (https://issues.apache.org/jira/browse/SPARK-6875)

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.






[jira] [Commented] (SPARK-8120) Typos in warning message in sql/types.py

2015-06-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582857#comment-14582857
 ] 

Joseph K. Bradley commented on SPARK-8120:
--

Hm, I must have not looked carefully.  Sorry about the trouble!  I'll close the 
JIRA.

> Typos in warning message in sql/types.py
> 
>
> Key: SPARK-8120
> URL: https://issues.apache.org/jira/browse/SPARK-8120
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> See 
> [https://github.com/apache/spark/blob/3ba6fc515d6ea45c281bb81f648a38523be06383/python/pyspark/sql/types.py#L1093]
> Need to fix string concat + use of %






[jira] [Closed] (SPARK-8120) Typos in warning message in sql/types.py

2015-06-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-8120.

Resolution: Not A Problem

> Typos in warning message in sql/types.py
> 
>
> Key: SPARK-8120
> URL: https://issues.apache.org/jira/browse/SPARK-8120
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> See 
> [https://github.com/apache/spark/blob/3ba6fc515d6ea45c281bb81f648a38523be06383/python/pyspark/sql/types.py#L1093]
> Need to fix string concat + use of %






[jira] [Assigned] (SPARK-8240) string function: concat

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8240:
---

Assignee: Cheng Hao  (was: Apache Spark)

> string function: concat
> ---
>
> Key: SPARK-8240
> URL: https://issues.apache.org/jira/browse/SPARK-8240
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> concat(string|binary A, string|binary B...): string / binary
> Returns the string or bytes resulting from concatenating the strings or bytes 
> passed in as parameters in order. For example, concat('foo', 'bar') results 
> in 'foobar'. Note that this function can take any number of input strings.
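The semantics described above can be modelled in a few lines of pure Python, as a sketch only (this is not Spark's implementation). As in Hive, a NULL argument makes the whole result NULL:

```python
def concat(*args):
    # Pure-Python model of the concat semantics quoted above; a NULL (None)
    # argument propagates, making the whole result NULL.
    if any(a is None for a in args):
        return None
    return "".join(args)
```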






[jira] [Assigned] (SPARK-8240) string function: concat

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8240:
---

Assignee: Apache Spark  (was: Cheng Hao)

> string function: concat
> ---
>
> Key: SPARK-8240
> URL: https://issues.apache.org/jira/browse/SPARK-8240
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> concat(string|binary A, string|binary B...): string / binary
> Returns the string or bytes resulting from concatenating the strings or bytes 
> passed in as parameters in order. For example, concat('foo', 'bar') results 
> in 'foobar'. Note that this function can take any number of input strings.






[jira] [Commented] (SPARK-8240) string function: concat

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582849#comment-14582849
 ] 

Apache Spark commented on SPARK-8240:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6775

> string function: concat
> ---
>
> Key: SPARK-8240
> URL: https://issues.apache.org/jira/browse/SPARK-8240
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> concat(string|binary A, string|binary B...): string / binary
> Returns the string or bytes resulting from concatenating the strings or bytes 
> passed in as parameters in order. For example, concat('foo', 'bar') results 
> in 'foobar'. Note that this function can take any number of input strings.






[jira] [Assigned] (SPARK-8241) string function: concat_ws

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8241:
---

Assignee: Apache Spark  (was: Cheng Hao)

> string function: concat_ws
> --
>
> Key: SPARK-8241
> URL: https://issues.apache.org/jira/browse/SPARK-8241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> concat_ws(string SEP, string A, string B...): string
> concat_ws(string SEP, array&lt;string&gt;): string
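A pure-Python sketch of the two signatures above (varargs strings, or a single array of strings); this models the Hive-style behavior where NULL elements are skipped rather than propagated, and is illustrative only:

```python
def concat_ws(sep, *args):
    # Model of concat_ws: either varargs strings or one list of strings.
    # NULL (None) elements are skipped, not propagated.
    if len(args) == 1 and isinstance(args[0], (list, tuple)):
        args = args[0]
    return sep.join(a for a in args if a is not None)
```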






[jira] [Commented] (SPARK-8241) string function: concat_ws

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582850#comment-14582850
 ] 

Apache Spark commented on SPARK-8241:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6775

> string function: concat_ws
> --
>
> Key: SPARK-8241
> URL: https://issues.apache.org/jira/browse/SPARK-8241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> concat_ws(string SEP, string A, string B...): string
> concat_ws(string SEP, array&lt;string&gt;): string






[jira] [Assigned] (SPARK-8241) string function: concat_ws

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8241:
---

Assignee: Cheng Hao  (was: Apache Spark)

> string function: concat_ws
> --
>
> Key: SPARK-8241
> URL: https://issues.apache.org/jira/browse/SPARK-8241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> concat_ws(string SEP, string A, string B...): string
> concat_ws(string SEP, array&lt;string&gt;): string






[jira] [Commented] (SPARK-8129) Securely pass auth secrets to executors in standalone cluster mode

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582832#comment-14582832
 ] 

Apache Spark commented on SPARK-8129:
-

User 'kanzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6774

> Securely pass auth secrets to executors in standalone cluster mode
> --
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Priority: Critical
>
> Currently, when authentication is turned on, the standalone cluster manager 
> passes auth secrets to executors (also drivers in cluster mode) as java 
> options on the command line, which isn't secure. The passed secret can be 
> seen by anyone running 'ps' command, e.g.,
> bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> *-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker
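The core of the report is that anything passed as a command-line option shows up in `ps`, which prints each process's argument list. A minimal sketch of the alternative, passing the secret through the child's environment so it never appears in argv (the variable name SPARK_AUTH_SECRET here is hypothetical, not Spark's actual mechanism):

```python
import os
import subprocess
import sys

# The secret is placed in the child's environment, not its argument list,
# so `ps` (which shows argv) cannot reveal it.
child_env = dict(os.environ, SPARK_AUTH_SECRET="s3cr3t")
output = subprocess.check_output(
    [sys.executable, "-c",
     "import os; print(os.environ['SPARK_AUTH_SECRET'])"],
    env=child_env,
)
```

Environment variables are not a complete fix (they can still leak via /proc on some systems), but they avoid the trivial `ps` exposure shown above.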






[jira] [Commented] (SPARK-6892) Recovery from checkpoint will also reuse the application id when write eventLog in yarn-cluster mode

2015-06-11 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582829#comment-14582829
 ] 

Tathagata Das commented on SPARK-6892:
--

[~hshreedharan] Could you take a look at this? 
I think the event logging directory already exists, and that is what is causing this issue. 

> Recovery from checkpoint will also reuse the application id when write 
> eventLog in yarn-cluster mode
> 
>
> Key: SPARK-6892
> URL: https://issues.apache.org/jira/browse/SPARK-6892
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: yangping wu
>Priority: Critical
>
> When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, 
> I found it reuses the application id from before the failure (in my case 
> application_1428664056212_0016) when writing the Spark eventLog. But my 
> application id is now application_1428664056212_0017, so writing the eventLog 
> fails with the following stacktrace:
> {code}
> 15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' 
> failed, java.io.IOException: Target log file already exists 
> (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
> java.io.IOException: Target log file already exists 
> (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
>   at scala.Option.foreach(Option.scala:236)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> {code}
> This exception causes the job to fail.
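One way to avoid the collision described above is to make the event log file name unique per attempt, so a recovered application does not overwrite the previous attempt's log. A hypothetical sketch of such a naming scheme, not Spark's actual fix:

```python
import os

def event_log_path(log_dir, app_id, attempt_id=None):
    # Hypothetical scheme: include the attempt in the file name so a job
    # recovered from a checkpoint does not collide with an earlier attempt's
    # event log under the reused application id.
    name = app_id if attempt_id is None else "%s_%s" % (app_id, attempt_id)
    return os.path.join(log_dir, name)
```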






[jira] [Commented] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN

2015-06-11 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582796#comment-14582796
 ] 

Saisai Shao commented on SPARK-8297:


OK, thanks [~mridulm80], I will give it a try. From my understanding, Akka will 
also get notified if the connection is abruptly lost, but I haven't tested that 
yet.

> Scheduler backend is not notified in case node fails in YARN
> 
>
> Key: SPARK-8297
> URL: https://issues.apache.org/jira/browse/SPARK-8297
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0
> Environment: Spark on yarn - both client and cluster mode.
>Reporter: Mridul Muralidharan
>Priority: Critical
>
> When a node crashes, yarn detects the failure and notifies spark - but this 
> information is not propagated to scheduler backend (unlike in mesos mode, for 
> example).
> It results in repeated re-execution of stages (due to FetchFailedException on 
> shuffle side), resulting finally in application failure.






[jira] [Commented] (SPARK-8120) Typos in warning message in sql/types.py

2015-06-11 Thread Jihun Kang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582797#comment-14582797
 ] 

Jihun Kang commented on SPARK-8120:
---

I think it works as expected. I got the following output, and there are no errors.

{noformat}
field name __c__ can not be accessed in Python,use position to access it instead
   "use position to access it instead" % name)
{noformat}
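The reason the quoted line works is Python's implicit concatenation of adjacent string literals, which happens before % is applied; the suspected bug would only arise with an explicit +, since % binds more tightly than +. A minimal illustration (the message text is taken from the output above; the helper name is just for demonstration):

```python
# Adjacent string literals are joined at compile time, so % formats the
# whole message:
name = "__c__"
implicit = ("field name %s can not be accessed in Python, "
            "use position to access it instead" % name)

def explicit_plus(n):
    # With +, % binds tighter and applies only to the right-hand literal,
    # which contains no placeholder, so this raises TypeError.
    return "field name %s can not be accessed, " + "use position instead" % n
```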

> Typos in warning message in sql/types.py
> 
>
> Key: SPARK-8120
> URL: https://issues.apache.org/jira/browse/SPARK-8120
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> See 
> [https://github.com/apache/spark/blob/3ba6fc515d6ea45c281bb81f648a38523be06383/python/pyspark/sql/types.py#L1093]
> Need to fix string concat + use of %






[jira] [Assigned] (SPARK-8319) Update logic related to key ordering in shuffle dependencies

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8319:
---

Assignee: Apache Spark  (was: Josh Rosen)

> Update logic related to key ordering in shuffle dependencies
> 
>
> Key: SPARK-8319
> URL: https://issues.apache.org/jira/browse/SPARK-8319
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
> the shuffle dependency specifies a key ordering, but technically we only need 
> to fall back when an aggregator is also specified.  We should update the 
> fallback logic to handle this case so that the Tungsten optimizations can 
> apply to more workloads.
> I also noticed that the SQL Exchange operator performs defensive copying of 
> shuffle inputs when a key ordering is specified, but this is unnecessary: the 
> only shuffle manager that performs sorting on the map side is 
> SortShuffleManager, and it only performs sorting if an aggregator is 
> specified.  SQL never uses Spark's shuffle for performing aggregation, so 
> this copying is unnecessary.
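The proposed change to the fallback condition can be stated as a one-line predicate. A sketch of the logic only, not the actual Scala code in the shuffle manager:

```python
def must_fall_back(has_key_ordering, has_aggregator):
    # Old behavior: fall back to SortShuffleManager whenever a key ordering
    # is specified. Proposed behavior, per the description above: fall back
    # only when an aggregator is specified alongside the key ordering.
    return has_key_ordering and has_aggregator
```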






[jira] [Commented] (SPARK-8319) Update logic related to key ordering in shuffle dependencies

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582788#comment-14582788
 ] 

Apache Spark commented on SPARK-8319:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6773

> Update logic related to key ordering in shuffle dependencies
> 
>
> Key: SPARK-8319
> URL: https://issues.apache.org/jira/browse/SPARK-8319
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
> the shuffle dependency specifies a key ordering, but technically we only need 
> to fall back when an aggregator is also specified.  We should update the 
> fallback logic to handle this case so that the Tungsten optimizations can 
> apply to more workloads.
> I also noticed that the SQL Exchange operator performs defensive copying of 
> shuffle inputs when a key ordering is specified, but this is unnecessary: the 
> only shuffle manager that performs sorting on the map side is 
> SortShuffleManager, and it only performs sorting if an aggregator is 
> specified.  SQL never uses Spark's shuffle for performing aggregation, so 
> this copying is unnecessary.






[jira] [Assigned] (SPARK-8319) Update logic related to key ordering in shuffle dependencies

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8319:
---

Assignee: Josh Rosen  (was: Apache Spark)

> Update logic related to key ordering in shuffle dependencies
> 
>
> Key: SPARK-8319
> URL: https://issues.apache.org/jira/browse/SPARK-8319
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
> the shuffle dependency specifies a key ordering, but technically we only need 
> to fall back when an aggregator is also specified.  We should update the 
> fallback logic to handle this case so that the Tungsten optimizations can 
> apply to more workloads.
> I also noticed that the SQL Exchange operator performs defensive copying of 
> shuffle inputs when a key ordering is specified, but this is unnecessary: the 
> only shuffle manager that performs sorting on the map side is 
> SortShuffleManager, and it only performs sorting if an aggregator is 
> specified.  SQL never uses Spark's shuffle for performing aggregation, so 
> this copying is unnecessary.






[jira] [Updated] (SPARK-7158) collect and take return different results

2015-06-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7158:

Assignee: Cheng Hao

> collect and take return different results
> -
>
> Key: SPARK-7158
> URL: https://issues.apache.org/jira/browse/SPARK-7158
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.5.0
>
>
> Reported by [~rams]
> {code}
> import java.util.UUID
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(List(1,2,3), 2)
> val schema = StructType(List(StructField("index",IntegerType,true)))
> val df = sqlContext.createDataFrame(rdd.map(p => Row(p)), schema)
> def id:() => String = () => {UUID.randomUUID().toString()}
> def square:Int => Int = (x: Int) => {x * x}
> val dfWithId = df.withColumn("id",callUDF(id, StringType)).cache() //expect 
> the ID to have materialized at this point
> dfWithId.collect()
> //res0: Array[org.apache.spark.sql.Row] = 
> Array([1,43c7b8e2-b4a3-43ee-beff-0bb4b7d6c1b1], 
> [2,efd061be-e8cc-43fa-956e-cfd6e7355982], 
> [3,79b0baab-627c-4761-af0d-8995b8c5a125])
> val dfWithIdAndSquare = dfWithId.withColumn("square",callUDF(square, 
> IntegerType, col("index")))
> dfWithIdAndSquare.collect()
> //res1: Array[org.apache.spark.sql.Row] = 
> Array([1,a3b2e744-a0a1-40fe-8133-87a67660b4ab,1], 
> [2,0a7052a0-6071-4ef5-a25a-2670248ea5cd,4], 
> [3,209f269e-207a-4dfd-a186-738be5db2eff,9])
> //why are the IDs in lines 11 and 15 different?
> {code}
> The randomly generated IDs are the same if show (which uses take under the 
> hood) is used instead of collect.
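The reported behavior can be modeled outside Spark: a lazily re-executed plan containing a non-deterministic function yields fresh values on every execution, which is what happens when a derived DataFrame recomputes the UDF instead of reading the cached column (illustrative Python only, not the Spark code path):

```python
import uuid

def make_plan(rows):
    # Models an uncached plan containing a non-deterministic UDF:
    # every execution re-invokes uuid4 and yields fresh IDs.
    def execute():
        return [(r, str(uuid.uuid4())) for r in rows]
    return execute

plan = make_plan([1, 2, 3])
first = plan()   # analogous to dfWithId.collect()
second = plan()  # analogous to collecting a derived DataFrame that recomputes
# The same logical "column" yields different IDs on each execution:
assert [r[1] for r in first] != [r[1] for r in second]

# Materializing once and reusing the result (what caching was expected to do)
# keeps the IDs stable:
cached = plan()
assert [r[1] for r in cached] == [r[1] for r in cached]
```

This is why collect and take could disagree: whichever path re-executed the uncached portion of the plan regenerated the UUIDs.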






[jira] [Updated] (SPARK-8319) Update logic related to key ordering in shuffle dependencies

2015-06-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8319:
--
Summary: Update logic related to key ordering in shuffle dependencies  
(was: Update several pieces of shuffle logic related to key orderings)

> Update logic related to key ordering in shuffle dependencies
> 
>
> Key: SPARK-8319
> URL: https://issues.apache.org/jira/browse/SPARK-8319
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
> the shuffle dependency specifies a key ordering, but technically we only need 
> to fall back when an aggregator is also specified.  We should update the 
> fallback logic to handle this case so that the Tungsten optimizations can 
> apply to more workloads.
> I also noticed that the SQL Exchange operator performs defensive copying of 
> shuffle inputs when a key ordering is specified, but this is unnecessary: the 
> only shuffle manager that performs sorting on the map side is 
> SortShuffleManager, and it only performs sorting if an aggregator is 
> specified.  SQL never uses Spark's shuffle for performing aggregation, so 
> this copying is unnecessary.






[jira] [Resolved] (SPARK-7158) collect and take return different results

2015-06-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7158.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 5714
[https://github.com/apache/spark/pull/5714]

> collect and take return different results
> -
>
> Key: SPARK-7158
> URL: https://issues.apache.org/jira/browse/SPARK-7158
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
> Fix For: 1.5.0
>
>
> Reported by [~rams]
> {code}
> import java.util.UUID
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(List(1,2,3), 2)
> val schema = StructType(List(StructField("index",IntegerType,true)))
> val df = sqlContext.createDataFrame(rdd.map(p => Row(p)), schema)
> def id:() => String = () => {UUID.randomUUID().toString()}
> def square:Int => Int = (x: Int) => {x * x}
> val dfWithId = df.withColumn("id",callUDF(id, StringType)).cache() //expect 
> the ID to have materialized at this point
> dfWithId.collect()
> //res0: Array[org.apache.spark.sql.Row] = 
> Array([1,43c7b8e2-b4a3-43ee-beff-0bb4b7d6c1b1], 
> [2,efd061be-e8cc-43fa-956e-cfd6e7355982], 
> [3,79b0baab-627c-4761-af0d-8995b8c5a125])
> val dfWithIdAndSquare = dfWithId.withColumn("square",callUDF(square, 
> IntegerType, col("index")))
> dfWithIdAndSquare.collect()
> //res1: Array[org.apache.spark.sql.Row] = 
> Array([1,a3b2e744-a0a1-40fe-8133-87a67660b4ab,1], 
> [2,0a7052a0-6071-4ef5-a25a-2670248ea5cd,4], 
> [3,209f269e-207a-4dfd-a186-738be5db2eff,9])
> //why are the IDs in lines 11 and 15 different?
> {code}
> The randomly generated IDs are the same if show (which uses take under the 
> hood) is used instead of collect.






[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-06-11 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582768#comment-14582768
 ] 

Alexander Ulanov commented on SPARK-5575:
-

Hi Janani, 

There is already an implementation of DBN (and RBM) by [~gq]. You can find it 
here: https://github.com/witgo/spark/tree/ann-interface-gemm-dbn

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted 
> Boltzmann machines (RBM), deep belief networks (DBN), etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.






[jira] [Updated] (SPARK-8319) Update several pieces of shuffle logic related to key orderings

2015-06-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8319:
--
Summary: Update several pieces of shuffle logic related to key orderings  
(was: Enable Tungsten shuffle manager for some shuffles that specify key 
orderings)

> Update several pieces of shuffle logic related to key orderings
> ---
>
> Key: SPARK-8319
> URL: https://issues.apache.org/jira/browse/SPARK-8319
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
> the shuffle dependency specifies a key ordering, but technically we only need 
> to fall back when an aggregator is also specified.  We should update the 
> fallback logic to handle this case so that the Tungsten optimizations can 
> apply to more workloads.
> I also noticed that the SQL Exchange operator performs defensive copying of 
> shuffle inputs when a key ordering is specified, but this is unnecessary: the 
> only shuffle manager that performs sorting on the map side is 
> SortShuffleManager, and it only performs sorting if an aggregator is 
> specified.  SQL never uses Spark's shuffle for performing aggregation, so 
> this copying is unnecessary.






[jira] [Updated] (SPARK-8319) Enable Tungsten shuffle manager for some shuffles that specify key orderings

2015-06-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8319:
--
Description: 
The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
the shuffle dependency specifies a key ordering, but technically we only need 
to fall back when an aggregator is also specified.  We should update the 
fallback logic to handle this case so that the Tungsten optimizations can apply 
to more workloads.

I also noticed that the SQL Exchange operator performs defensive copying of 
shuffle inputs when a key ordering is specified, but this is unnecessary: the 
only shuffle manager that performs sorting on the map side is 
SortShuffleManager, and it only performs sorting if an aggregator is specified. 
 SQL never uses Spark's shuffle for performing aggregation, so this copying is 
unnecessary.

  was:
The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
the shuffle dependency specifies a key ordering, but technically we only need 
to fall back when an aggregator is also specified.  We should update the 
fallback logic to handle this case so that the Tungsten optimizations can apply 
to more workloads.

I also noticed that the SQL Exchange operator performs defensive copying of 
shuffle inputs when a key ordering is specified, but this is unnecessary: 


> Enable Tungsten shuffle manager for some shuffles that specify key orderings
> 
>
> Key: SPARK-8319
> URL: https://issues.apache.org/jira/browse/SPARK-8319
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
> the shuffle dependency specifies a key ordering, but technically we only need 
> to fall back when an aggregator is also specified.  We should update the 
> fallback logic to handle this case so that the Tungsten optimizations can 
> apply to more workloads.
> I also noticed that the SQL Exchange operator performs defensive copying of 
> shuffle inputs when a key ordering is specified, but this is unnecessary: the 
> only shuffle manager that performs sorting on the map side is 
> SortShuffleManager, and it only performs sorting if an aggregator is 
> specified.  SQL never uses Spark's shuffle for performing aggregation, so 
> this copying is unnecessary.






[jira] [Updated] (SPARK-8319) Enable Tungsten shuffle manager for some shuffles that specify key orderings

2015-06-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8319:
--
Description: 
The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
the shuffle dependency specifies a key ordering, but technically we only need 
to fall back when an aggregator is also specified.  We should update the 
fallback logic to handle this case so that the Tungsten optimizations can apply 
to more workloads.

I also noticed that the SQL Exchange operator performs defensive copying of 
shuffle inputs when a key ordering is specified, but this is unnecessary: 

  was:The Tungsten ShuffleManager falls back to regular SortShuffleManager 
whenever the shuffle dependency specifies a key ordering, but technically we 
only need to fall back when an aggregator is also specified.  We should update 
the fallback logic to handle this case so that the Tungsten optimizations can 
apply to more workloads.


> Enable Tungsten shuffle manager for some shuffles that specify key orderings
> 
>
> Key: SPARK-8319
> URL: https://issues.apache.org/jira/browse/SPARK-8319
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
> the shuffle dependency specifies a key ordering, but technically we only need 
> to fall back when an aggregator is also specified.  We should update the 
> fallback logic to handle this case so that the Tungsten optimizations can 
> apply to more workloads.
> I also noticed that the SQL Exchange operator performs defensive copying of 
> shuffle inputs when a key ordering is specified, but this is unnecessary: 






[jira] [Updated] (SPARK-8318) Spark Streaming Starter JIRAs

2015-06-11 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-8318:
-
Description: This is a master JIRA to collect together all starter tasks 
related to Spark Streaming. These are simple tasks that contributors can do to 
get familiar with the process of contributing.  (was: This is a master JIRA to 
collect together all starter tasks related to Spark Streaming)

> Spark Streaming Starter JIRAs
> -
>
> Key: SPARK-8318
> URL: https://issues.apache.org/jira/browse/SPARK-8318
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> This is a master JIRA to collect together all starter tasks related to Spark 
> Streaming. These are simple tasks that contributors can do to get familiar 
> with the process of contributing.






[jira] [Updated] (SPARK-8319) Enable Tungsten shuffle manager for some shuffles that specify key orderings

2015-06-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8319:
--
Component/s: SQL

> Enable Tungsten shuffle manager for some shuffles that specify key orderings
> 
>
> Key: SPARK-8319
> URL: https://issues.apache.org/jira/browse/SPARK-8319
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
> the shuffle dependency specifies a key ordering, but technically we only need 
> to fall back when an aggregator is also specified.  We should update the 
> fallback logic to handle this case so that the Tungsten optimizations can 
> apply to more workloads.






[jira] [Updated] (SPARK-8318) Spark Streaming Starter JIRAs

2015-06-11 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-8318:
-
Labels: starter  (was: )

> Spark Streaming Starter JIRAs
> -
>
> Key: SPARK-8318
> URL: https://issues.apache.org/jira/browse/SPARK-8318
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> This is a master JIRA to collect together all starter tasks related to Spark 
> Streaming






[jira] [Created] (SPARK-8320) Add example in streaming programming guide that shows union of multiple input streams

2015-06-11 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-8320:


 Summary: Add example in streaming programming guide that shows 
union of multiple input streams
 Key: SPARK-8320
 URL: https://issues.apache.org/jira/browse/SPARK-8320
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Tathagata Das
Priority: Minor


The section on "Level of Parallelism in Data Receiving" has a Scala and a Java 
example for union of multiple input streams. A Python example should be added.
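A sketch of what such a Python example might look like, mirroring the Scala/Java snippets in that section (hostnames and ports are placeholders, and this requires a running Spark installation, so it is illustrative only):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="UnionOfInputStreams")
ssc = StreamingContext(sc, 1)  # 1-second batches

# Create several input streams and union them so they are processed
# as a single DStream, as the programming guide recommends.
num_streams = 3
streams = [ssc.socketTextStream("localhost", 9000 + i)
           for i in range(num_streams)]
unified = ssc.union(*streams)
unified.pprint()

ssc.start()
ssc.awaitTermination()
```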






[jira] [Created] (SPARK-8319) Enable Tungsten shuffle manager for some shuffles that specify key orderings

2015-06-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8319:
-

 Summary: Enable Tungsten shuffle manager for some shuffles that 
specify key orderings
 Key: SPARK-8319
 URL: https://issues.apache.org/jira/browse/SPARK-8319
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Reporter: Josh Rosen
Assignee: Josh Rosen


The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever 
the shuffle dependency specifies a key ordering, but technically we only need 
to fall back when an aggregator is also specified.  We should update the 
fallback logic to handle this case so that the Tungsten optimizations can apply 
to more workloads.






[jira] [Created] (SPARK-8318) Spark Streaming Starter JIRAs

2015-06-11 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-8318:


 Summary: Spark Streaming Starter JIRAs
 Key: SPARK-8318
 URL: https://issues.apache.org/jira/browse/SPARK-8318
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Priority: Minor


This is a master JIRA to collect together all starter tasks related to Spark 
Streaming






[jira] [Commented] (SPARK-8311) saveAsTextFile with Hadoop1 could lead to errors

2015-06-11 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582741#comment-14582741
 ] 

Patrick Wendell commented on SPARK-8311:


Is this related to or the same as SPARK-8057?

> saveAsTextFile with Hadoop1 could lead to errors
> 
>
> Key: SPARK-8311
> URL: https://issues.apache.org/jira/browse/SPARK-8311
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Shivaram Venkataraman
>
> I've run into this bug a couple of times and wanted to document things I have 
> found so far in a JIRA. From what I see if an application is linked to 
> Hadoop1 and running on a Spark 1.3.1 + Hadoop1 cluster then the 
> saveAsTextFile call consistently fails with errors of the form
> {code}
> 15/06/11 19:47:10 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 3.0 
> (TID 13, ip-10-212-141-222.us-west-2.compute.internal): 
> java.lang.IncompatibleClassChangeError: Found class 
> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
> at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
> {code}
> This does not happen in 1.2.1
> I think the bug is caused by the following commit
> https://github.com/apache/spark/commit/fde6945417355ae57500b67d034c9cad4f20d240
>  where the function `commitTask` assumes that the mrTaskContext is always 
> a `mapreduce.TaskContext` while it is a `mapred.TaskContext` in Hadoop1.  But 
> this is just a hypothesis as I haven't tried reverting this to see if the 
> problem goes away
> cc [~liancheng]






[jira] [Comment Edited] (SPARK-5493) Support proxy users under kerberos

2015-06-11 Thread Jesika Haria (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582720#comment-14582720
 ] 

Jesika Haria edited comment on SPARK-5493 at 6/12/15 12:20 AM:
---

Trying to support impersonation with pyspark. It works with the proxy-user 
flag set on the command line:
{code}
pyspark --master yarn-client --proxy-user foo
{code}
However, I actually need to set up the Spark Context programmatically via the 
Python API, but could find no documentation for this. Is pyspark impersonation 
via proxy-user even supported at this time? In the absence of this 
functionality, what is the recommended way of supporting impersonation 
(especially if setting the HADOOP_PROXY_USER env variable is discouraged in 
production)? 

Or if there is a spark config property that corresponds to the proxy-user flag, 
that would be great too (cannot see one at 
https://spark.apache.org/docs/latest/configuration.html) 


was (Author: jesika):
Trying to support impersonation with pyspark. It works with the proxy-user 
flag set on the command line:
{code}
pyspark --master yarn-client --proxy-user foo
{code}
However, I actually need to set up the Spark Context programmatically via the 
Python API, but could find no documentation for this. Is pyspark impersonation 
via proxy-user even supported at this time? In the absence of this 
functionality, what is the recommended way of supporting impersonation 
(especially if setting the HADOOP_PROXY_USER env variable is discouraged in 
production)? 

> Support proxy users under kerberos
> --
>
> Key: SPARK-5493
> URL: https://issues.apache.org/jira/browse/SPARK-5493
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Brock Noland
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0
>
>
> When using kerberos, services may want to use spark-submit to submit jobs as 
> a separate user. For example a service like hive might want to submit jobs as 
> a client user.






[jira] [Commented] (SPARK-8317) Do not push sort into shuffle in Exchange operator

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582724#comment-14582724
 ] 

Apache Spark commented on SPARK-8317:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6772

> Do not push sort into shuffle in Exchange operator
> --
>
> Key: SPARK-8317
> URL: https://issues.apache.org/jira/browse/SPARK-8317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In some cases, Spark SQL pushes sorting operations into the shuffle layer by 
> specifying a key ordering as part of the shuffle dependency. I think that we 
> should not do this:
> - Since we do not delegate aggregation to Spark's shuffle, specifying the 
> keyOrdering as part of the shuffle has no effect on the shuffle map side.
> - By performing the sort ourselves (by inserting a sort operator after the 
> shuffle instead), we can use the Exchange planner to choose specialized 
> sorting implementations based on the types of rows being sorted.
> - We can remove some complexity from SqlSerializer2 by not requiring it to 
> know about sort orderings, since SQL's own sort operators will already 
> perform the necessary defensive copying.
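The proposed planning change can be modeled as reordering two steps: instead of asking the shuffle to sort, shuffle first and then run a dedicated sort operator on each output partition (a toy model; the `shuffle` function here is just a hash partitioner, not Spark's Exchange):

```python
def shuffle(records, num_partitions):
    # Toy hash partitioner standing in for the shuffle write/read.
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def plan_sort_after_shuffle(records, num_partitions):
    # Proposed plan: shuffle first, then sort each partition with a
    # separate sort operator chosen by the planner, which may pick a
    # specialized implementation per row type.
    return [sorted(p) for p in shuffle(records, num_partitions)]

data = [("b", 2), ("a", 1), ("b", 1), ("a", 3)]
for partition in plan_sort_after_shuffle(data, 2):
    # Each post-shuffle partition ends up key-ordered, matching what a
    # keyOrdering in the shuffle dependency would have guaranteed.
    assert partition == sorted(partition)
```

Because the ordering guarantee now comes from an explicit operator after the exchange, the shuffle layer no longer needs to know about sort orderings at all.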






[jira] [Assigned] (SPARK-8317) Do not push sort into shuffle in Exchange operator

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8317:
---

Assignee: Josh Rosen  (was: Apache Spark)

> Do not push sort into shuffle in Exchange operator
> --
>
> Key: SPARK-8317
> URL: https://issues.apache.org/jira/browse/SPARK-8317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In some cases, Spark SQL pushes sorting operations into the shuffle layer by 
> specifying a key ordering as part of the shuffle dependency. I think that we 
> should not do this:
> - Since we do not delegate aggregation to Spark's shuffle, specifying the 
> keyOrdering as part of the shuffle has no effect on the shuffle map side.
> - By performing the sort ourselves (by inserting a sort operator after the 
> shuffle instead), we can use the Exchange planner to choose specialized 
> sorting implementations based on the types of rows being sorted.
> - We can remove some complexity from SqlSerializer2 by not requiring it to 
> know about sort orderings, since SQL's own sort operators will already 
> perform the necessary defensive copying.






[jira] [Assigned] (SPARK-8317) Do not push sort into shuffle in Exchange operator

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8317:
---

Assignee: Apache Spark  (was: Josh Rosen)

> Do not push sort into shuffle in Exchange operator
> --
>
> Key: SPARK-8317
> URL: https://issues.apache.org/jira/browse/SPARK-8317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> In some cases, Spark SQL pushes sorting operations into the shuffle layer by 
> specifying a key ordering as part of the shuffle dependency. I think that we 
> should not do this:
> - Since we do not delegate aggregation to Spark's shuffle, specifying the 
> keyOrdering as part of the shuffle has no effect on the shuffle map side.
> - By performing the sort ourselves (by inserting a sort operator after the 
> shuffle instead), we can use the Exchange planner to choose specialized 
> sorting implementations based on the types of rows being sorted.
> - We can remove some complexity from SqlSerializer2 by not requiring it to 
> know about sort orderings, since SQL's own sort operators will already 
> perform the necessary defensive copying.






[jira] [Commented] (SPARK-5493) Support proxy users under kerberos

2015-06-11 Thread Jesika Haria (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582720#comment-14582720
 ] 

Jesika Haria commented on SPARK-5493:
-

Trying to support impersonation with pyspark. It works with the proxy-user 
flag set on the command line:
{code}
pyspark --master yarn-client --proxy-user foo
{code}
However, I actually need to set up the Spark Context programmatically via the 
Python API, but could find no documentation for this. Is pyspark impersonation 
via proxy-user even supported at this time? In the absence of this 
functionality, what is the recommended way of supporting impersonation 
(especially if setting the HADOOP_PROXY_USER env variable is discouraged in 
production)? 

> Support proxy users under kerberos
> --
>
> Key: SPARK-5493
> URL: https://issues.apache.org/jira/browse/SPARK-5493
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Brock Noland
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0
>
>
> When using kerberos, services may want to use spark-submit to submit jobs as 
> a separate user. For example a service like hive might want to submit jobs as 
> a client user.






[jira] [Resolved] (SPARK-8208) math function: ceiling

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8208.

   Resolution: Fixed
Fix Version/s: 1.5.0

> math function: ceiling
> --
>
> Key: SPARK-8208
> URL: https://issues.apache.org/jira/browse/SPARK-8208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> We already have ceil -- just need to create an alias for it in 
> FunctionRegistry.






[jira] [Resolved] (SPARK-8211) math function: radians

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8211.

   Resolution: Fixed
Fix Version/s: 1.5.0

> math function: radians
> --
>
> Key: SPARK-8211
> URL: https://issues.apache.org/jira/browse/SPARK-8211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> Alias toRadians -> radians in FunctionRegistry.






[jira] [Resolved] (SPARK-8251) string function: alias upper / ucase

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8251.

   Resolution: Fixed
Fix Version/s: 1.5.0

> string function: alias upper / ucase
> 
>
> Key: SPARK-8251
> URL: https://issues.apache.org/jira/browse/SPARK-8251
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> Alias upper / ucase in FunctionRegistry.






[jira] [Resolved] (SPARK-8229) conditional function: isnotnull

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8229.

   Resolution: Fixed
Fix Version/s: 1.5.0

> conditional function: isnotnull
> ---
>
> Key: SPARK-8229
> URL: https://issues.apache.org/jira/browse/SPARK-8229
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> Just need to register it in the FunctionRegistry.






[jira] [Resolved] (SPARK-8216) math function: rename log -> ln

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8216.

   Resolution: Fixed
Fix Version/s: 1.5.0

> math function: rename log -> ln
> ---
>
> Key: SPARK-8216
> URL: https://issues.apache.org/jira/browse/SPARK-8216
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> Rename expression Log -> Ln.
> Also create aliased DataFrame functions and update FunctionRegistry.






[jira] [Resolved] (SPARK-8225) math function: alias sign / signum

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8225.

   Resolution: Fixed
Fix Version/s: 1.5.0

> math function: alias sign / signum
> --
>
> Key: SPARK-8225
> URL: https://issues.apache.org/jira/browse/SPARK-8225
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> Alias them in FunctionRegistry.






[jira] [Resolved] (SPARK-8228) conditional function: isnull

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8228.

   Resolution: Fixed
Fix Version/s: 1.5.0

> conditional function: isnull
> 
>
> Key: SPARK-8228
> URL: https://issues.apache.org/jira/browse/SPARK-8228
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> Just need to register it in FunctionRegistry.






[jira] [Resolved] (SPARK-8222) math function: alias power / pow

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8222.

   Resolution: Fixed
Fix Version/s: 1.5.0

> math function: alias power / pow
> 
>
> Key: SPARK-8222
> URL: https://issues.apache.org/jira/browse/SPARK-8222
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> Add power to FunctionRegistry.






[jira] [Resolved] (SPARK-8219) math function: negative

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8219.

   Resolution: Fixed
Fix Version/s: 1.5.0

> math function: negative
> ---
>
> Key: SPARK-8219
> URL: https://issues.apache.org/jira/browse/SPARK-8219
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> This is just an alias for UnaryMinus. Only add it to FunctionRegistry, and 
> not DataFrame.






[jira] [Resolved] (SPARK-8210) math function: degrees

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8210.

   Resolution: Fixed
Fix Version/s: 1.5.0

> math function: degrees
> --
>
> Key: SPARK-8210
> URL: https://issues.apache.org/jira/browse/SPARK-8210
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> Alias todegrees -> degrees.






[jira] [Resolved] (SPARK-8250) string function: alias lower/lcase

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8250.

   Resolution: Fixed
Fix Version/s: 1.5.0

> string function: alias lower/lcase
> --
>
> Key: SPARK-8250
> URL: https://issues.apache.org/jira/browse/SPARK-8250
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> Alias lower/lcase in FunctionRegistry.






[jira] [Resolved] (SPARK-8205) conditional function: nvl

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8205.

   Resolution: Fixed
Fix Version/s: 1.5.0

> conditional function: nvl
> -
>
> Key: SPARK-8205
> URL: https://issues.apache.org/jira/browse/SPARK-8205
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> nvl(T value, T default_value): T
> Returns default_value if value is null, else returns value (as of Hive 0.11).
> We already have this (called Coalesce). Just need to register an alias for it 
> in FunctionRegistry.
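
The intended semantics can be sketched in a few lines. This is a toy Python analogue of the SQL behavior, not the Spark/Coalesce implementation:

```python
def nvl(value, default_value):
    # Return default_value when value is null (None), else value itself.
    return default_value if value is None else value

print(nvl(None, 0))  # 0
print(nvl(5, 0))     # 5
```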






[jira] [Resolved] (SPARK-8201) conditional function: if

2015-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8201.

   Resolution: Fixed
Fix Version/s: 1.5.0

> conditional function: if
> 
>
> Key: SPARK-8201
> URL: https://issues.apache.org/jira/browse/SPARK-8201
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> We already have an If expression. Just need to register it in 
> FunctionRegistry.






[jira] [Resolved] (SPARK-7824) Collapsing operator reordering and constant folding into a single batch to push down the single side.

2015-06-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7824.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6351
[https://github.com/apache/spark/pull/6351]

> Collapsing operator reordering and constant folding into a single batch to 
> push down the single side.
> -
>
> Key: SPARK-7824
> URL: https://issues.apache.org/jira/browse/SPARK-7824
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Zhongshuai Pei
> Fix For: 1.5.0
>
>
> SQL:
> {noformat}
> select * from tableA join tableB on (a > 3 and b = d) or (a > 3 and b = e)
> {noformat}
> Plan before modify
> {noformat}
> == Optimized Logical Plan ==
> Project [a#293,b#294,c#295,d#296,e#297]
>  Join Inner, Some(((a#293 > 3) && ((b#294 = d#296) || (b#294 = e#297
>   MetastoreRelation default, tablea, None
>   MetastoreRelation default, tableb, None
> {noformat}
> Plan after modify
> {noformat}
> == Optimized Logical Plan ==
> Project [a#293,b#294,c#295,d#296,e#297]
>  Join Inner, Some(((b#294 = d#296) || (b#294 = e#297)))
>   Filter (a#293 > 3)
>MetastoreRelation default, tablea, None
>   MetastoreRelation default, tableb, None
> {noformat}
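
The rewrite shown above factors the conjunct common to both sides of the OR out of the join condition so it can be pushed below the join as a filter. A toy sketch of that factoring step, illustrative only and not the Catalyst implementation:

```python
def factor_common_conjunct(disjuncts):
    """Given (A and B) or (A and C) as a list of predicate sets,
    split out the conjuncts shared by every disjunct."""
    common = set.intersection(*disjuncts)
    residual = [d - common for d in disjuncts]
    return common, residual

common, residual = factor_common_conjunct([
    {"a > 3", "b = d"},   # first side of the OR
    {"a > 3", "b = e"},   # second side of the OR
])
print(common)    # {'a > 3'}  -> pushed down as Filter (a#293 > 3)
print(residual)  # [{'b = d'}, {'b = e'}] -> remaining join condition
```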






[jira] [Updated] (SPARK-7824) Collapsing operator reordering and constant folding into a single batch to push down the single side.

2015-06-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7824:

Assignee: Zhongshuai Pei

> Collapsing operator reordering and constant folding into a single batch to 
> push down the single side.
> -
>
> Key: SPARK-7824
> URL: https://issues.apache.org/jira/browse/SPARK-7824
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Zhongshuai Pei
>Assignee: Zhongshuai Pei
> Fix For: 1.5.0
>
>
> SQL:
> {noformat}
> select * from tableA join tableB on (a > 3 and b = d) or (a > 3 and b = e)
> {noformat}
> Plan before modify
> {noformat}
> == Optimized Logical Plan ==
> Project [a#293,b#294,c#295,d#296,e#297]
>  Join Inner, Some(((a#293 > 3) && ((b#294 = d#296) || (b#294 = e#297
>   MetastoreRelation default, tablea, None
>   MetastoreRelation default, tableb, None
> {noformat}
> Plan after modify
> {noformat}
> == Optimized Logical Plan ==
> Project [a#293,b#294,c#295,d#296,e#297]
>  Join Inner, Some(((b#294 = d#296) || (b#294 = e#297)))
>   Filter (a#293 > 3)
>MetastoreRelation default, tablea, None
>   MetastoreRelation default, tableb, None
> {noformat}






[jira] [Commented] (SPARK-7780) The intercept in LogisticRegressionWithLBFGS should not be regularized

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582707#comment-14582707
 ] 

Apache Spark commented on SPARK-7780:
-

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/6771

> The intercept in LogisticRegressionWithLBFGS should not be regularized
> --
>
> Key: SPARK-7780
> URL: https://issues.apache.org/jira/browse/SPARK-7780
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: DB Tsai
>
> The intercept in Logistic Regression represents a prior on the categories and 
> should not be regularized. In MLlib, regularization is handled through 
> `Updater`, and the `Updater` penalizes all the components without excluding 
> the intercept, which results in poor training accuracy when regularization is 
> used.
> The new implementation in the ML framework handles this properly, and we 
> should call the ML implementation from MLlib, since the majority of users are 
> still using the MLlib API. 
> Note that both of them perform feature scaling to improve convergence, and 
> the only difference is that the ML version doesn't regularize the intercept. 
> As a result, when lambda is zero, they converge to the same solution.
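
As a minimal illustration of the distinction (toy code, not the MLlib `Updater` API): an L2 penalty should sum over the feature coefficients only, leaving the intercept out.

```python
def l2_penalty(coefficients, intercept, lam):
    # Penalize only the feature coefficients; the intercept encodes the
    # class prior and should not be shrunk toward zero.
    return lam * sum(w * w for w in coefficients)

# With lam = 0 the penalty vanishes either way, which is why the two
# implementations converge to the same solution when lambda is zero.
print(l2_penalty([1.0, 2.0], intercept=5.0, lam=0.1))  # 0.5
```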






[jira] [Assigned] (SPARK-8316) Upgrade Maven to 3.3.3

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8316:
---

Assignee: Apache Spark

> Upgrade Maven to 3.3.3
> --
>
> Key: SPARK-8316
> URL: https://issues.apache.org/jira/browse/SPARK-8316
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>Assignee: Apache Spark
>Priority: Minor
>
> Maven versions prior to 3.3 apparently have some bugs.
> See: https://github.com/apache/spark/pull/6492#issuecomment-111001101






[jira] [Commented] (SPARK-8316) Upgrade Maven to 3.3.3

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582701#comment-14582701
 ] 

Apache Spark commented on SPARK-8316:
-

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/6770

> Upgrade Maven to 3.3.3
> --
>
> Key: SPARK-8316
> URL: https://issues.apache.org/jira/browse/SPARK-8316
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maven versions prior to 3.3 apparently have some bugs.
> See: https://github.com/apache/spark/pull/6492#issuecomment-111001101






[jira] [Assigned] (SPARK-8316) Upgrade Maven to 3.3.3

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8316:
---

Assignee: (was: Apache Spark)

> Upgrade Maven to 3.3.3
> --
>
> Key: SPARK-8316
> URL: https://issues.apache.org/jira/browse/SPARK-8316
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maven versions prior to 3.3 apparently have some bugs.
> See: https://github.com/apache/spark/pull/6492#issuecomment-111001101






[jira] [Created] (SPARK-8317) Do not push sort into shuffle in Exchange operator

2015-06-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8317:
-

 Summary: Do not push sort into shuffle in Exchange operator
 Key: SPARK-8317
 URL: https://issues.apache.org/jira/browse/SPARK-8317
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


In some cases, Spark SQL pushes sorting operations into the shuffle layer by 
specifying a key ordering as part of the shuffle dependency. I think that we 
should not do this:

- Since we do not delegate aggregation to Spark's shuffle, specifying the 
keyOrdering as part of the shuffle has no effect on the shuffle map side.
- By performing the shuffle ourselves (by inserting a sort operator after the 
shuffle instead), we can use the Exchange planner to choose specialized sorting 
implementations based on the types of rows being sorted.
- We can remove some complexity from SqlSerializer2 by not requiring it to know 
about sort orderings, since SQL's own sort operators will already perform the 
necessary defensive copying.
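
The proposal in the second bullet — shuffle first, then run a separate sort operator over each output partition — can be sketched with a toy hash shuffle. This is illustrative Python, not Spark's Exchange code:

```python
rows = [(3, "c"), (1, "a"), (2, "b"), (1, "d")]
num_partitions = 2

# Shuffle step: hash-partition rows by key, with no key ordering
# attached to the shuffle itself.
partitions = [[] for _ in range(num_partitions)]
for key, value in rows:
    partitions[hash(key) % num_partitions].append((key, value))

# Separate sort operator after the shuffle: each partition is sorted
# independently, so a planner could substitute a specialized sort here
# based on the row type being sorted.
sorted_partitions = [sorted(p) for p in partitions]
```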






[jira] [Comment Edited] (SPARK-8301) Improve UTF8String substring/startsWith/endsWith/contains performance

2015-06-11 Thread Tarek Auel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582691#comment-14582691
 ] 

Tarek Auel edited comment on SPARK-8301 at 6/11/15 11:45 PM:
-

Another approach could be:

(0 until b.length).forall((i) => b(i) == bytes(i))

In theory this could be parallelised.


was (Author: tarekauel):
Another approach could be:

(0 until b.length).forall((i) => b(i) == bytes(i))

In theory this could be parallelised,

> Improve UTF8String substring/startsWith/endsWith/contains performance
> -
>
> Key: SPARK-8301
> URL: https://issues.apache.org/jira/browse/SPARK-8301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Many functions in UTF8String are unnecessarily expensive.






[jira] [Commented] (SPARK-8301) Improve UTF8String substring/startsWith/endsWith/contains performance

2015-06-11 Thread Tarek Auel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582691#comment-14582691
 ] 

Tarek Auel commented on SPARK-8301:
---

Another approach could be:

(0 until b.length).forall((i) => b(i) == bytes(i))

In theory this could be parallelised.
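
A Python analogue of the suggested forall check, applied to a startsWith-style byte comparison (a sketch, not the UTF8String code):

```python
def starts_with(data: bytes, prefix: bytes) -> bool:
    # Compare the first len(prefix) bytes element-wise; all() short-circuits
    # on the first mismatch, and in principle the range could be checked in
    # parallel, as the comment suggests.
    if len(prefix) > len(data):
        return False
    return all(prefix[i] == data[i] for i in range(len(prefix)))

print(starts_with(b"spark-sql", b"spark"))  # True
print(starts_with(b"spark-sql", b"sql"))    # False
```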

> Improve UTF8String substring/startsWith/endsWith/contains performance
> -
>
> Key: SPARK-8301
> URL: https://issues.apache.org/jira/browse/SPARK-8301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Many functions in UTF8String are unnecessarily expensive.






[jira] [Created] (SPARK-8316) Upgrade Maven to 3.3.3

2015-06-11 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-8316:
---

 Summary: Upgrade Maven to 3.3.3
 Key: SPARK-8316
 URL: https://issues.apache.org/jira/browse/SPARK-8316
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Nicholas Chammas
Priority: Minor


Maven versions prior to 3.3 apparently have some bugs.

See: https://github.com/apache/spark/pull/6492#issuecomment-111001101






[jira] [Commented] (SPARK-7157) Add approximate stratified sampling to DataFrame

2015-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582684#comment-14582684
 ] 

Apache Spark commented on SPARK-7157:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/6769

> Add approximate stratified sampling to DataFrame
> 
>
> Key: SPARK-7157
> URL: https://issues.apache.org/jira/browse/SPARK-7157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>Priority: Minor
>
> def sampleBy(c






[jira] [Assigned] (SPARK-7157) Add approximate stratified sampling to DataFrame

2015-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7157:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

> Add approximate stratified sampling to DataFrame
> 
>
> Key: SPARK-7157
> URL: https://issues.apache.org/jira/browse/SPARK-7157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>Priority: Minor
>
> def sampleBy(c





