[jira] [Commented] (SPARK-8119) HeartbeatReceiver should not adjust application executor resources

2016-02-16 Thread Zhen Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150055#comment-15150055
 ] 

Zhen Peng commented on SPARK-8119:
--

Hi [~srowen], I think this is a really serious bug. Is there any reason not to 
back-port it to 1.4.x?

> HeartbeatReceiver should not adjust application executor resources
> --
>
> Key: SPARK-8119
> URL: https://issues.apache.org/jira/browse/SPARK-8119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.5.0
>
>
> Dynamic allocation sets the total executor count to a small number when it 
> wants to kill some executors.
> But even when dynamic allocation is disabled, Spark will also set the total executor count.
> This causes the following problem: when an executor fails, no replacement 
> executor is ever brought up by Spark.
> === EDIT by andrewor14 ===
> The issue is that the AM forgets about the original number of executors it 
> wants after calling sc.killExecutor. Even if dynamic allocation is not 
> enabled, this is still possible because of heartbeat timeouts.
> I think the problem is that sc.killExecutor is used incorrectly in 
> HeartbeatReceiver. The intention of the method is to permanently adjust the 
> number of executors the application will get. In HeartbeatReceiver, however, 
> this is used as a best-effort mechanism to ensure that the timed out executor 
> is dead.
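
As an illustration of the semantic difference described above, here is a minimal, hypothetical driver-side sketch (not the actual HeartbeatReceiver code); sc.killExecutor is the real SparkContext developer API, while the executor id and surrounding logic are assumed:

{code}
// Assume "3" is the id of an executor whose heartbeats timed out (made-up value).
val timedOutExecutorId = "3"

// killExecutor permanently lowers the application's executor target by one, so
// without dynamic allocation the lost executor is never replaced.
val acknowledged: Boolean = sc.killExecutor(timedOutExecutorId)

// A "best effort, make sure it is really dead" kill would instead need to keep the
// original executor target unchanged so the cluster manager starts a replacement.
{code}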






[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150053#comment-15150053
 ] 

Xiao Li commented on SPARK-13333:
-

Yeah, the same series of random numbers. 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails: randn produces the same results on the original DataFrame and on 
> the copy before unionAll, but no longer does so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.{col, randn}
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}






[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150050#comment-15150050
 ] 

Liang-Chi Hsieh commented on SPARK-13333:
-

But when you set deterministic to true, each of your data partitions will get 
the same random values, right?

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails: randn produces the same results on the original DataFrame and on 
> the copy before unionAll, but no longer does so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.{col, randn}
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}






[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150042#comment-15150042
 ] 

Xiao Li commented on SPARK-13333:
-

Yeah. I realized it when fixing this problem. Thus, in the PR, I just added 
another parameter `deterministic` for rand and randn. If necessary, users can 
set `deterministic` to true.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails: randn produces the same results on the original DataFrame and on 
> the copy before unionAll, but no longer does so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.{col, randn}
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}






[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150008#comment-15150008
 ] 

Liang-Chi Hsieh commented on SPARK-13333:
-

If you don't attach a partition id, wouldn't each of your data partitions have 
the same random numbers?
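
For illustration only, a tiny standalone sketch (plain Scala, not Spark internals; the per-partition seeding scheme is an assumption about how rand/randn behave) of why the partition id needs to be mixed into the seed:

{code}
import scala.util.Random

val seed = 12345L
val partitionIds = Seq(0, 1, 2)

// Seeding every partition identically: every partition produces the same sequence.
val same  = partitionIds.map(_ => new Random(seed).nextGaussian())      // three equal values

// Mixing the partition index into the seed: each partition produces its own sequence.
val mixed = partitionIds.map(p => new Random(seed + p).nextGaussian())  // three distinct values
{code}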

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails: randn produces the same results on the original DataFrame and on 
> the copy before unionAll, but no longer does so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.{col, randn}
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}






[jira] [Commented] (SPARK-13249) Filter null keys for inner join

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149986#comment-15149986
 ] 

Apache Spark commented on SPARK-13249:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11235

> Filter null keys for inner join
> ---
>
> Key: SPARK-13249
> URL: https://issues.apache.org/jira/browse/SPARK-13249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> For an inner join, join keys containing null will never match each other, so we 
> could insert a Filter before the inner join (which could be pushed down); then we 
> don't need to check the nullability of keys while joining.
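
For illustration, a minimal sketch of the proposed rewrite in DataFrame terms (toy data, an assumed key column "k", and the usual sqlContext from the shell; this is not the optimizer rule itself):

{code}
import org.apache.spark.sql.functions.col

val a = sqlContext.createDataFrame(Seq((Option(1), "x"), (Option.empty[Int], "y"))).toDF("k", "va")
val b = sqlContext.createDataFrame(Seq((Option(1), "z"), (Option.empty[Int], "w"))).toDF("k", "vb")

// Original inner join: rows with a null key can never satisfy a.k = b.k anyway.
val joined = a.join(b, a("k") === b("k"))

// Equivalent result with null keys filtered out first; both filters can be pushed
// down to the scans, and the join no longer has to consider null keys at all.
val filtered = a.filter(col("k").isNotNull)
  .join(b.filter(col("k").isNotNull), a("k") === b("k"))
{code}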






[jira] [Assigned] (SPARK-13249) Filter null keys for inner join

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13249:


Assignee: Apache Spark

> Filter null keys for inner join
> ---
>
> Key: SPARK-13249
> URL: https://issues.apache.org/jira/browse/SPARK-13249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> For an inner join, join keys containing null will never match each other, so we 
> could insert a Filter before the inner join (which could be pushed down); then we 
> don't need to check the nullability of keys while joining.






[jira] [Assigned] (SPARK-13249) Filter null keys for inner join

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13249:


Assignee: (was: Apache Spark)

> Filter null keys for inner join
> ---
>
> Key: SPARK-13249
> URL: https://issues.apache.org/jira/browse/SPARK-13249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> For an inner join, join keys containing null will never match each other, so we 
> could insert a Filter before the inner join (which could be pushed down); then we 
> don't need to check the nullability of keys while joining.






[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization

2016-02-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-13322:

Description: 
This bug is reported by Stuti Awasthi.
https://www.mail-archive.com/user@spark.apache.org/msg45643.html
The lossSum can become infinite because we do not standardize the features 
before fitting the model, so we should support feature standardization.
Another benefit is that standardization will improve the convergence rate.


  was:
This bug is reported by Stuti Awasthi.
https://www.mail-archive.com/user@spark.apache.org/msg45643.html
The lossSum has possibility of infinity because we do not standardize the 
feature before fitting model, we should support feature standardization.



> AFTSurvivalRegression should support feature standardization
> 
>
> Key: SPARK-13322
> URL: https://issues.apache.org/jira/browse/SPARK-13322
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> This bug is reported by Stuti Awasthi.
> https://www.mail-archive.com/user@spark.apache.org/msg45643.html
> The lossSum can become infinite because we do not standardize the features 
> before fitting the model, so we should support feature standardization.
> Another benefit is that standardization will improve the convergence rate.
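
For illustration only, a tiny standalone sketch of the kind of standardization meant here (plain Scala with made-up numbers, not the AFTSurvivalRegression internals):

{code}
// One raw feature vector with very differently scaled features, and the (assumed)
// per-feature standard deviations computed over the training data.
val features   = Array(1.2e6, 3.1e-4)
val featureStd = Array(2.0e6, 5.0e-4)

// Scaling each feature to unit standard deviation before fitting keeps the
// exponential terms in the AFT log-likelihood (and hence lossSum) from overflowing.
val standardized = features.zip(featureStd).map { case (x, s) => if (s != 0.0) x / s else 0.0 }

// After optimization, the learned coefficients would be divided by featureStd again
// to map them back to the original feature scale.
{code}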






[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization

2016-02-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-13322:

Description: 
This bug is reported by Stuti Awasthi.
https://www.mail-archive.com/user@spark.apache.org/msg45643.html
The lossSum can become infinite because we do not standardize the features 
before fitting the model, so we should support feature standardization.


  was:
This bug is reported by Stuti Awasthi.
https://www.mail-archive.com/user@spark.apache.org/msg45643.html
The lossSum has possibility of infinity because we do not standardize the 
feature before fitting model, we should handle this.



> AFTSurvivalRegression should support feature standardization
> 
>
> Key: SPARK-13322
> URL: https://issues.apache.org/jira/browse/SPARK-13322
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> This bug is reported by Stuti Awasthi.
> https://www.mail-archive.com/user@spark.apache.org/msg45643.html
> The lossSum can become infinite because we do not standardize the features 
> before fitting the model, so we should support feature standardization.






[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization

2016-02-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-13322:

Description: 
This bug is reported by Stuti Awasthi.
https://www.mail-archive.com/user@spark.apache.org/msg45643.html
The lossSum can become infinite because we do not standardize the features 
before fitting the model, so we should handle this.


  was:
This bug is reported by Stuti Awasthi.
https://www.mail-archive.com/user@spark.apache.org/msg45643.html
The lossSum has possibility of infinity, so we should handle it properly.



> AFTSurvivalRegression should support feature standardization
> 
>
> Key: SPARK-13322
> URL: https://issues.apache.org/jira/browse/SPARK-13322
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> This bug is reported by Stuti Awasthi.
> https://www.mail-archive.com/user@spark.apache.org/msg45643.html
> The lossSum can become infinite because we do not standardize the features 
> before fitting the model, so we should handle this.






[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization

2016-02-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-13322:

Summary: AFTSurvivalRegression should support feature standardization  
(was: AFTSurvivalRegression should handle lossSum infinity)

> AFTSurvivalRegression should support feature standardization
> 
>
> Key: SPARK-13322
> URL: https://issues.apache.org/jira/browse/SPARK-13322
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> This bug is reported by Stuti Awasthi.
> https://www.mail-archive.com/user@spark.apache.org/msg45643.html
> The lossSum can become infinite, so we should handle it properly.






[jira] [Created] (SPARK-13359) ArrayType(_, true) should also accept ArrayType(_, false) fix for branch-1.6

2016-02-16 Thread Earthson Lu (JIRA)
Earthson Lu created SPARK-13359:
---

 Summary: ArrayType(_, true) should also accept ArrayType(_, false) 
fix for branch-1.6
 Key: SPARK-13359
 URL: https://issues.apache.org/jira/browse/SPARK-13359
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.6.0
Reporter: Earthson Lu
Priority: Minor
 Fix For: 1.6.1


backport fix for https://issues.apache.org/jira/browse/SPARK-12746






[jira] [Commented] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts

2016-02-16 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149939#comment-15149939
 ] 

Saisai Shao commented on SPARK-13275:
-

Would you please clarify the specific problem you mentioned: is it a UI problem 
or a dynamic allocation problem?

> With dynamic allocation, executors appear to be added before job starts
> ---
>
> Key: SPARK-13275
> URL: https://issues.apache.org/jira/browse/SPARK-13275
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Stephanie Bodoff
>Priority: Minor
> Attachments: webui.png
>
>
> When I look at the timeline in the Spark Web UI I see the job starting and 
> then executors being added. The blue lines and dots hitting the timeline show 
> that the executors were added after the job started. But the way the Executor 
> box is rendered it looks like the executors started before the job. 






[jira] [Assigned] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13354:


Assignee: Apache Spark  (was: Davies Liu)

> Push filter throughout outer join when the condition can filter out empty row 
> --
>
> Key: SPARK-13354
> URL: https://issues.apache.org/jira/browse/SPARK-13354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> For a query
> {code}
> select * from a left outer join b on a.a = b.a where b.b > 10
> {code}
> The condition `b.b > 10` filters out every row whose b side is empty (i.e. 
> null-extended by the outer join).
> In this case, we can use an inner join and push the filter down into b.






[jira] [Commented] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149923#comment-15149923
 ] 

Apache Spark commented on SPARK-13354:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11234

> Push filter throughout outer join when the condition can filter out empty row 
> --
>
> Key: SPARK-13354
> URL: https://issues.apache.org/jira/browse/SPARK-13354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> For a query
> {code}
> select * from a left outer join b on a.a = b.a where b.b > 10
> {code}
> The condition `b.b > 10` filters out every row whose b side is empty (i.e. 
> null-extended by the outer join).
> In this case, we can use an inner join and push the filter down into b.






[jira] [Assigned] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13354:


Assignee: Davies Liu  (was: Apache Spark)

> Push filter throughout outer join when the condition can filter out empty row 
> --
>
> Key: SPARK-13354
> URL: https://issues.apache.org/jira/browse/SPARK-13354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> For a query
> {code}
> select * from a left outer join b on a.a = b.a where b.b > 10
> {code}
> The condition `b.b > 10` filters out every row whose b side is empty (i.e. 
> null-extended by the outer join).
> In this case, we can use an inner join and push the filter down into b.






[jira] [Assigned] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13333:


Assignee: (was: Apache Spark)

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails: randn produces the same results on the original DataFrame and on 
> the copy before unionAll, but no longer does so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.{col, randn}
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}






[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149850#comment-15149850
 ] 

Apache Spark commented on SPARK-13333:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11232

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails: randn produces the same results on the original DataFrame and on 
> the copy before unionAll, but no longer does so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.{col, randn}
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}






[jira] [Assigned] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13333:


Assignee: Apache Spark

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails: randn produces the same results on the original DataFrame and on 
> the copy before unionAll, but no longer does so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.{col, randn}
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}






[jira] [Assigned] (SPARK-13358) Retrieve grep path when doing Benchmark

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13358:


Assignee: (was: Apache Spark)

> Retrieve grep path when doing Benchmark
> ---
>
> Key: SPARK-13358
> URL: https://issues.apache.org/jira/browse/SPARK-13358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>







[jira] [Assigned] (SPARK-13358) Retrieve grep path when doing Benchmark

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13358:


Assignee: Apache Spark

> Retrieve grep path when doing Benchmark
> ---
>
> Key: SPARK-13358
> URL: https://issues.apache.org/jira/browse/SPARK-13358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Created] (SPARK-13358) Retrieve grep path when doing Benchmark

2016-02-16 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-13358:
---

 Summary: Retrieve grep path when doing Benchmark
 Key: SPARK-13358
 URL: https://issues.apache.org/jira/browse/SPARK-13358
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor









[jira] [Commented] (SPARK-13358) Retrieve grep path when doing Benchmark

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149837#comment-15149837
 ] 

Apache Spark commented on SPARK-13358:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11231

> Retrieve grep path when doing Benchmark
> ---
>
> Key: SPARK-13358
> URL: https://issues.apache.org/jira/browse/SPARK-13358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>







[jira] [Commented] (SPARK-12316) Stack overflow with endless call of `Delegation token thread` when application end.

2016-02-16 Thread SaintBacchus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149810#comment-15149810
 ] 

SaintBacchus commented on SPARK-12316:
--

[~tgraves] The listFilesSorted function will not throw the exception; it only 
logs it. So the next update is not scheduled an hour later: it is scheduled 
immediately, and then it goes into another loop.

> Stack overflow with endless call of `Delegation token thread` when 
> application end.
> ---
>
> Key: SPARK-12316
> URL: https://issues.apache.org/jira/browse/SPARK-12316
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Assignee: SaintBacchus
> Attachments: 20151210045149.jpg, 20151210045533.jpg
>
>
> When the application ends, the AM will clean the staging dir.
> But if the driver then triggers a delegation token update, it can't find the 
> right token file and endlessly calls the method 
> 'updateCredentialsIfRequired'.
> This leads to a StackOverflowError.
> !https://issues.apache.org/jira/secure/attachment/12779495/20151210045149.jpg!
> !https://issues.apache.org/jira/secure/attachment/12779496/20151210045533.jpg!
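
For illustration only, a heavily simplified sketch of the control flow being described (hypothetical code, not the actual AM delegation token renewer):

{code}
// If the failure to read the token file is only logged, the "schedule the next
// update an hour later" branch is never taken and the method re-enters itself
// immediately, which is what eventually overflows the stack.
def updateCredentialsIfRequired(): Unit = {
  val tokenFileFound = false   // staging dir was already cleaned, so it is never found
  if (tokenFileFound) {
    // normal path: schedule the next update roughly an hour later
  } else {
    // the error was only logged upstream, so we retry immediately with no delay
    updateCredentialsIfRequired()   // unbounded recursion -> StackOverflowError
  }
}
{code}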






[jira] [Assigned] (SPARK-13357) Use generated projection and ordering for TakeOrderedAndProjectNode

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13357:


Assignee: Apache Spark

> Use generated projection and ordering for TakeOrderedAndProjectNode
> ---
>
> Key: SPARK-13357
> URL: https://issues.apache.org/jira/browse/SPARK-13357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>
> {{TakeOrderedAndProjectNode}} should use generated projection and ordering 
> like other {{LocalNode}} s.






[jira] [Commented] (SPARK-13357) Use generated projection and ordering for TakeOrderedAndProjectNode

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149807#comment-15149807
 ] 

Apache Spark commented on SPARK-13357:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11230

> Use generated projection and ordering for TakeOrderedAndProjectNode
> ---
>
> Key: SPARK-13357
> URL: https://issues.apache.org/jira/browse/SPARK-13357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> {{TakeOrderedAndProjectNode}} should use generated projection and ordering 
> like other {{LocalNode}} s.






[jira] [Assigned] (SPARK-13357) Use generated projection and ordering for TakeOrderedAndProjectNode

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13357:


Assignee: (was: Apache Spark)

> Use generated projection and ordering for TakeOrderedAndProjectNode
> ---
>
> Key: SPARK-13357
> URL: https://issues.apache.org/jira/browse/SPARK-13357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> {{TakeOrderedAndProjectNode}} should use generated projection and ordering 
> like other {{LocalNode}} s.






[jira] [Created] (SPARK-13357) Use generated projection and ordering for TakeOrderedAndProjectNode

2016-02-16 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-13357:
-

 Summary: Use generated projection and ordering for 
TakeOrderedAndProjectNode
 Key: SPARK-13357
 URL: https://issues.apache.org/jira/browse/SPARK-13357
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


{{TakeOrderedAndProjectNode}} should use generated projection and ordering like 
other {{LocalNode}} s.






[jira] [Commented] (SPARK-13220) Deprecate "yarn-client" and "yarn-cluster"

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149795#comment-15149795
 ] 

Apache Spark commented on SPARK-13220:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/11229

> Deprecate "yarn-client" and "yarn-cluster"
> --
>
> Key: SPARK-13220
> URL: https://issues.apache.org/jira/browse/SPARK-13220
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Andrew Or
>Assignee: Saisai Shao
>
> We currently allow `\-\-master yarn-client`. Instead, the user should do 
> `\-\-master yarn \-\-deploy-mode client` to be more explicit. This is more 
> consistent with other cluster managers and obviates the need to do special 
> parsing of the master string.
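
For illustration, the same change expressed through SparkConf instead of the command line (a minimal sketch; spark.submit.deployMode is the standard config key, the rest is assumed):

{code}
import org.apache.spark.SparkConf

// Legacy form this issue deprecates: the deploy mode is encoded in the master string.
val legacy = new SparkConf().setMaster("yarn-client")

// Preferred, more explicit form, matching `--master yarn --deploy-mode client`.
val explicit = new SparkConf()
  .setMaster("yarn")
  .set("spark.submit.deployMode", "client")
{code}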






[jira] [Assigned] (SPARK-13220) Deprecate "yarn-client" and "yarn-cluster"

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13220:


Assignee: Apache Spark  (was: Saisai Shao)

> Deprecate "yarn-client" and "yarn-cluster"
> --
>
> Key: SPARK-13220
> URL: https://issues.apache.org/jira/browse/SPARK-13220
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> We currently allow `\-\-master yarn-client`. Instead, the user should do 
> `\-\-master yarn \-\-deploy-mode client` to be more explicit. This is more 
> consistent with other cluster managers and obviates the need to do special 
> parsing of the master string.






[jira] [Assigned] (SPARK-13220) Deprecate "yarn-client" and "yarn-cluster"

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13220:


Assignee: Saisai Shao  (was: Apache Spark)

> Deprecate "yarn-client" and "yarn-cluster"
> --
>
> Key: SPARK-13220
> URL: https://issues.apache.org/jira/browse/SPARK-13220
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Andrew Or
>Assignee: Saisai Shao
>
> We currently allow `\-\-master yarn-client`. Instead, the user should do 
> `\-\-master yarn \-\-deploy-mode client` to be more explicit. This is more 
> consistent with other cluster managers and obviates the need to do special 
> parsing of the master string.






[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2016-02-16 Thread Sateesh Babu G (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149794#comment-15149794
 ] 

Sateesh Babu G commented on SPARK-9273:
---

Hi Alexander,

Thank you very much for your help!

Can I use any one of the mentioned implementations for CNN regression? Which 
one is more suitable? 

I also found that deeplearning4j.org has a CNN implementation on Spark, but with 
only one convolutional and one pooling layer. Do you suggest deeplearning4j.org's 
CNN implementation on Spark?

Best,
Sateesh

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib






[jira] [Updated] (SPARK-11627) Spark Streaming backpressure mechanism has no initial input rate limit, receivers receive data at the maximum speed, it might cause OOM exception

2016-02-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-11627:
-
Affects Version/s: 1.6.0

> Spark Streaming backpressure mechanism has no initial input rate 
> limit, receivers receive data at the maximum speed, it might cause OOM 
> exception
> --
>
> Key: SPARK-11627
> URL: https://issues.apache.org/jira/browse/SPARK-11627
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1, 1.6.0
>Reporter: junhaoMg
>Assignee: junhaoMg
> Fix For: 2.0.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Spark Streaming's backpressure mechanism has no initial input rate limit, so 
> receivers receive data at the maximum speed they can reach in the first 
> batch; the data received might exhaust executor memory resources and cause an 
> out-of-memory exception. Eventually the streaming job fails and the 
> backpressure mechanism becomes ineffective.
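
For illustration, a minimal sketch of capping the first batch's ingestion rate (assuming the fix exposes an initial-rate setting; spark.streaming.backpressure.initialRate is the config key documented for this purpose, and the value below is made up):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  // Cap the very first batch so receivers cannot flood executor memory before the
  // backpressure controller has any processing-rate feedback to work with.
  .set("spark.streaming.backpressure.initialRate", "1000")   // records/sec per receiver (assumed)
{code}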






[jira] [Resolved] (SPARK-11627) Spark Streaming backpressure mechanism has no initial input rate limit, receivers receive data at the maximum speed, it might cause OOM exception

2016-02-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-11627.
--
   Resolution: Fixed
 Assignee: junhaoMg
Fix Version/s: 2.0.0

> Spark Streaming backpressure mechanism has no initial input rate 
> limit, receivers receive data at the maximum speed, it might cause OOM 
> exception
> --
>
> Key: SPARK-11627
> URL: https://issues.apache.org/jira/browse/SPARK-11627
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1, 1.6.0
>Reporter: junhaoMg
>Assignee: junhaoMg
> Fix For: 2.0.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Spark Streaming's backpressure mechanism has no initial input rate limit, so 
> receivers receive data at the maximum speed they can reach in the first 
> batch; the data received might exhaust executor memory resources and cause an 
> out-of-memory exception. Eventually the streaming job fails and the 
> backpressure mechanism becomes ineffective.






[jira] [Assigned] (SPARK-13356) WebUI missing input information when recovering from driver failure

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13356:


Assignee: Apache Spark

> WebUI missing input information when recovering from driver failure
> 
>
> Key: SPARK-13356
> URL: https://issues.apache.org/jira/browse/SPARK-13356
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: jeanlyn
>Assignee: Apache Spark
> Attachments: DirectKafkaScreenshot.jpg
>
>
> The WebUI is missing some input information when streaming recovers from a 
> checkpoint; it may mislead people into thinking data was lost during recovery from the failure.
> For example:
> !DirectKafkaScreenshot.jpg!






[jira] [Assigned] (SPARK-13356) WebUI missing input information when recovering from driver failure

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13356:


Assignee: (was: Apache Spark)

> WebUI missing input information when recovering from driver failure
> 
>
> Key: SPARK-13356
> URL: https://issues.apache.org/jira/browse/SPARK-13356
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: jeanlyn
> Attachments: DirectKafkaScreenshot.jpg
>
>
> The WebUI is missing some input information when streaming recovers from a 
> checkpoint; it may mislead people into thinking data was lost during recovery from the failure.
> For example:
> !DirectKafkaScreenshot.jpg!






[jira] [Commented] (SPARK-13356) WebUI missing input information when recovering from driver failure

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149727#comment-15149727
 ] 

Apache Spark commented on SPARK-13356:
--

User 'jeanlyn' has created a pull request for this issue:
https://github.com/apache/spark/pull/11228

> WebUI missing input information when recovering from driver failure
> 
>
> Key: SPARK-13356
> URL: https://issues.apache.org/jira/browse/SPARK-13356
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: jeanlyn
> Attachments: DirectKafkaScreenshot.jpg
>
>
> The WebUI is missing some input information when streaming recovers from a 
> checkpoint; it may mislead people into thinking data was lost during recovery from the failure.
> For example:
> !DirectKafkaScreenshot.jpg!






[jira] [Created] (SPARK-13356) WebUI missing input information when recovering from driver failure

2016-02-16 Thread jeanlyn (JIRA)
jeanlyn created SPARK-13356:
---

 Summary: WebUI missing input information when recovering from 
driver failure
 Key: SPARK-13356
 URL: https://issues.apache.org/jira/browse/SPARK-13356
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.0, 1.5.2, 1.5.1, 1.5.0
Reporter: jeanlyn


The WebUI is missing some input information when streaming recovers from a 
checkpoint; it may mislead people into thinking data was lost during recovery from the failure.
For example:
!DirectKafkaScreenshot.jpg!






[jira] [Updated] (SPARK-13356) WebUI missing input information when recovering from driver failure

2016-02-16 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-13356:

Attachment: DirectKafkaScreenshot.jpg

> WebUI missing input information when recovering from driver failure
> 
>
> Key: SPARK-13356
> URL: https://issues.apache.org/jira/browse/SPARK-13356
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: jeanlyn
> Attachments: DirectKafkaScreenshot.jpg
>
>
> The WebUI is missing some input information when streaming recovers from a 
> checkpoint; it may mislead people into thinking data was lost during recovery from the failure.
> For example:
> !DirectKafkaScreenshot.jpg!






[jira] [Updated] (SPARK-13349) adding a split and union to a streaming application causes a big performance hit

2016-02-16 Thread krishna ramachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

krishna ramachandran updated SPARK-13349:
-
Description: 
We have a streaming application containing approximately 12 jobs every batch, 
running in streaming mode (4 sec batches). Each job writes output to cassandra

each job can contain several stages.

job 1

---> receive Stream A --> map --> filter -> (union with another stream B) --> 
map --> groupbykey --> transform --> reducebykey --> map

we go through a few more jobs of transforms and save to the database. 

Around stage 5, we union the output of Dstream from job 1 (in red) with another 
stream (generated by split during job 2) and save that state

It appears the whole execution thus far is repeated, which is redundant (I can 
see this in the execution graph and also in performance -> processing time). Processing 
time per batch nearly doubles or triples.

This additional and redundant processing causes each batch to run as much as 2.5 
times slower compared to runs without the union, even though the union for most batches 
does not alter the original DStream (union with an empty set). If I cache the 
DStream from job 1 (red block output), performance improves substantially, but we 
hit out-of-memory errors within a few hours.

What is the recommended way to cache/unpersist in such a scenario? There is no 
DStream-level "unpersist".

Setting "spark.streaming.unpersist" to true and 
streamingContext.remember("duration") did not help. Still seeing out-of-memory 
errors.

  was:
We have a streaming application containing approximately 12 stages every batch, 
running in streaming mode (4 sec batches). Each stage persists output to 
cassandra

the pipeline stages 
stage 1

---> receive Stream A --> map --> filter -> (union with another stream B) --> 
map --> groupbykey --> transform --> reducebykey --> map

we go thro' few more stages of transforms and save to database. 

Around stage 5, we union the output of Dstream from stage 1 (in red) with 
another stream (generated by split during stage 2) and save that state

It appears the whole execution thus far is repeated which is redundant (I can 
see this in execution graph & also performance -> processing time). Processing 
time per batch nearly doubles or triples.

This additional & redundant processing cause each batch to run as much as 2.5 
times slower compared to runs without the union - union for most batches does 
not alter the original DStream (union with an empty set). If I cache the 
DStream (red block output), performance improves substantially but hit out of 
memory errors within few hours.

What is the recommended way to cache/unpersist in such a scenario? there is no 
dstream level "unpersist"

setting "spark.streaming.unpersist" to true and 
streamingContext.remember("duration") did not help. Still seeing out of memory 
errors


> adding a split and union to a streaming application causes a big performance hit
> -
>
> Key: SPARK-13349
> URL: https://issues.apache.org/jira/browse/SPARK-13349
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.4.1
>Reporter: krishna ramachandran
>Priority: Critical
> Fix For: 1.4.2
>
>
> We have a streaming application containing approximately 12 jobs every batch, 
> running in streaming mode (4 sec batches). Each job writes output to cassandra
> each job can contain several stages.
> job 1
> ---> receive Stream A --> map --> filter -> (union with another stream B) --> 
> map --> groupbykey --> transform --> reducebykey --> map
> we go through a few more jobs of transforms and save to the database. 
> Around stage 5, we union the output of Dstream from job 1 (in red) with 
> another stream (generated by split during job 2) and save that state
> It appears the whole execution thus far is repeated, which is redundant (I can 
> see this in the execution graph and also in performance -> processing time). 
> Processing time per batch nearly doubles or triples.
> This additional and redundant processing causes each batch to run as much as 2.5 
> times slower compared to runs without the union, even though the union for most 
> batches does not alter the original DStream (union with an empty set). If I cache 
> the DStream from job 1 (red block output), performance improves substantially, but 
> we hit out-of-memory errors within a few hours.
> What is the recommended way to cache/unpersist in such a scenario? There is 
> no DStream-level "unpersist".
> Setting "spark.streaming.unpersist" to true and 
> streamingContext.remember("duration") did not help. Still seeing out-of-memory 
> errors.




[jira] [Assigned] (SPARK-13355) Replace GraphImpl.fromExistingRDDs by Graph

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13355:


Assignee: Xiangrui Meng  (was: Apache Spark)

> Replace GraphImpl.fromExistingRDDs by Graph
> ---
>
> Key: SPARK-13355
> URL: https://issues.apache.org/jira/browse/SPARK-13355
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> `GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We 
> call it in LDA without validating this requirement. So it might introduce 
> errors. Replacing it with `Graph.apply` would be safer and more appropriate 
> because it is a public API. 






[jira] [Commented] (SPARK-13355) Replace GraphImpl.fromExistingRDDs by Graph

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149685#comment-15149685
 ] 

Apache Spark commented on SPARK-13355:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/11226

> Replace GraphImpl.fromExistingRDDs by Graph
> ---
>
> Key: SPARK-13355
> URL: https://issues.apache.org/jira/browse/SPARK-13355
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> `GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We 
> call it in LDA without validating this requirement. So it might introduce 
> errors. Replacing it with `Graph.apply` would be safer and more appropriate 
> because it is a public API. 






[jira] [Assigned] (SPARK-13355) Replace GraphImpl.fromExistingRDDs by Graph

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13355:


Assignee: Apache Spark  (was: Xiangrui Meng)

> Replace GraphImpl.fromExistingRDDs by Graph
> ---
>
> Key: SPARK-13355
> URL: https://issues.apache.org/jira/browse/SPARK-13355
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> `GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We 
> call it in LDA without validating this requirement. So it might introduce 
> errors. Replacing it with `Graph.apply` would be safer and more appropriate 
> because it is a public API. 






[jira] [Created] (SPARK-13355) Replace GraphImpl.fromExistingRDDs by Graph

2016-02-16 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-13355:
-

 Summary: Replace GraphImpl.fromExistingRDDs by Graph
 Key: SPARK-13355
 URL: https://issues.apache.org/jira/browse/SPARK-13355
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 1.6.0, 1.5.2, 1.4.1, 1.3.1, 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


`GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We call 
it in LDA without validating this requirement. So it might introduce errors. 
Replacing it with `Graph.apply` would be safer and more appropriate because it is a 
public API. 
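
For illustration, a minimal sketch of the suggested replacement (assuming an existing SparkContext sc; the toy vertices and edges are made up):

{code}
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, 1.0), (2L, 2.0)))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 0.5)))

// GraphImpl.fromExistingRDDs expects an already-preprocessed VertexRDD, which LDA
// does not guarantee; Graph.apply is the public constructor and does that
// preprocessing itself, so it is the safer call site.
val graph = Graph(vertices, edges)
{code}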






[jira] [Commented] (SPARK-12776) Implement Python API for Datasets

2016-02-16 Thread Gustavo Salazar Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149621#comment-15149621
 ] 

Gustavo Salazar Torres commented on SPARK-12776:


I will work on some code following what was done in Dataset.scala.

> Implement Python API for Datasets
> -
>
> Key: SPARK-12776
> URL: https://issues.apache.org/jira/browse/SPARK-12776
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Kevin Cox
>Priority: Minor
>
> Now that the Dataset API is in Scala and Java it would be awesome to see it 
> show up in PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13354:
--

 Summary: Push filter throughout outer join when the condition can 
filter out empty row 
 Key: SPARK-13354
 URL: https://issues.apache.org/jira/browse/SPARK-13354
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu



For a query

{code}

select * from a left outer join b on a.a = b.a where b.b > 10

{code}

The condition `b.b > 10` will filter out all rows whose b side is empty 
(all-null).

In this case, we should use an inner join and push the filter down into b.
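
As an illustration of the intended rewrite (table names are hypothetical), the outer join can be turned into an inner join with the predicate pushed into b, because a row whose b side is all-null can never satisfy `b.b > 10`:

{code}
// Sketch only; assumes tables `a` and `b` are registered.
val original  = sqlContext.sql(
  "select * from a left outer join b on a.a = b.a where b.b > 10")
val rewritten = sqlContext.sql(
  "select * from a join (select * from b where b.b > 10) b2 on a.a = b2.a")
{code}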



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13353) Use UnsafeRowSerializer to collect DataFrame

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13353:
--

 Summary: Use UnsafeRowSerializer to collect DataFrame
 Key: SPARK-13353
 URL: https://issues.apache.org/jira/browse/SPARK-13353
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


UnsafeRowSerializer should be more efficient than JavaSerializer or 
KryoSerializer for collecting DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13351) Column pruning fails on expand

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13351:


Assignee: Apache Spark  (was: Davies Liu)

> Column pruning fails on expand
> --
>
> Key: SPARK-13351
> URL: https://issues.apache.org/jira/browse/SPARK-13351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> The optimizer can't prune the columns in Expand.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13351) Column pruning fails on expand

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149571#comment-15149571
 ] 

Apache Spark commented on SPARK-13351:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11225

> Column pruning fails on expand
> --
>
> Key: SPARK-13351
> URL: https://issues.apache.org/jira/browse/SPARK-13351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> The optimizer can't prune the columns in Expand.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13351) Column pruning fails on expand

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13351:


Assignee: Davies Liu  (was: Apache Spark)

> Column pruning fails on expand
> --
>
> Key: SPARK-13351
> URL: https://issues.apache.org/jira/browse/SPARK-13351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> The optimizer can't prune the columns in Expand.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13352) BlockFetch does not scale well on large block

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13352:
--

 Summary: BlockFetch does not scale well on large block
 Key: SPARK-13352
 URL: https://issues.apache.org/jira/browse/SPARK-13352
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu


BlockManager.getRemoteBytes() performs poorly on large blocks:

{code}
  test("block manager") {
val N = 500 << 20
val bm = sc.env.blockManager
val blockId = TaskResultBlockId(0)
val buffer = ByteBuffer.allocate(N)
buffer.limit(N)
bm.putBytes(blockId, buffer, StorageLevel.MEMORY_AND_DISK_SER)
val result = bm.getRemoteBytes(blockId)
assert(result.isDefined)
assert(result.get.limit() === (N))
  }
{code}

Here are the runtimes for different block sizes:
{code}
50M     3 seconds
100M    7 seconds
250M   33 seconds
500M    2 min
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13351) Column pruning fails on expand

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13351:
--

 Summary: Column pruning fails on expand
 Key: SPARK-13351
 URL: https://issues.apache.org/jira/browse/SPARK-13351
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


The optimizer can't prune the columns in Expand.
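
A hypothetical repro sketch (column names made up): rollup/cube introduce an Expand node into the plan, and the report is that columns unused by the aggregation are not pruned below it:

{code}
val df = sqlContext.range(100).selectExpr("id as a", "id % 10 as b", "id % 3 as c")
// Inspect the physical plan to see whether the unused column `c` still appears below the Expand.
df.cube("a", "b").count().explain()
{code}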



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12776) Implement Python API for Datasets

2016-02-16 Thread Gustavo Salazar Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149525#comment-15149525
 ] 

Gustavo Salazar Torres commented on SPARK-12776:


I can work on this, any pointers?

> Implement Python API for Datasets
> -
>
> Key: SPARK-12776
> URL: https://issues.apache.org/jira/browse/SPARK-12776
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Kevin Cox
>Priority: Minor
>
> Now that the Dataset API is in Scala and Java it would be awesome to see it 
> show up in PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13349) adding a split and union to a streaming application cause big performance hit

2016-02-16 Thread krishna ramachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

krishna ramachandran updated SPARK-13349:
-
Summary: adding a split and union to a streaming application cause big 
performance hit  (was: adding a split and union to a streaming application 
causes big performance hit)

> adding a split and union to a streaming application cause big performance hit
> -
>
> Key: SPARK-13349
> URL: https://issues.apache.org/jira/browse/SPARK-13349
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.4.1
>Reporter: krishna ramachandran
>Priority: Critical
> Fix For: 1.4.2
>
>
> We have a streaming application containing approximately 12 stages every 
> batch, running in streaming mode (4 sec batches). Each stage persists output 
> to Cassandra.
> The pipeline stages:
> stage 1
> ---> receive Stream A --> map --> filter -> (union with another stream B) --> 
> map --> groupbykey --> transform --> reducebykey --> map
> We go through a few more stages of transforms and save to the database. 
> Around stage 5, we union the output DStream from stage 1 (in red) with 
> another stream (generated by a split during stage 2) and save that state.
> It appears the whole execution thus far is repeated, which is redundant (I can 
> see this in the execution graph and in performance -> processing time). 
> Processing time per batch nearly doubles or triples.
> This additional, redundant processing causes each batch to run as much as 2.5 
> times slower than runs without the union - the union for most batches does 
> not alter the original DStream (union with an empty set). If I cache the 
> DStream (red block output), performance improves substantially, but we hit 
> out-of-memory errors within a few hours.
> What is the recommended way to cache/unpersist in such a scenario? There is 
> no DStream-level "unpersist".
> Setting "spark.streaming.unpersist" to true and 
> streamingContext.remember("duration") did not help. Still seeing out-of-memory 
> errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13349) adding a split and union to a streaming application causes big performance hit

2016-02-16 Thread krishna ramachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

krishna ramachandran updated SPARK-13349:
-
Summary: adding a split and union to a streaming application causes big 
performance hit  (was: enabling cache causes out of memory error. Caching 
DStream helps reduce processing time in a streaming application but get out of 
memory errors)

> adding a split and union to a streaming application causes big performance hit
> --
>
> Key: SPARK-13349
> URL: https://issues.apache.org/jira/browse/SPARK-13349
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.4.1
>Reporter: krishna ramachandran
>Priority: Critical
> Fix For: 1.4.2
>
>
> We have a streaming application containing approximately 12 stages every 
> batch, running in streaming mode (4 sec batches). Each stage persists output 
> to Cassandra.
> The pipeline stages:
> stage 1
> ---> receive Stream A --> map --> filter -> (union with another stream B) --> 
> map --> groupbykey --> transform --> reducebykey --> map
> We go through a few more stages of transforms and save to the database. 
> Around stage 5, we union the output DStream from stage 1 (in red) with 
> another stream (generated by a split during stage 2) and save that state.
> It appears the whole execution thus far is repeated, which is redundant (I can 
> see this in the execution graph and in performance -> processing time). 
> Processing time per batch nearly doubles or triples.
> This additional, redundant processing causes each batch to run as much as 2.5 
> times slower than runs without the union - the union for most batches does 
> not alter the original DStream (union with an empty set). If I cache the 
> DStream (red block output), performance improves substantially, but we hit 
> out-of-memory errors within a few hours.
> What is the recommended way to cache/unpersist in such a scenario? There is 
> no DStream-level "unpersist".
> Setting "spark.streaming.unpersist" to true and 
> streamingContext.remember("duration") did not help. Still seeing out-of-memory 
> errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13350) Configuration documentation incorrectly states that PYSPARK_PYTHON's default is "python"

2016-02-16 Thread Christopher Aycock (JIRA)
Christopher Aycock created SPARK-13350:
--

 Summary: Configuration documentation incorrectly states that 
PYSPARK_PYTHON's default is "python"
 Key: SPARK-13350
 URL: https://issues.apache.org/jira/browse/SPARK-13350
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Christopher Aycock
Priority: Trivial


The configuration documentation states that the environment variable 
PYSPARK_PYTHON has a default value of {{python}}:

http://spark.apache.org/docs/latest/configuration.html

In fact, the default is {{python2.7}}:

https://github.com/apache/spark/blob/4f60651cbec1b4c9cc2e6d832ace77e89a233f3a/bin/pyspark#L39-L45

The change that introduced this was discussed here:

https://github.com/apache/spark/pull/2651

Would it be possible to highlight this in the documentation?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149507#comment-15149507
 ] 

Xiao Li commented on SPARK-1:
-

Will try to submit a PR tonight. When users specify a seed, I can't find a 
reason why we need to add the partition ID to the seed value.
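
A small illustration (plain java.util.Random here, standing in for Spark's internal generator) of why adding the partition ID to a user-specified seed makes the generated column depend on physical partitioning:

{code}
val seed = 12345L
// Logically identical rows that land in different partitions get different values:
val inPartition0 = new java.util.Random(seed + 0).nextGaussian()
val inPartition1 = new java.util.Random(seed + 1).nextGaussian()
assert(inPartition0 != inPartition1)
{code}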

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13349) enabling cache causes out of memory error. Caching DStream helps reduce processing time in a streaming application but get out of memory errors

2016-02-16 Thread krishna ramachandran (JIRA)
krishna ramachandran created SPARK-13349:


 Summary: enabling cache causes out of memory error. Caching 
DStream helps reduce processing time in a streaming application but get out of 
memory errors
 Key: SPARK-13349
 URL: https://issues.apache.org/jira/browse/SPARK-13349
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.4.1
Reporter: krishna ramachandran
Priority: Critical
 Fix For: 1.4.2


We have a streaming application containing approximately 12 stages every batch, 
running in streaming mode (4 sec batches). Each stage persists output to 
Cassandra.

The pipeline stages:
stage 1

---> receive Stream A --> map --> filter -> (union with another stream B) --> 
map --> groupbykey --> transform --> reducebykey --> map

We go through a few more stages of transforms and save to the database. 

Around stage 5, we union the output DStream from stage 1 (in red) with 
another stream (generated by a split during stage 2) and save that state.

It appears the whole execution thus far is repeated, which is redundant (I can 
see this in the execution graph and in performance -> processing time). Processing 
time per batch nearly doubles or triples.

This additional, redundant processing causes each batch to run as much as 2.5 
times slower than runs without the union - the union for most batches does not 
alter the original DStream (union with an empty set). If I cache the DStream 
(red block output), performance improves substantially, but we hit out-of-memory 
errors within a few hours.

What is the recommended way to cache/unpersist in such a scenario? There is no 
DStream-level "unpersist".

Setting "spark.streaming.unpersist" to true and 
streamingContext.remember("duration") did not help. Still seeing out-of-memory 
errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13298) DAG visualization does not render correctly for jobs

2016-02-16 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149457#comment-15149457
 ] 

Shixiong Zhu commented on SPARK-13298:
--

Do you have a reproducer?

> DAG visualization does not render correctly for jobs
> 
>
> Key: SPARK-13298
> URL: https://issues.apache.org/jira/browse/SPARK-13298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Lucas Woltmann
> Attachments: dag_full.png, dag_viz.png
>
>
> Whenever I try to open the DAG for a job, I get something like this:
> !dag_viz.png!
> Obviously the svg doesn't get resized, but if I resize it manually, only the 
> first of four stages in the DAG is shown. 
> The js console says (variable v is null in peg$c34):
> {code:javascript}
> Uncaught TypeError: Cannot read property '3' of null
>   peg$c34 @ graphlib-dot.min.js:1
>   peg$parseidDef @ graphlib-dot.min.js:1
>   peg$parseaList @ graphlib-dot.min.js:1
>   peg$parseattrListBlock @ graphlib-dot.min.js:1
>   peg$parseattrList @ graphlib-dot.min.js:1
>   peg$parsenodeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsesubgraphStmt @ graphlib-dot.min.js:1
>   peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1
>   peg$parseedgeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsesubgraphStmt @ graphlib-dot.min.js:1
>   peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1
>   peg$parseedgeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsegraphStmt @ graphlib-dot.min.js:1
>   parse @ graphlib-dot.min.js:2
>   readOne @ graphlib-dot.min.js:2
>   renderDot @ spark-dag-viz.js:281
>   (anonymous function) @ spark-dag-viz.js:248
>   (anonymous function) @ d3.min.js:
>   3Y @ d3.min.js:1
>   _a.each @ d3.min.js:3
>   renderDagVizForJob @ spark-dag-viz.js:207
>   renderDagViz @ spark-dag-viz.js:163
>   toggleDagViz @ spark-dag-viz.js:100
>   onclick @ ?id=2:153
> {code}
> (tested in FIrefox 44.0.1 and Chromium 48.0.2564.103)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10759) Missing Python code example in ML Programming guide

2016-02-16 Thread Jeremy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy updated SPARK-10759:
---
Comment: was deleted

(was: Cannot add example for code that doesn't exist.)

> Missing Python code example in ML Programming guide
> ---
>
> Key: SPARK-10759
> URL: https://issues.apache.org/jira/browse/SPARK-10759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Raela Wang
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13346) Using DataFrames iteratively leads to massive query plans, which slows execution

2016-02-16 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13346:
--
Summary: Using DataFrames iteratively leads to massive query plans, which 
slows execution  (was: DataFrame caching is not handled well during planning or 
execution)

> Using DataFrames iteratively leads to massive query plans, which slows 
> execution
> 
>
> Key: SPARK-13346
> URL: https://issues.apache.org/jira/browse/SPARK-13346
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>
> I have an iterative algorithm based on DataFrames, and the query plan grows 
> very quickly with each iteration.  Caching the current DataFrame at the end 
> of an iteration does not fix the problem.  However, converting the DataFrame 
> to an RDD and back at the end of each iteration does fix the problem.
> Printing the query plans shows that the plan explodes quickly (10 lines, to 
> several hundred lines, to several thousand lines, ...) with successive 
> iterations.
> The desired behavior is for the analyzer to recognize that a big chunk of the 
> query plan does not need to be computed since it is already cached.  The 
> computation on each iteration should be the same.
> If useful, I can push (complex) code to reproduce the issue.  But it should 
> be simple to see if you create an iterative algorithm which produces a new 
> DataFrame from an old one on each iteration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13283:


Assignee: Apache Spark

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I'm saving it to a MySQL database I get this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table.
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13283:


Assignee: (was: Apache Spark)

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I'm saving it to a MySQL database I get this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table.
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149396#comment-15149396
 ] 

Apache Spark commented on SPARK-13283:
--

User 'xguo27' has created a pull request for this issue:
https://github.com/apache/spark/pull/11224

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I'm saving it to a MySQL database I get this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table.
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2016-02-16 Thread Xiu (Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149387#comment-15149387
 ] 

Xiu (Joe) Guo commented on SPARK-13283:
---

Yes, it is a different problem from 
[SPARK-13297|https://issues.apache.org/jira/browse/SPARK-13297]. We should 
escape the column name based on JdbcDialect.
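
As a minimal illustration of the idea (MySQL-style backtick quoting assumed; names are made up), quoting each identifier when building the CREATE TABLE column list keeps reserved words such as `from` valid:

{code}
def quoteIdentifier(name: String): String = s"`$name`"

val columns = Seq("from" -> "DECIMAL(20,0)", "amount" -> "DECIMAL(20,0)")
val columnList = columns
  .map { case (name, jdbcType) => s"${quoteIdentifier(name)} $jdbcType" }
  .mkString(", ")
// columnList == "`from` DECIMAL(20,0), `amount` DECIMAL(20,0)"
{code}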

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I'm saving it to a MySQL database I get this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table.
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code

2016-02-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149349#comment-15149349
 ] 

Felix Cheung commented on SPARK-12846:
--

The changes to fix Jenkins were in PR https://github.com/apache/spark/pull/10792

> Follow up SPARK-12707, Update documentation and other related code
> --
>
> Key: SPARK-12846
> URL: https://issues.apache.org/jira/browse/SPARK-12846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Jeff Zhang
>
> Add the background context mail thread 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code

2016-02-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149343#comment-15149343
 ] 

Felix Cheung commented on SPARK-12846:
--

actually I was referring to how Jenkins/tests were broken by 
https://github.com/apache/spark/pull/10658
not the documentation...

> Follow up SPARK-12707, Update documentation and other related code
> --
>
> Key: SPARK-12846
> URL: https://issues.apache.org/jira/browse/SPARK-12846
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Jeff Zhang
>
> Add the background context mail thread 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13348) Avoid duplicated broadcasts

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13348:
--

 Summary: Avoid duplicated broadcasts
 Key: SPARK-13348
 URL: https://issues.apache.org/jira/browse/SPARK-13348
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


A broadcast table could be used multiple times in a query, so we should cache 
it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13347) Reuse the shuffle for duplicated exchange

2016-02-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13347:
--

 Summary: Reuse the shuffle for duplicated exchange
 Key: SPARK-13347
 URL: https://issues.apache.org/jira/browse/SPARK-13347
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


In TPCDS query 47, the same exchange is used three times; we should re-use the 
ShuffleRowRDD to skip the duplicated stages.

{code}

 with v1 as(
 select i_category, i_brand,
s_store_name, s_company_name,
d_year, d_moy,
sum(ss_sales_price) sum_sales,
avg(sum(ss_sales_price)) over
  (partition by i_category, i_brand,
 s_store_name, s_company_name, d_year)
  avg_monthly_sales,
rank() over
  (partition by i_category, i_brand,
 s_store_name, s_company_name
   order by d_year, d_moy) rn
 from item, store_sales, date_dim, store
 where ss_item_sk = i_item_sk and
   ss_sold_date_sk = d_date_sk and
   ss_store_sk = s_store_sk and
   (
 d_year = 1999 or
 ( d_year = 1999-1 and d_moy =12) or
 ( d_year = 1999+1 and d_moy =1)
   )
 group by i_category, i_brand,
  s_store_name, s_company_name,
  d_year, d_moy),
 v2 as(
 select v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name, 
v1.d_year,
 v1.d_moy, v1.avg_monthly_sales ,v1.sum_sales, 
v1_lag.sum_sales psum,
 v1_lead.sum_sales nsum
 from v1, v1 v1_lag, v1 v1_lead
 where v1.i_category = v1_lag.i_category and
   v1.i_category = v1_lead.i_category and
   v1.i_brand = v1_lag.i_brand and
   v1.i_brand = v1_lead.i_brand and
   v1.s_store_name = v1_lag.s_store_name and
   v1.s_store_name = v1_lead.s_store_name and
   v1.s_company_name = v1_lag.s_company_name and
   v1.s_company_name = v1_lead.s_company_name and
   v1.rn = v1_lag.rn + 1 and
   v1.rn = v1_lead.rn - 1)
 select * from v2
 where  d_year = 1999 and
avg_monthly_sales > 0 and
case when avg_monthly_sales > 0 then abs(sum_sales - avg_monthly_sales) 
/ avg_monthly_sales else null end > 0.1
 order by sum_sales - avg_monthly_sales, 3
 limit 100
{code}

Since the SparkPlan is just a tree (not a DAG), we can only do this in 
SparkPlan.execute() or in a final rule.

We also need a way to compare two SparkPlans and decide whether they produce the 
same result (they may have different exprIds, so we should compare them after 
binding).

A quick experiment showed that we could get a 2X improvement on this query.
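
A generic sketch of the reuse idea (all names are placeholders, not Spark internals): key a cache by a canonical form of the exchange so that logically identical exchanges share one shuffle result:

{code}
import scala.collection.mutable

case class CanonicalExchange(partitioning: String, childPlanDigest: String)

val reusedShuffles = mutable.Map.empty[CanonicalExchange, String]

def executeExchange(e: CanonicalExchange): String =
  reusedShuffles.getOrElseUpdate(e, {
    // In the real planner this would run the shuffle; here we just tag the output.
    s"shuffle-output-${e.hashCode}"
  })
{code}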



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2016-02-16 Thread Henry Saputra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149309#comment-15149309
 ] 

Henry Saputra edited comment on SPARK-5158 at 2/16/16 9:14 PM:
---

Hi all, it seems all PRs for this issue are closed.

This PR: 
https://github.com/apache/spark/pull/265 

was closed claiming that a more recent PR was being worked on, which I assume 
is this one:

https://github.com/apache/spark/pull/4106

but that one was also closed due to inactivity.

Looking at the issues that were closed as duplicates of this one, there is 
a need and interest in letting standalone mode access secured HDFS, given that the 
active user's keytab is already available on the machines that run Spark.


was (Author: hsaputra):
All, the PRs for this issue are closed.

This PR: 
https://github.com/apache/spark/pull/265 

was closed claiming that a more recent PR was being worked on, which I assume 
is this one:

https://github.com/apache/spark/pull/4106

but that one was also closed due to inactivity.

Looking at the issues that were closed as duplicates of this one, there is 
a need and interest in letting standalone mode access secured HDFS, given that the 
active user's keytab is already available on the machines that run Spark.

> Allow for keytab-based HDFS security in Standalone mode
> ---
>
> Key: SPARK-5158
> URL: https://issues.apache.org/jira/browse/SPARK-5158
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Matthew Cheah
>Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS 
> clusters in standalone mode. The main reason we haven't accepted these 
> patches has been that they rely on insecure distribution of token files from 
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the 
> Spark driver and executors independently log in and acquire credentials using 
> a keytab. This would work for users who have dedicated, single-tenant 
> Spark clusters (i.e. they are willing to have a keytab on every machine 
> running Spark for their application). It wouldn't address all possible 
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated 
> hardware since they are long-running services.
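
For illustration only (the principal and keytab path are placeholders), the proposal amounts to each driver and executor process logging in independently through the standard Hadoop API:

{code}
import org.apache.hadoop.security.UserGroupInformation

// Each process authenticates on its own from a locally available keytab,
// instead of the driver distributing delegation tokens.
UserGroupInformation.loginUserFromKeytab(
  "spark-app@EXAMPLE.COM", "/etc/security/keytabs/spark-app.keytab")
{code}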



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2016-02-16 Thread Henry Saputra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149309#comment-15149309
 ] 

Henry Saputra commented on SPARK-5158:
--

All, the PRs for this issue are closed.

This PR: 
https://github.com/apache/spark/pull/265 

was closed claiming that a more recent PR was being worked on, which I assume 
is this one:

https://github.com/apache/spark/pull/4106

but that one was also closed due to inactivity.

Looking at the issues that were closed as duplicates of this one, there is 
a need and interest in letting standalone mode access secured HDFS, given that the 
active user's keytab is already available on the machines that run Spark.

> Allow for keytab-based HDFS security in Standalone mode
> ---
>
> Key: SPARK-5158
> URL: https://issues.apache.org/jira/browse/SPARK-5158
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Matthew Cheah
>Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS 
> clusters in standalone mode. The main reason we haven't accepted these 
> patches has been that they rely on insecure distribution of token files from 
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the 
> Spark driver and executors independently log in and acquire credentials using 
> a keytab. This would work for users who have dedicated, single-tenant 
> Spark clusters (i.e. they are willing to have a keytab on every machine 
> running Spark for their application). It wouldn't address all possible 
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated 
> hardware since they are long-running services.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13308) ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error cases

2016-02-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-13308.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error 
> cases
> --
>
> Key: SPARK-13308
> URL: https://issues.apache.org/jira/browse/SPARK-13308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Spark's OneToOneStreamManager does not free ManagedBuffers that are passed to 
> it except in certain error cases. Instead, ManagedBuffers should be freed 
> once messages created from them are consumed and destroyed by lower layers of 
> the Netty networking code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10759) Missing Python code example in ML Programming guide

2016-02-16 Thread Jeremy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149266#comment-15149266
 ] 

Jeremy commented on SPARK-10759:


Cannot add example for code that doesn't exist.

> Missing Python code example in ML Programming guide
> ---
>
> Key: SPARK-10759
> URL: https://issues.apache.org/jira/browse/SPARK-10759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Raela Wang
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13346) DataFrame caching is not handled well during planning or execution

2016-02-16 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-13346:
-

 Summary: DataFrame caching is not handled well during planning or 
execution
 Key: SPARK-13346
 URL: https://issues.apache.org/jira/browse/SPARK-13346
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Joseph K. Bradley


I have an iterative algorithm based on DataFrames, and the query plan grows 
very quickly with each iteration.  Caching the current DataFrame at the end of 
an iteration does not fix the problem.  However, converting the DataFrame to an 
RDD and back at the end of each iteration does fix the problem.

Printing the query plans shows that the plan explodes quickly (10 lines, to 
several hundred lines, to several thousand lines, ...) with successive 
iterations.

The desired behavior is for the analyzer to recognize that a big chunk of the 
query plan does not need to be computed since it is already cached.  The 
computation on each iteration should be the same.

If useful, I can push (complex) code to reproduce the issue.  But it should be 
simple to see if you create an iterative algorithm which produces a new 
DataFrame from an old one on each iteration.
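
The RDD round-trip workaround mentioned above, as a minimal sketch (initialDF, updateStep, and numIterations are placeholders):

{code}
var df = initialDF
for (i <- 0 until numIterations) {
  df = updateStep(df)                                 // produces a new DataFrame
  df = sqlContext.createDataFrame(df.rdd, df.schema)  // truncates the accumulated plan lineage
  df.cache()
}
{code}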



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13346) DataFrame caching is not handled well during planning or execution

2016-02-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149227#comment-15149227
 ] 

Joseph K. Bradley commented on SPARK-13346:
---

CC: [~andrewor14] [~joshrosen] whom I spoke with about this issue

> DataFrame caching is not handled well during planning or execution
> --
>
> Key: SPARK-13346
> URL: https://issues.apache.org/jira/browse/SPARK-13346
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>
> I have an iterative algorithm based on DataFrames, and the query plan grows 
> very quickly with each iteration.  Caching the current DataFrame at the end 
> of an iteration does not fix the problem.  However, converting the DataFrame 
> to an RDD and back at the end of each iteration does fix the problem.
> Printing the query plans shows that the plan explodes quickly (10 lines, to 
> several hundred lines, to several thousand lines, ...) with successive 
> iterations.
> The desired behavior is for the analyzer to recognize that a big chunk of the 
> query plan does not need to be computed since it is already cached.  The 
> computation on each iteration should be the same.
> If useful, I can push (complex) code to reproduce the issue.  But it should 
> be simple to see if you create an iterative algorithm which produces a new 
> DataFrame from an old one on each iteration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13345) Adding one way ANOVA to Spark ML stat

2016-02-16 Thread yuhao yang (JIRA)
yuhao yang created SPARK-13345:
--

 Summary: Adding one way ANOVA to Spark ML stat
 Key: SPARK-13345
 URL: https://issues.apache.org/jira/browse/SPARK-13345
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang
Priority: Minor


One-way ANOVA (https://en.wikipedia.org/wiki/One-way_analysis_of_variance) is 
used to determine whether there are any significant differences between the 
means of three or more independent (unrelated) groups. 

One prototype is at 
https://github.com/hhbyyh/StatisticsOnSpark/blob/master/src/main/ANOVA/OneWayANOVA.scala

I'll send a PR if this is a feature of interest. This can be further enriched 
with post-hoc and factorial ANOVA.
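
For reference, a plain-Scala sketch of the one-way ANOVA F statistic (toy data, not the proposed Spark API):

{code}
val groups = Seq(Seq(1.0, 2.0, 3.0), Seq(2.0, 3.0, 4.0), Seq(5.0, 6.0, 7.0))
val k = groups.size
val n = groups.map(_.size).sum
val grandMean = groups.flatten.sum / n
val ssBetween = groups.map(g => g.size * math.pow(g.sum / g.size - grandMean, 2)).sum
val ssWithin  = groups.map { g =>
  val m = g.sum / g.size
  g.map(x => math.pow(x - m, 2)).sum
}.sum
// Compare F against the F(k - 1, n - k) distribution to assess significance.
val f = (ssBetween / (k - 1)) / (ssWithin / (n - k))
{code}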



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12154) Upgrade to Jersey 2

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12154:


Assignee: Apache Spark

> Upgrade to Jersey 2
> ---
>
> Key: SPARK-12154
> URL: https://issues.apache.org/jira/browse/SPARK-12154
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core
>Affects Versions: 1.5.2
>Reporter: Matt Cheah
>Assignee: Apache Spark
>
> Fairly self-explanatory: Jersey 1 is a bit old and could use an upgrade. 
> Library conflicts for Jersey are difficult to work around - see the discussion on 
> SPARK-11081. It's easier to upgrade Jersey entirely, but we should target 
> Spark 2.0 since this may be a breaking change for users who were using Jersey 1 
> in their Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12154) Upgrade to Jersey 2

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149176#comment-15149176
 ] 

Apache Spark commented on SPARK-12154:
--

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/11223

> Upgrade to Jersey 2
> ---
>
> Key: SPARK-12154
> URL: https://issues.apache.org/jira/browse/SPARK-12154
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core
>Affects Versions: 1.5.2
>Reporter: Matt Cheah
>
> Fairly self-explanatory: Jersey 1 is a bit old and could use an upgrade. 
> Library conflicts for Jersey are difficult to work around - see the discussion on 
> SPARK-11081. It's easier to upgrade Jersey entirely, but we should target 
> Spark 2.0 since this may be a breaking change for users who were using Jersey 1 
> in their Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12154) Upgrade to Jersey 2

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12154:


Assignee: (was: Apache Spark)

> Upgrade to Jersey 2
> ---
>
> Key: SPARK-12154
> URL: https://issues.apache.org/jira/browse/SPARK-12154
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core
>Affects Versions: 1.5.2
>Reporter: Matt Cheah
>
> Fairly self-explanatory: Jersey 1 is a bit old and could use an upgrade. 
> Library conflicts for Jersey are difficult to work around - see the discussion on 
> SPARK-11081. It's easier to upgrade Jersey entirely, but we should target 
> Spark 2.0 since this may be a breaking change for users who were using Jersey 1 
> in their Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149155#comment-15149155
 ] 

Xiao Li commented on SPARK-1:
-

[~josephkb] I found the root cause. : ) In the genCode of Randn and Rand, the 
seed is user-provided, but the partition ID can differ from partition to partition:
{code}
  s"$rngTerm = new $className(${seed}L + 
org.apache.spark.TaskContext.getPartitionId());")
{code}

If you remove that, you will get the right answer. 

{code}
  s"$rngTerm = new $className(${seed}L);")
{code}


> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13280) FileBasedWriteAheadLog logger name should be under o.a.s namespace

2016-02-16 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-13280.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.0.0

> FileBasedWriteAheadLog logger name should be under o.a.s namespace
> --
>
> Key: SPARK-13280
> URL: https://issues.apache.org/jira/browse/SPARK-13280
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.0.0
>
>
> The logger name in FileBasedWriteAheadLog is currently defined as:
> {code}
>   override protected val logName = s"WriteAheadLogManager $callerNameTag"
> {code}
> That has two problems:
> - It's not under the usual "org.apache.spark" namespace so changing the 
> logging configuration for that package does not affect it
> - we've seen cases where {{$callerNameTag}} was empty, in which case the 
> logger name would have a trailing space, making it impossible to disable it 
> using a properties file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2016-02-16 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149139#comment-15149139
 ] 

Alexander Ulanov commented on SPARK-9273:
-

Hi [~gsateesh110],

Besides the one mentioned by Yuhao, there is SparkNet, which allows using Caffe. 

In the future, I plan to switch the present neural network implementation in Spark 
to tensors, and probably implement CNN, which is easier with tensors: 
https://github.com/avulanov/spark/tree/mlp-tensor

Best regards, Alexander

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11701) YARN - dynamic allocation and speculation active task accounting wrong

2016-02-16 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-11701.
---
Resolution: Duplicate

> YARN - dynamic allocation and speculation active task accounting wrong
> --
>
> Key: SPARK-11701
> URL: https://issues.apache.org/jira/browse/SPARK-11701
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
>
> I am using dynamic container allocation and speculation and am seeing issues 
> with the active task accounting.  The Executor UI still shows active tasks on 
> an executor but the job/stage is all completed.  I think it's also 
> affecting the dynamic allocation's ability to release containers because it 
> thinks there are still tasks.
> It's easy to reproduce using spark-shell: turn on dynamic allocation, then 
> run a wordcount on a decent-sized file, save back to HDFS, and set the 
> speculation parameters low: 
>  spark.dynamicAllocation.enabled true
>  spark.shuffle.service.enabled true
>  spark.dynamicAllocation.maxExecutors 10
>  spark.dynamicAllocation.minExecutors 2
>  spark.dynamicAllocation.initialExecutors 10
>  spark.dynamicAllocation.executorIdleTimeout 40s
> $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf 
> spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 
> --master yarn --deploy-mode client  --executor-memory 4g --driver-memory 4g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13027) Add API for updateStateByKey to provide batch time as input

2016-02-16 Thread Aaditya Ramesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149126#comment-15149126
 ] 

Aaditya Ramesh commented on SPARK-13027:


Hi [~zsxwing] sorry to bump this again, I've submitted a new patch. Could you 
take a look when you get a chance?

> Add API for updateStateByKey to provide batch time as input
> ---
>
> Key: SPARK-13027
> URL: https://issues.apache.org/jira/browse/SPARK-13027
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Aaditya Ramesh
>
> The StateDStream currently does not provide the batch time as input to the 
> state update function. This is required in cases where the behavior depends 
> on the batch start time.
> We (Conviva) have been patching it manually for the past several Spark 
> versions but we thought it might be useful for others as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11701) YARN - dynamic allocation and speculation active task accounting wrong

2016-02-16 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-11701:
--
Description: 
I am using dynamic container allocation and speculation and am seeing issues 
with the active task accounting.  The Executor UI still shows active tasks on 
an executor but the job/stage is all completed.  I think it's also affecting 
the dynamic allocation's ability to release containers because it thinks there 
are still tasks.

It's easy to reproduce using spark-shell: turn on dynamic allocation, then run 
a wordcount on a decent-sized file, save back to HDFS, and set the 
speculation parameters low: 

 spark.dynamicAllocation.enabled true
 spark.shuffle.service.enabled true
 spark.dynamicAllocation.maxExecutors 10
 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors 10
 spark.dynamicAllocation.executorIdleTimeout 40s


$SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf 
spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 --master 
yarn --deploy-mode client  --executor-memory 4g --driver-memory 4g

  was:
I am using dynamic container allocation and speculation and am seeing issues 
with the active task accounting.  The Executor UI still shows active tasks on 
an executor even though the job/stage has completed.  I think this is also 
preventing dynamic allocation from releasing containers, because it thinks there 
are still tasks.

It is easily reproduced with spark-shell: turn on dynamic allocation, run a 
word count on a decent-sized file, and set the speculation parameters low: 

 spark.dynamicAllocation.enabled true
 spark.shuffle.service.enabled true
 spark.dynamicAllocation.maxExecutors 10
 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors 10
 spark.dynamicAllocation.executorIdleTimeout 40s


$SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf 
spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 --master 
yarn --deploy-mode client  --executor-memory 4g --driver-memory 4g


> YARN - dynamic allocation and speculation active task accounting wrong
> --
>
> Key: SPARK-11701
> URL: https://issues.apache.org/jira/browse/SPARK-11701
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
>
> I am using dynamic container allocation and speculation and am seeing issues 
> with the active task accounting.  The Executor UI still shows active tasks on 
> an executor even though the job/stage has completed.  I think this is also 
> preventing dynamic allocation from releasing containers, because it thinks 
> there are still tasks.
> It is easily reproduced with spark-shell: turn on dynamic allocation, run a 
> word count on a decent-sized file, save the result back to HDFS, and set the 
> speculation parameters low: 
>  spark.dynamicAllocation.enabled true
>  spark.shuffle.service.enabled true
>  spark.dynamicAllocation.maxExecutors 10
>  spark.dynamicAllocation.minExecutors 2
>  spark.dynamicAllocation.initialExecutors 10
>  spark.dynamicAllocation.executorIdleTimeout 40s
> $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf 
> spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 
> --master yarn --deploy-mode client  --executor-memory 4g --driver-memory 4g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13344) SaveLoadSuite has many accumulator exceptions

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149106#comment-15149106
 ] 

Apache Spark commented on SPARK-13344:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11222

> SaveLoadSuite has many accumulator exceptions
> -
>
> Key: SPARK-13344
> URL: https://issues.apache.org/jira/browse/SPARK-13344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is because SparkFunSuite clears all accumulators after every single 
> test. This suite reuses a DF and all of its associated internal accumulators 
> across many tests.
> This is likely caused by SPARK-10620.
> {code}
> 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered 
> unregistered accumulator 253 when reconstructing task metrics.
> 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update 
> accumulators for task 0
> org.apache.spark.SparkException: attempted to access non-existent accumulator 
> 253
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13344) SaveLoadSuite has many accumulator exceptions

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13344:


Assignee: Andrew Or  (was: Apache Spark)

> SaveLoadSuite has many accumulator exceptions
> -
>
> Key: SPARK-13344
> URL: https://issues.apache.org/jira/browse/SPARK-13344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is because SparkFunSuite clears all accumulators after every single 
> test. This suite reuses a DF and all of its associated internal accumulators 
> across many tests.
> This is likely caused by SPARK-10620.
> {code}
> 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered 
> unregistered accumulator 253 when reconstructing task metrics.
> 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update 
> accumulators for task 0
> org.apache.spark.SparkException: attempted to access non-existent accumulator 
> 253
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13344) SaveLoadSuite has many accumulator exceptions

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13344:


Assignee: Apache Spark  (was: Andrew Or)

> SaveLoadSuite has many accumulator exceptions
> -
>
> Key: SPARK-13344
> URL: https://issues.apache.org/jira/browse/SPARK-13344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> This is because SparkFunSuite clears all accumulators after every single 
> test. This suite reuses a DF and all of its associated internal accumulators 
> across many tests.
> This is likely caused by SPARK-10620.
> {code}
> 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered 
> unregistered accumulator 253 when reconstructing task metrics.
> 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update 
> accumulators for task 0
> org.apache.spark.SparkException: attempted to access non-existent accumulator 
> 253
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13344) SaveLoadSuite has many accumulator exceptions

2016-02-16 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13344:
-

 Summary: SaveLoadSuite has many accumulator exceptions
 Key: SPARK-13344
 URL: https://issues.apache.org/jira/browse/SPARK-13344
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


This is because SparkFunSuite clears all accumulators after every single test. 
This suite reuses a DF and all of its associated internal accumulators across 
many tests.

This is likely caused by SPARK-10620.

{code}
10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered 
unregistered accumulator 253 when reconstructing task metrics.
10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update 
accumulators for task 0
org.apache.spark.SparkException: attempted to access non-existent accumulator 
253
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12976) Add LazilyGenerateOrdering and use it for RangePartitioner of Exchange.

2016-02-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-12976.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10894
[https://github.com/apache/spark/pull/10894]

> Add LazilyGenerateOrdering and use it for RangePartitioner of Exchange.
> ---
>
> Key: SPARK-12976
> URL: https://issues.apache.org/jira/browse/SPARK-12976
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.0.0
>
>
> Add LazilyGenerateOrdering to support generated ordering for RangePartitioner 
> of Exchange instead of InterpretedOrdering.
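
For context, a rough sketch of the idea (hypothetical class and names, not the actual Spark implementation): wrap the code-generated ordering so it is created lazily on each node, instead of evaluating the ordering with InterpretedOrdering row by row.

{code}
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical sketch (assumed names, not the actual Spark code): defer
// creation of the generated comparator until first use, so the wrapper can be
// shipped to executors cheaply and codegen happens lazily where it is needed.
class LazilyGeneratedOrdering(makeOrdering: () => Ordering[InternalRow])
  extends Ordering[InternalRow] with Serializable {

  // Regenerated after deserialization on each executor, on first comparison.
  @transient private lazy val ordering: Ordering[InternalRow] = makeOrdering()

  override def compare(a: InternalRow, b: InternalRow): Int =
    ordering.compare(a, b)
}
{code}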



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12976) Add LazilyGenerateOrdering and use it for RangePartitioner of Exchange.

2016-02-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12976:
---
Assignee: Takuya Ueshin

> Add LazilyGenerateOrdering and use it for RangePartitioner of Exchange.
> ---
>
> Key: SPARK-12976
> URL: https://issues.apache.org/jira/browse/SPARK-12976
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>
> Add LazilyGenerateOrdering to support generated ordering for RangePartitioner 
> of Exchange instead of InterpretedOrdering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13343) speculative tasks that didn't commit shouldn't be marked as success

2016-02-16 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-13343:
-

 Summary: speculative tasks that didn't commit shouldn't be marked 
as success
 Key: SPARK-13343
 URL: https://issues.apache.org/jira/browse/SPARK-13343
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Thomas Graves


Currently, speculative tasks that didn't commit can show up as successes or 
failures (depending on the timing of the commit). This is a bit confusing 
because such a task didn't really succeed, in the sense that it didn't write 
anything.

I think these tasks should be marked as KILLED, or as something that makes it 
more obvious to the user exactly what happened. If a task happens to hit the 
timing where it gets a commit-denied exception, it shows up as failed and 
counts against your task failures. It shouldn't count against task failures, 
since that failure really doesn't matter.

MapReduce handles these situations, so perhaps we can look there for a model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149053#comment-15149053
 ] 

Joseph K. Bradley commented on SPARK-1:
---

I now have a much more complex example which does not use unionAll.  But it's 
still an issue with randn, so I suspect it's the same bug.  If needed, I can 
push a branch, but it's a mess of code.

[~smilegator] Thanks for taking a look.  I'll keep watching the JIRA!

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.{col, randn}
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13242) Moderately complex `when` expression causes code generation failure

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13242:


Assignee: Apache Spark

> Moderately complex `when` expression causes code generation failure
> ---
>
> Key: SPARK-13242
> URL: https://issues.apache.org/jira/browse/SPARK-13242
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Joe Halliwell
>Assignee: Apache Spark
>
> Moderately complex `when` expressions produce generated code that busts the 
> 64KB method limit. This causes code generation to fail.
> Here's a test case exhibiting the problem: 
> https://github.com/joehalliwell/spark/commit/4dbdf6e15d1116b8e1eb44822fd29ead9b7d817d
> I'm interested in working on a fix. I'm thinking it may be possible to split 
> the expressions along the lines of SPARK-8443, but any pointers would be 
> welcome!
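
As a rough illustration (assumed DataFrame `df` with an integer `id` column; a hypothetical repro sketch, not the linked test case), chaining many branches into a single `when` expression yields one very large generated method, which can exceed the JVM's 64KB-per-method bytecode limit and make code generation fail:

{code}
import org.apache.spark.sql.functions.{col, lit, when}

// Build a when/otherwise chain with many branches; on affected versions the
// generated evaluation code for this single expression can grow past 64KB.
val manyBranches = (1 to 500).foldLeft(when(col("id") === 0, lit("zero"))) {
  (expr, i) => expr.when(col("id") === i, lit("value-" + i))
}

val labeled = df.withColumn("label", manyBranches.otherwise(lit("other")))
labeled.show()  // may fail during code generation on affected versions
{code}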



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13242) Moderately complex `when` expression causes code generation failure

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149036#comment-15149036
 ] 

Apache Spark commented on SPARK-13242:
--

User 'joehalliwell' has created a pull request for this issue:
https://github.com/apache/spark/pull/11221

> Moderately complex `when` expression causes code generation failure
> ---
>
> Key: SPARK-13242
> URL: https://issues.apache.org/jira/browse/SPARK-13242
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Joe Halliwell
>
> Moderately complex `when` expressions produce generated code that busts the 
> 64KB method limit. This causes code generation to fail.
> Here's a test case exhibiting the problem: 
> https://github.com/joehalliwell/spark/commit/4dbdf6e15d1116b8e1eb44822fd29ead9b7d817d
> I'm interested in working on a fix. I'm thinking it may be possible to split 
> the expressions along the lines of SPARK-8443, but any pointers would be 
> welcome!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13242) Moderately complex `when` expression causes code generation failure

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13242:


Assignee: (was: Apache Spark)

> Moderately complex `when` expression causes code generation failure
> ---
>
> Key: SPARK-13242
> URL: https://issues.apache.org/jira/browse/SPARK-13242
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Joe Halliwell
>
> Moderately complex `when` expressions produce generated code that busts the 
> 64KB method limit. This causes code generation to fail.
> Here's a test case exhibiting the problem: 
> https://github.com/joehalliwell/spark/commit/4dbdf6e15d1116b8e1eb44822fd29ead9b7d817d
> I'm interested in working on a fix. I'm thinking it may be possible to split 
> the expressions along the lines of SPARK-8443, but any pointers would be 
> welcome!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13342) Cannot run INSERT statements in Spark

2016-02-16 Thread neo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

neo updated SPARK-13342:

Priority: Critical  (was: Major)

> Cannot run INSERT statements in Spark
> -
>
> Key: SPARK-13342
> URL: https://issues.apache.org/jira/browse/SPARK-13342
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1, 1.6.0
>Reporter: neo
>Priority: Critical
>
> I cannot run an INSERT statement using spark-sql. I tried both versions 
> 1.5.1 and 1.6.0 without any luck, but it runs fine in Hive.
> These are the steps I took.
> 1) Launch Hive and create the table, then insert a record:
> create database test
> use test
> CREATE TABLE stgTable
> (
> sno string,
> total bigint
> );
> INSERT INTO TABLE stgTable VALUES ('12',12)
> 2) Launch spark-sql (1.5.1 or 1.6.0).
> 3) Try inserting a record from the shell:
> INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1
> I got this error message: 
> "Invalid method name: 'alter_table_with_cascade'"
> I tried changing the Hive version inside the spark-sql shell using the SET 
> command.
> I changed the hive version
> from
> SET spark.sql.hive.version=1.2.1  (this is the default setting for my Spark 
> installation)
> to
> SET spark.sql.hive.version=0.14.0
> but that did not help either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13327) colnames()<- allows invalid column names

2016-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149017#comment-15149017
 ] 

Apache Spark commented on SPARK-13327:
--

User 'olarayej' has created a pull request for this issue:
https://github.com/apache/spark/pull/11220

> colnames()<- allows invalid column names
> 
>
> Key: SPARK-13327
> URL: https://issues.apache.org/jira/browse/SPARK-13327
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>
> colnames<- fails if:
> 1) Given colnames contain "."
> 2) Given colnames contain NA
> 3) Given colnames are not character
> 4) Given colnames have a different length than the dataset's (a SparkSQL 
> error is thrown, but it is not user friendly)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13327) colnames()<- allows invalid column names

2016-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13327:


Assignee: Apache Spark

> colnames()<- allows invalid column names
> 
>
> Key: SPARK-13327
> URL: https://issues.apache.org/jira/browse/SPARK-13327
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>Assignee: Apache Spark
>
> colnames<- fails if:
> 1) Given colnames contain "."
> 2) Given colnames contain NA
> 3) Given colnames are not character
> 4) Given colnames have a different length than the dataset's (a SparkSQL 
> error is thrown, but it is not user friendly)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


