[jira] [Created] (SPARK-17207) Comparing Vector in relative tolerance or absolute tolerance in UnitTests error

2016-08-23 Thread Peng Meng (JIRA)
Peng Meng created SPARK-17207:
-

 Summary: Comparing Vector in relative tolerance or absolute 
tolerance in UnitTests error 
 Key: SPARK-17207
 URL: https://issues.apache.org/jira/browse/SPARK-17207
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Peng Meng


Comparing two vectors with the test utilities in 
org.apache.spark.mllib.util.TestingUtils sometimes gives a wrong result.
For example:
val a = Vectors.dense(Array(1.0, 2.0))
val b = Vectors.zeros(0)
a ~== b absTol 1e-1 // evaluates to true, even though the vectors have different sizes. 
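For illustration, a minimal sketch of the size check that would make such a 
comparison behave as expected (this only restates the expected behaviour; it is 
not the actual TestingUtils implementation):

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper: two vectors can only be "approximately equal" if they
// have the same number of elements; otherwise the comparison should never be true.
def approxEqual(a: Vector, b: Vector, absTol: Double): Boolean =
  a.size == b.size &&
    a.toArray.zip(b.toArray).forall { case (x, y) => math.abs(x - y) <= absTol }

val a = Vectors.dense(Array(1.0, 2.0))
val b = Vectors.zeros(0)
approxEqual(a, b, 1e-1)  // false, as one would expect for vectors of different sizes
{code}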





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss

2016-08-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434195#comment-15434195
 ] 

Hyukjin Kwon commented on SPARK-17174:
--

[~cloud_fan] I realised that {{Returns date ...}} might imply truncating the time 
part, so I closed the documentation change.

BTW, do you think those functions should change the return type according to 
the input type? It seems some DBMSs do.
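As a side note, a hedged workaround sketch for the truncation shown in the quoted 
report below; it reuses the {{df}} and {{DateWithTS}} column from that snippet, 
re-attaches the original time of day after {{add_months}} has produced a date, 
and ignores DST corner cases (illustrative only, not a proposed fix):

{code}
import org.apache.spark.sql.functions._

// Seconds past midnight in the original timestamp.
val secondsIntoDay =
  unix_timestamp(col("DateWithTS")) -
    unix_timestamp(col("DateWithTS").cast("date").cast("timestamp"))

// add_months yields a date; convert back to a timestamp and re-add the offset.
val dfWithTime = df.withColumn(
  "NewDateWithTS",
  (unix_timestamp(add_months(col("DateWithTS"), 1).cast("timestamp")) + secondsIntoDay)
    .cast("timestamp"))
{code}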

> Provide support for Timestamp type Column in add_months function to return 
> HH:mm:ss
> ---
>
> Key: SPARK-17174
> URL: https://issues.apache.org/jira/browse/SPARK-17174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Amit Baghel
>Priority: Minor
>
> The add_months function currently supports Date types. If the Column is of 
> Timestamp type, it adds the months to the date but doesn't return the 
> timestamp part (HH:mm:ss). See the code below.
> {code}
> import java.util.Calendar
> val now = Calendar.getInstance().getTime()
> val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new 
> java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS")
> df.withColumn("NewDateWithTS", add_months(df("DateWithTS"),1)).show
> {code}
> The above code gives the following output. Note that HH:mm:ss is missing from 
> the NewDateWithTS column.
> {code}
> +---+--------------------+-------------+
> | ID|          DateWithTS|NewDateWithTS|
> +---+--------------------+-------------+
> |  0|2016-01-21 09:38:...|   2016-02-21|
> |  1|2016-02-21 09:38:...|   2016-03-21|
> |  2|2016-03-21 09:38:...|   2016-04-21|
> |  3|2016-04-21 09:38:...|   2016-05-21|
> +---+--------------------+-------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17198) ORC fixed char literal filter does not work

2016-08-23 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434191#comment-15434191
 ] 

Dongjoon Hyun commented on SPARK-17198:
---

Hi, [~tuming].
The reported error scenario in HIVE-11312 seems to work without problems in 
Spark 2.0, as shown below.

{code}
scala> sql("create table orc_test( col1 string, col2 char(10)) stored as orc 
tblproperties ('orc.compress'='NONE')")
scala> sql("insert into orc_test values ('val1', '1')")
scala> sql("select * from orc_test where col2='1'").show
+----+----+
|col1|col2|
+----+----+
|val1|   1|
+----+----+
scala> spark.version
res3: String = 2.0.0
{code}

Could you give us some reproducible examples?

> ORC fixed char literal filter does not work
> ---
>
> Key: SPARK-17198
> URL: https://issues.apache.org/jira/browse/SPARK-17198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: tuming
>
> I got a wrong result when I ran the following query in Spark SQL: 
> select * from orc_table where char_col ='5LZS';
> Table orc_table is an ORC format table.
> Column char_col is defined as char(6). 
> The Hive record reader returns a char(6) string to Spark, and Spark has no 
> fixed-width char type: all fixed char attributes are converted to String by 
> default. Meanwhile, the constant literal is parsed to a string Literal, so the 
> equality comparison never returns true. 
> For instance: '5LZS'=='5LZS  '.
> But I get the correct result in Hive using the same data and SQL string, because 
> Hive appends spaces to such constant literals. Please refer to:
> https://issues.apache.org/jira/browse/HIVE-11312
> I found there is no such patch for Spark.
>  
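To make the padding mismatch above concrete, a minimal Scala sketch (assuming a 
CHAR(6) column, as stated in the report):

{code}
// The Hive record reader hands back the CHAR(6) value padded to its declared
// width, while Spark parses the constant as a plain string literal.
val stored  = "5LZS  "   // value as returned for the char(6) column
val literal = "5LZS"     // value produced for the constant '5LZS'

stored == literal                               // false -> the filter never matches
stored == literal + " " * (6 - literal.length)  // true  -> what padding the literal (HIVE-11312) achieves
{code}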



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17167) Issue Exceptions when Analyze Table on In-Memory Cataloged Tables

2016-08-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434181#comment-15434181
 ] 

Apache Spark commented on SPARK-17167:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14781

> Issue Exceptions when Analyze Table on In-Memory Cataloged Tables
> -
>
> Key: SPARK-17167
> URL: https://issues.apache.org/jira/browse/SPARK-17167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, `Analyze Table` is only supported for Hive-serde tables. We should 
> issue exceptions in all other cases. When the tables are data source tables, we 
> issue an exception. However, when the tables are In-Memory Cataloged tables, we 
> do not issue any exception.
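To illustrate the gap described above, a minimal spark-shell sketch (hedged: 
exact syntax and exception type aside, the point is that the temporary-view case 
currently passes silently where the other non-Hive cases raise):

{code}
// A temporary view is registered in the in-memory catalog.
spark.range(10).createOrReplaceTempView("tmp_view")

// Per this issue, this should raise an exception like the data source table
// case does, but today it completes without complaint.
sql("ANALYZE TABLE tmp_view COMPUTE STATISTICS")
{code}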



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434154#comment-15434154
 ] 

Siddharth Murching commented on SPARK-3162:
---

Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)

https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit on one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.
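To make option (1) concrete, here is a self-contained control-flow sketch; every 
name in it ({{Node}}, {{splitLevelDistributed}}, {{growSubtreeLocally}}, 
{{maxLocalRows}}) is hypothetical and only illustrates the hand-off from 
level-wise distributed training to local subtree training, not Spark's actual 
DecisionTree internals:

{code}
import org.apache.spark.rdd.RDD

case class Node(id: Int, rowCount: Long)  // hypothetical bookkeeping record

// Placeholder for the existing distributed level-wise split computation.
def splitLevelDistributed(data: RDD[(Int, Array[Double])], nodes: Seq[Node]): Seq[Node] =
  nodes.flatMap(n => Seq(Node(2 * n.id + 1, n.rowCount / 2), Node(2 * n.id + 2, n.rowCount / 2)))

// Placeholder for single-machine training of the whole subtree under `node`.
def growSubtreeLocally(node: Node, rows: Array[(Int, Array[Double])]): Unit = ()

def trainTree(data: RDD[(Int, Array[Double])],  // (current node id, feature vector)
              maxLocalRows: Long): Unit = {
  var open: Seq[Node] = Seq(Node(0, data.count()))  // the root owns all rows

  // Distributed phase: train whole levels at once until every open node is small.
  while (open.exists(_.rowCount > maxLocalRows)) {
    open = splitLevelDistributed(data, open)
  }

  // Local phase: each remaining node's rows fit in one machine's memory, so
  // collect them and finish that subtree without further distributed passes.
  open.foreach { node =>
    val localRows = data.filter { case (id, _) => id == node.id }.collect()
    growSubtreeLocally(node, localRows)
  }
}
{code}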



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16822) Support latex in scaladoc with MathJax

2016-08-23 Thread Jagadeesan A S (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadeesan A S closed SPARK-16822.
--
Resolution: Fixed

> Support latex in scaladoc with MathJax
> --
>
> Key: SPARK-16822
> URL: https://issues.apache.org/jira/browse/SPARK-16822
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Shuai Lin
>Assignee: Shuai Lin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The scaladoc of some classes (mainly ml/mllib classes) includes math formulas, 
> but currently they render very poorly, e.g. [the doc of the LogisticGradient 
> class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient].
> We can improve this by including the MathJax javascript in the scaladoc pages, 
> much like what we do for the markdown docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17206) Support ANALYZE TABLE on analyzable temporary table/view

2016-08-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434131#comment-15434131
 ] 

Apache Spark commented on SPARK-17206:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/14780

> Support ANALYZE TABLE on analyzable temporary table/view
> 
>
> Key: SPARK-17206
> URL: https://issues.apache.org/jira/browse/SPARK-17206
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently the ANALYZE TABLE DDL command doesn't work on temporary views. 
> However, for the specific kinds of temporary view that are analyzable, we can 
> support the DDL command, so that the CBO can work with temporary views too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17206) Support ANALYZE TABLE on analyzable temporary table/view

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17206:


Assignee: Apache Spark

> Support ANALYZE TABLE on analyzable temporary table/view
> 
>
> Key: SPARK-17206
> URL: https://issues.apache.org/jira/browse/SPARK-17206
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Currently the ANALYZE TABLE DDL command doesn't work on temporary views. 
> However, for the specific kinds of temporary view that are analyzable, we can 
> support the DDL command, so that the CBO can work with temporary views too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17206) Support ANALYZE TABLE on analyzable temporary table/view

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17206:


Assignee: (was: Apache Spark)

> Support ANALYZE TABLE on analyzable temporary table/view
> 
>
> Key: SPARK-17206
> URL: https://issues.apache.org/jira/browse/SPARK-17206
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently the ANALYZE TABLE DDL command doesn't work on temporary views. 
> However, for the specific kinds of temporary view that are analyzable, we can 
> support the DDL command, so that the CBO can work with temporary views too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17206) Support ANALYZE TABLE on analyzable temporary table/view

2016-08-23 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-17206:
---

 Summary: Support ANALYZE TABLE on analyzable temporary table/view
 Key: SPARK-17206
 URL: https://issues.apache.org/jira/browse/SPARK-17206
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


Currently the ANALYZE TABLE DDL command doesn't work on temporary views. However, 
for the specific kinds of temporary view that are analyzable, we can support the 
DDL command, so that the CBO can work with temporary views too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6235) Address various 2G limits

2016-08-23 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-6235:
---
Attachment: SPARK-6235_Design_V0.01.pdf

Preliminary Design Document.

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6235_Design_V0.01.pdf
>
>
> An umbrella ticket to track the various 2G limits we have in Spark due to the 
> use of byte arrays and ByteBuffers.
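For context, a small sketch of where the 2G ceiling comes from (standard JVM 
behaviour, not Spark-specific code):

{code}
import java.nio.ByteBuffer

// JVM arrays and ByteBuffers are indexed by Int, so a single array or buffer
// tops out at Integer.MAX_VALUE (~2 GB) elements/bytes. Anything in Spark that
// is backed by one byte array or one ByteBuffer inherits that ceiling.
val maxSingleBufferBytes: Long = Int.MaxValue  // 2147483647, i.e. 2 GB - 1 byte

// Requesting more is not even representable as an Int capacity:
// ByteBuffer.allocate((maxSingleBufferBytes + 1).toInt)  // overflows to a
// negative capacity and throws IllegalArgumentException
{code}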



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-3162:
--
Comment: was deleted

(was: Here's a design doc with proposed changes - any comments/feedback are 
much appreciated :)
https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing
)

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit on one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17201) Investigate numerical instability for MLOR without regularization

2016-08-23 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434073#comment-15434073
 ] 

Weichen Xu commented on SPARK-17201:


Yeah, you are right...

I searched for a proof of this, for example:
http://math.stackexchange.com/questions/209237/why-does-the-standard-bfgs-update-rule-preserve-positive-definiteness

For BFGS, it can be proven that the approximate Hessian remains positive definite.

Thanks for looking into this problem so deeply!

> Investigate numerical instability for MLOR without regularization
> -
>
> Key: SPARK-17201
> URL: https://issues.apache.org/jira/browse/SPARK-17201
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> As mentioned 
> [here|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression], when no 
> regularization is applied in Softmax regression, second order Newton solvers 
> may run into numerical instability problems. We should investigate this in 
> practice and find a solution, possibly by implementing pivoting when no 
> regularization is applied.
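For reference, the redundancy behind the instability described above can be 
written down directly: shifting every class's weight vector by the same amount 
leaves the softmax probabilities unchanged, so without regularization the 
objective has a flat direction and the Hessian is singular (a sketch in standard 
notation):

{noformat}
P(y = k \mid x) = \frac{\exp(\theta_k^\top x)}{\sum_{j=1}^{K} \exp(\theta_j^\top x)}
                = \frac{\exp((\theta_k - \psi)^\top x)}{\sum_{j=1}^{K} \exp((\theta_j - \psi)^\top x)}
                  \quad \text{for any shift } \psi
{noformat}

Newton-type solvers then have to invert a (near-)singular Hessian, which is where 
the numerical trouble comes from; fixing one class's weights, or pivoting as the 
description suggests, removes the redundancy.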



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16862.
-
   Resolution: Fixed
 Assignee: Tejas Patil
Fix Version/s: 2.1.0

> Configurable buffer size in `UnsafeSorterSpillReader`
> -
>
> Key: SPARK-16862
> URL: https://issues.apache.org/jira/browse/SPARK-16862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k 
> buffer to read data off disk. This could be made configurable to improve on 
> disk reads.
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53
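A hedged sketch of the idea (plain java.io, no Spark internals; the default size 
chosen here is illustrative only, not the value any patch actually uses):

{code}
import java.io.{BufferedInputStream, File, FileInputStream}

// Instead of BufferedInputStream's default 8 KB buffer, open the spill file
// with a caller-supplied buffer size (e.g. 1 MB) to reduce the number of
// small disk reads while reading spills back.
def openSpillFile(file: File, bufferSizeBytes: Int = 1024 * 1024): BufferedInputStream =
  new BufferedInputStream(new FileInputStream(file), bufferSizeBytes)
{code}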



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17174:


Assignee: Apache Spark

> Provide support for Timestamp type Column in add_months function to return 
> HH:mm:ss
> ---
>
> Key: SPARK-17174
> URL: https://issues.apache.org/jira/browse/SPARK-17174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Amit Baghel
>Assignee: Apache Spark
>Priority: Minor
>
> The add_months function currently supports Date types. If the Column is of 
> Timestamp type, it adds the months to the date but doesn't return the 
> timestamp part (HH:mm:ss). See the code below.
> {code}
> import java.util.Calendar
> val now = Calendar.getInstance().getTime()
> val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new 
> java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS")
> df.withColumn("NewDateWithTS", add_months(df("DateWithTS"),1)).show
> {code}
> The above code gives the following output. Note that HH:mm:ss is missing from 
> the NewDateWithTS column.
> {code}
> +---+--------------------+-------------+
> | ID|          DateWithTS|NewDateWithTS|
> +---+--------------------+-------------+
> |  0|2016-01-21 09:38:...|   2016-02-21|
> |  1|2016-02-21 09:38:...|   2016-03-21|
> |  2|2016-03-21 09:38:...|   2016-04-21|
> |  3|2016-04-21 09:38:...|   2016-05-21|
> +---+--------------------+-------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss

2016-08-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434038#comment-15434038
 ] 

Apache Spark commented on SPARK-17174:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14778

> Provide support for Timestamp type Column in add_months function to return 
> HH:mm:ss
> ---
>
> Key: SPARK-17174
> URL: https://issues.apache.org/jira/browse/SPARK-17174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Amit Baghel
>Priority: Minor
>
> The add_months function currently supports Date types. If the Column is of 
> Timestamp type, it adds the months to the date but doesn't return the 
> timestamp part (HH:mm:ss). See the code below.
> {code}
> import java.util.Calendar
> val now = Calendar.getInstance().getTime()
> val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new 
> java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS")
> df.withColumn("NewDateWithTS", add_months(df("DateWithTS"),1)).show
> {code}
> The above code gives the following output. Note that HH:mm:ss is missing from 
> the NewDateWithTS column.
> {code}
> +---+--------------------+-------------+
> | ID|          DateWithTS|NewDateWithTS|
> +---+--------------------+-------------+
> |  0|2016-01-21 09:38:...|   2016-02-21|
> |  1|2016-02-21 09:38:...|   2016-03-21|
> |  2|2016-03-21 09:38:...|   2016-04-21|
> |  3|2016-04-21 09:38:...|   2016-05-21|
> +---+--------------------+-------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17174:


Assignee: (was: Apache Spark)

> Provide support for Timestamp type Column in add_months function to return 
> HH:mm:ss
> ---
>
> Key: SPARK-17174
> URL: https://issues.apache.org/jira/browse/SPARK-17174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Amit Baghel
>Priority: Minor
>
> The add_months function currently supports Date types. If the Column is of 
> Timestamp type, it adds the months to the date but doesn't return the 
> timestamp part (HH:mm:ss). See the code below.
> {code}
> import java.util.Calendar
> val now = Calendar.getInstance().getTime()
> val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new 
> java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS")
> df.withColumn("NewDateWithTS", add_months(df("DateWithTS"),1)).show
> {code}
> The above code gives the following output. Note that HH:mm:ss is missing from 
> the NewDateWithTS column.
> {code}
> +---+--------------------+-------------+
> | ID|          DateWithTS|NewDateWithTS|
> +---+--------------------+-------------+
> |  0|2016-01-21 09:38:...|   2016-02-21|
> |  1|2016-02-21 09:38:...|   2016-03-21|
> |  2|2016-03-21 09:38:...|   2016-04-21|
> |  3|2016-04-21 09:38:...|   2016-05-21|
> +---+--------------------+-------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434031#comment-15434031
 ] 

Siddharth Murching edited comment on SPARK-3162 at 8/24/16 1:37 AM:


Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
Design doc link: 
[Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing]



was (Author: siddharth murching):
Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
[Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing]


> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit on one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434031#comment-15434031
 ] 

Siddharth Murching edited comment on SPARK-3162 at 8/24/16 1:37 AM:


Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
[Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing]



was (Author: siddharth murching):
Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
[Link](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing)


> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit on one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434031#comment-15434031
 ] 

Siddharth Murching edited comment on SPARK-3162 at 8/24/16 1:37 AM:


Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing



was (Author: siddharth murching):
Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
Design doc link: 
[Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing]


> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit on one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434031#comment-15434031
 ] 

Siddharth Murching commented on SPARK-3162:
---

Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
[Link](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing)


> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit on one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public

2016-08-23 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434030#comment-15434030
 ] 

Xusen Yin commented on SPARK-16581:
---

Sure, no problem.

> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it will be good to expose some of the R -> JVM functions 
> we have. 
> As a part of this we could also rename, reformat the functions to make them 
> more user friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17205:


Assignee: Josh Rosen  (was: Apache Spark)

> Literal.sql does not properly convert NaN and Infinity literals
> ---
>
> Key: SPARK-17205
> URL: https://issues.apache.org/jira/browse/SPARK-17205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these 
> needs to be special-cased instead of simply appending a suffix to the string 
> representation of the value
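A sketch of the failure mode (hedged: the exact suffix Spark appends is an 
assumption here; the point is that the string forms of {{NaN}} and {{Infinity}} 
are not valid SQL literals):

{code}
// Appending a numeric type suffix to the value's toString works for ordinary
// doubles but not for the special values:
1.5d.toString + "D"                      // "1.5D"      -> valid SQL
Double.NaN.toString + "D"                // "NaND"      -> not parseable
Double.PositiveInfinity.toString + "D"   // "InfinityD" -> not parseable
// A special-cased form would emit something like CAST('NaN' AS DOUBLE) instead.
{code}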



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals

2016-08-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433962#comment-15433962
 ] 

Apache Spark commented on SPARK-17205:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14777

> Literal.sql does not properly convert NaN and Infinity literals
> ---
>
> Key: SPARK-17205
> URL: https://issues.apache.org/jira/browse/SPARK-17205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these 
> needs to be special-cased instead of simply appending a suffix to the string 
> representation of the value



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17205:


Assignee: Apache Spark  (was: Josh Rosen)

> Literal.sql does not properly convert NaN and Infinity literals
> ---
>
> Key: SPARK-17205
> URL: https://issues.apache.org/jira/browse/SPARK-17205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Minor
>
> {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these 
> needs to be special-cased instead of simply appending a suffix to the string 
> representation of the value



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals

2016-08-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17205:
---
Description: {{Literal.sql}} mishandles NaN and Infinity literals: the 
handling of these needs to be special-cased instead of simply appending a 
suffix to the string representation of the value  (was: {{Literal.sql}} 
mishandles NaN and Infinity literals: the handling of these needs to be 
special-cased.)

> Literal.sql does not properly convert NaN and Infinity literals
> ---
>
> Key: SPARK-17205
> URL: https://issues.apache.org/jira/browse/SPARK-17205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these 
> needs to be special-cased instead of simply appending a suffix to the string 
> representation of the value



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals

2016-08-23 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17205:
--

 Summary: Literal.sql does not properly convert NaN and Infinity 
literals
 Key: SPARK-17205
 URL: https://issues.apache.org/jira/browse/SPARK-17205
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Minor


{{Literal.sql}} mishandles NaN and Infinity literals: the handling of these 
needs to be special-cased.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-23 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17204:
---
Summary: Spark 2.0 off heap RDD persistence with replication factor 2 leads 
to in-memory data corruption  (was: Spark 2.0 off heap RDD persistence with 
replication factor 2 leads to data corruption)

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively. We've tried off-heap storage 
> with replication factor 2 and have always received exceptions on the executor 
> side very shortly after starting the job. For example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> 

[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption

2016-08-23 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17204:
---
Description: 
We use the OFF_HEAP storage level extensively. We've tried off-heap storage 
with replication factor 2 and have always received exceptions on the executor 
side very shortly after starting the job. For example:

{code}
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

or

{code}
java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at 
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 

[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption

2016-08-23 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433878#comment-15433878
 ] 

Michael Allman commented on SPARK-17204:


[~rxin] I rebuilt from master as of commit 
8fd63e808e15c8a7e78fef847183c86f332daa91 (which includes 
https://github.com/apache/spark/commit/8e223ea67acf5aa730ccf688802f17f6fc10907c)
 and am still experiencing this issue. I'll work on instructions to reproduce 
next.
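For context, a minimal sketch of persisting with the storage level described in 
the quoted report below, using only the public {{StorageLevel}} factory (the flag 
combination mirrors {{StorageLevel.OFF_HEAP}} with replication raised to 2; treat 
it as illustrative):

{code}
import org.apache.spark.storage.StorageLevel

// Off-heap, serialized, replicated twice.
val offHeapReplicated2 = StorageLevel(
  useDisk = true, useMemory = true, useOffHeap = true,
  deserialized = false, replication = 2)

val rdd = sc.parallelize(1 to 1000000)
rdd.persist(offHeapReplicated2)
rdd.count()
{code}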

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to data 
> corruption
> -
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively. We've tried off-heap storage 
> with replication factor 2 and have always received exceptions on the executor 
> side very shortly after starting the job. For example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> 

[jira] [Commented] (SPARK-17099) Incorrect result when HAVING clause is added to group by query

2016-08-23 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433849#comment-15433849
 ] 

Herman van Hovell commented on SPARK-17099:
---

A small update: disregard my previous diagnosis. This is caused by a bug in the 
{{EliminateOuterJoin}} rule, which converts the Right Outer join into an Inner 
join; see the following optimizer log:
{noformat}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Project [sum(coalesce(int_col_5, int_col_2))#34L, (coalesce(int_col_5, 
int_col_2) * 2)#32] 




   Project [sum(coalesce(int_col_5, 
int_col_2))#34L, (coalesce(int_col_5, int_col_2) * 2)#32]
 +- Filter (isnotnull(sum(cast(coalesce(int_col_5#4, int_col_2#13) as 
bigint))#37L) && (sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L 
> cast((coalesce(int_col_5#4, int_col_2#13)#38 * 2) as bigint)))



  +- Filter 
(isnotnull(sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L) && 
(sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L > 
cast((coalesce(int_col_5#4, int_col_2#13)#38 * 2) as bigint)))
+- Aggregate [greatest(coalesce(int_col_5#14, 109), coalesce(int_col_5#4, 
-449)), coalesce(int_col_5#4, int_col_2#13)], [sum(cast(coalesce(int_col_5#4, 
int_col_2#13) as bigint)) AS sum(coalesce(int_col_5, int_col_2))#34L, 
(coalesce(int_col_5#4, int_col_2#13) * 2) AS (coalesce(int_col_5, int_col_2) * 
2)#32, sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint)) AS 
sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L, 
coalesce(int_col_5#4, int_col_2#13) AS coalesce(int_col_5#4, int_col_2#13)#38]  
+- Aggregate [greatest(coalesce(int_col_5#14, 109), coalesce(int_col_5#4, 
-449)), coalesce(int_col_5#4, int_col_2#13)], [sum(cast(coalesce(int_col_5#4, 
int_col_2#13) as bigint)) AS sum(coalesce(int_col_5, int_col_2))#34L, 
(coalesce(int_col_5#4, int_col_2#13) * 2) AS (coalesce(int_col_5, int_col_2) * 
2)#32, sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint)) AS 
sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L, 
coalesce(int_col_5#4, int_col_2#13) AS coalesce(int_col_5#4, int_col_2#13)#38]
   +- Filter isnotnull(coalesce(int_col_5#4, int_col_2#13)) 





 +- Filter 
isnotnull(coalesce(int_col_5#4, int_col_2#13))
! +- Join RightOuter, (int_col_2#13 = int_col_5#4)  





+- Join Inner, (int_col_2#13 = 
int_col_5#4)
 :- Project [value#2 AS int_col_5#4]





   :- Project [value#2 AS 
int_col_5#4]
 :  +- SerializeFromObject [input[0, int, true] AS value#2] 




[jira] [Commented] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation

2016-08-23 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433833#comment-15433833
 ] 

Herman van Hovell commented on SPARK-17120:
---

PR https://github.com/apache/spark/pull/14661 fixes this

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
> Key: SPARK-17120
> URL: https://issues.apache.org/jira/browse/SPARK-17120
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>   *
>   FROM (
>   SELECT
>   COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>   FROM table_3 t1
>   LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but 
> nevertheless the number of rows produced should equal the number of rows in 
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer 
> {{where}} should retain all rows, so the overall result of this query should 
> contain a single row with the value '97'.
> Instead, the current Spark master (as of 
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking 
> at {{explain}}, it appears that the logical plan is optimizing to 
> {{LocalRelation }}, so Spark doesn't even run the query. My suspicion 
> is that there's a bug in constraint propagation or filter pushdown.
> This issue doesn't seem to affect Spark 2.0, so I think it's a regression in 
> master. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation

2016-08-23 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433780#comment-15433780
 ] 

Herman van Hovell edited comment on SPARK-17120 at 8/23/16 11:02 PM:
-

TL;DR the {{EliminateOuterJoin}} rule converts the outer join into an Inner 
join:
{noformat}
16/08/24 00:55:46 TRACE SparkOptimizer: 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16] Project 
[coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
 +- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))+- Filter 
isnotnull(coalesce(int_col_1#12, int_col_6#4))
!   +- Join LeftOuter, false+- Join 
Inner, false
   :- Project [value#2 AS int_col_6#4] :- 
Project [value#2 AS int_col_6#4]
   :  +- SerializeFromObject [input[0, int, true] AS value#2]  :  
+- SerializeFromObject [input[0, int, true] AS value#2]
   : +- ExternalRDD [obj#1]:
 +- ExternalRDD [obj#1]
   +- Project [value#10 AS int_col_1#12]   +- 
Project [value#10 AS int_col_1#12]
  +- SerializeFromObject [input[0, int, true] AS value#10]
+- SerializeFromObject [input[0, int, true] AS value#10]
 +- ExternalRDD [obj#9] 
 +- ExternalRDD [obj#9]
{noformat}
It incorrectly assumes that a non-null literal cannot be, well... non-null, and 
then converts the join. 

BTW: set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this. 
Also use {{sc.setLogLevel("TRACE")}} to see what the optimizer is doing.
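
For convenience, a minimal spark-shell sketch that puts those two settings together with the repro from the description (a sketch only; it assumes a recent master build and the default shell bindings {{spark}} and {{sc}}):
{code}
// Minimal repro sketch for spark-shell; assumes a recent master build.
spark.conf.set("spark.sql.crossJoin.enabled", "true") // the rewritten "Join Inner, false" is a cross join
sc.setLogLevel("TRACE")                               // log each optimizer rule application

sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")

val df = spark.sql("""
  SELECT * FROM (
    SELECT COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
    FROM table_3 t1 LEFT JOIN table_4 t2 ON false
  ) t WHERE (t.int_col) IS NOT NULL
""")
df.explain(true)  // on affected builds the optimized plan collapses to an empty LocalRelation
df.show()         // expected: a single row (97); affected builds return no rows
{code}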

(updated this: my first attempt at diagnosis was way off).


was (Author: hvanhovell):
TL;DR the {{EliminateOuterJoin}} rule converts the outer join into an Inner 
join:
{noformat}
16/08/24 00:55:46 TRACE SparkOptimizer: 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16] Project 
[coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
 +- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))+- Filter 
isnotnull(coalesce(int_col_1#12, int_col_6#4))
!   +- Join LeftOuter, false+- Join 
Inner, false
   :- Project [value#2 AS int_col_6#4] :- 
Project [value#2 AS int_col_6#4]
   :  +- SerializeFromObject [input[0, int, true] AS value#2]  :  
+- SerializeFromObject [input[0, int, true] AS value#2]
   : +- ExternalRDD [obj#1]:
 +- ExternalRDD [obj#1]
   +- Project [value#10 AS int_col_1#12]   +- 
Project [value#10 AS int_col_1#12]
  +- SerializeFromObject [input[0, int, true] AS value#10]
+- SerializeFromObject [input[0, int, true] AS value#10]
 +- ExternalRDD [obj#9] 
 +- ExternalRDD [obj#9]
{noformat}
It incorrectly assumes that a non-null literal cannot be, well... non-null, and 
then converts the join. 

BTW: set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this. 
Also use {{sc.setLogLevel("TRACE")}} to see what the optimizer is doing.

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
> Key: SPARK-17120
> URL: https://issues.apache.org/jira/browse/SPARK-17120
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>   *
>   FROM (
>   SELECT
>   COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>   FROM table_3 t1
>   LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but 
> nevertheless the number of rows produced should equal the number of rows in 
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer 
> {{where}} should retain all rows, so the overall result of this query should 
> contain a single row with the value '97'.
> Instead, the current Spark master (as of 
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking 
> at {{explain}}, it appears that the logical plan is optimizing to 
> {{LocalRelation }}, so Spark doesn't even run the query. My suspicion 
> is that there's a bug in constraint propagation or filter pushdown.
> This issue 

[jira] [Comment Edited] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation

2016-08-23 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433780#comment-15433780
 ] 

Herman van Hovell edited comment on SPARK-17120 at 8/23/16 11:01 PM:
-

TL;DR the {{EliminateOuterJoin}} rule converts the outer join into an Inner 
join:
{noformat}
16/08/24 00:55:46 TRACE SparkOptimizer: 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16] Project 
[coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
 +- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))+- Filter 
isnotnull(coalesce(int_col_1#12, int_col_6#4))
!   +- Join LeftOuter, false+- Join 
Inner, false
   :- Project [value#2 AS int_col_6#4] :- 
Project [value#2 AS int_col_6#4]
   :  +- SerializeFromObject [input[0, int, true] AS value#2]  :  
+- SerializeFromObject [input[0, int, true] AS value#2]
   : +- ExternalRDD [obj#1]:
 +- ExternalRDD [obj#1]
   +- Project [value#10 AS int_col_1#12]   +- 
Project [value#10 AS int_col_1#12]
  +- SerializeFromObject [input[0, int, true] AS value#10]
+- SerializeFromObject [input[0, int, true] AS value#10]
 +- ExternalRDD [obj#9] 
 +- ExternalRDD [obj#9]
{noformat}
It incorrectly assumes that a non-null literal cannot be, well... non-null, and 
then converts the join. 

BTW: set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this. 
Also use {{sc.setLogLevel("TRACE")}} to see what the optimizer is doing.


was (Author: hvanhovell):
TL;DR the {{PushDownPredicate}} rule pushed the {{false}} join predicate down, 
into the left hand side of the join (which should have been the right hand 
side). This caused the {{EliminateOuterJoin}} rule to rewrite this into an 
inner join.

The optimized plan before disabling the {{PushDownPredicate}} rule (I had to 
disable the {{PruneFilters}} rule to prevent the plan from being erased):
{noformat}
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Join Inner
   :- Project [value#2 AS int_col_6#4]
   :  +- Filter false
   : +- SerializeFromObject [input[0, int, true] AS value#2]
   :+- ExternalRDD [obj#1]
   +- Project [value#10 AS int_col_1#12]
  +- SerializeFromObject [input[0, int, true] AS value#10]
 +- ExternalRDD [obj#9]
{noformat}

The optimized plan after disabling the {{PushDownPredicate}} rule:
{noformat}
== Optimized Logical Plan ==
Filter isnotnull(int_col#16)
+- Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
   +- Join LeftOuter, false
  :- Project [value#2 AS int_col_6#4]
  :  +- SerializeFromObject [input[0, int, true] AS value#2]
  : +- ExternalRDD [obj#1]
  +- Project [value#10 AS int_col_1#12]
 +- SerializeFromObject [input[0, int, true] AS value#10]
+- ExternalRDD [obj#9]
{noformat}

Btw set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this.

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
> Key: SPARK-17120
> URL: https://issues.apache.org/jira/browse/SPARK-17120
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>   *
>   FROM (
>   SELECT
>   COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>   FROM table_3 t1
>   LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but 
> nevertheless the number of rows produced should equal the number of rows in 
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer 
> {{where}} should retain all rows, so the overall result of this query should 
> contain a single row with the value '97'.
> Instead, the current Spark master (as of 
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking 
> at {{explain}}, it appears that the logical plan is optimizing to 
> {{LocalRelation }}, so Spark doesn't even run the query. My suspicion 
> is that there's a bug in constraint propagation or filter pushdown.
> This issue doesn't seem to affect Spark 2.0, so I think it's a regression in 
> master. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-

[jira] [Commented] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation

2016-08-23 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433780#comment-15433780
 ] 

Herman van Hovell commented on SPARK-17120:
---

TL;DR the {{PushDownPredicate}} rule pushed the {{false}} join predicate down, 
into the left hand side of the join (which should have been the right hand 
side). This caused the {{EliminateOuterJoin}} rule to rewrite this into an 
inner join.

The optimized plan before disabling the {{PushDownPredicate}} rule (I had to 
disable the {{PruneFilters}} rule to prevent the plan from being erased):
{noformat}
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Join Inner
   :- Project [value#2 AS int_col_6#4]
   :  +- Filter false
   : +- SerializeFromObject [input[0, int, true] AS value#2]
   :+- ExternalRDD [obj#1]
   +- Project [value#10 AS int_col_1#12]
  +- SerializeFromObject [input[0, int, true] AS value#10]
 +- ExternalRDD [obj#9]
{noformat}

The optimized plan after disabling the {{PushDownPredicate}} rule:
{noformat}
== Optimized Logical Plan ==
Filter isnotnull(int_col#16)
+- Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
   +- Join LeftOuter, false
  :- Project [value#2 AS int_col_6#4]
  :  +- SerializeFromObject [input[0, int, true] AS value#2]
  : +- ExternalRDD [obj#1]
  +- Project [value#10 AS int_col_1#12]
 +- SerializeFromObject [input[0, int, true] AS value#10]
+- ExternalRDD [obj#9]
{noformat}

Btw set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this.

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
> Key: SPARK-17120
> URL: https://issues.apache.org/jira/browse/SPARK-17120
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>   *
>   FROM (
>   SELECT
>   COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>   FROM table_3 t1
>   LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but 
> nevertheless the number of rows produced should equal the number of rows in 
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer 
> {{where}} should retain all rows, so the overall result of this query should 
> contain a single row with the value '97'.
> Instead, the current Spark master (as of 
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking 
> at {{explain}}, it appears that the logical plan is optimizing to 
> {{LocalRelation }}, so Spark doesn't even run the query. My suspicion 
> is that there's a bug in constraint propagation or filter pushdown.
> This issue doesn't seem to affect Spark 2.0, so I think it's a regression in 
> master. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17201) Investigate numerical instability for MLOR without regularization

2016-08-23 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433754#comment-15433754
 ] 

Seth Hendrickson edited comment on SPARK-17201 at 8/23/16 10:24 PM:


Restating some of what was said on github:

_Concern is that for softmax regression without regularization, the Hessian 
becomes singular and Newton methods can run into problems. Excerpt from this 
[link|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression]: "Thus, the 
minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus 
gradient descent will not run into a local optima problems. But the Hessian is 
singular/non-invertible, which causes a straightforward implementation of 
Newton's method to run into numerical problems.)"_

I looked into this. It is true that for softmax regression the Hessian is 
Symmetric positive _semidefinite_, not symmetric positive definite. There is a 
good-enough proof of such [here|http://qwone.com/~jason/writing/convexLR.pdf]. 
Still consider the quote from the resources mentioned above "... which causes a 
*straightforward* implementation of Newton's method to run into numerical 
problems." It's true the lack of positive definiteness can be a problem for 
*naive* Newton methods, but LBFGS is not a straightforward implementation - it 
does not use the Hessian directly, but it uses an approximation to the Hessian. 
In fact, there is an abundance of resources showing that as long as the 
initial Hessian approximation is symmetric positive definite, then the 
subsequent recursive updates are also symmetric positive definite. From one 
resource: 

"H(-1)_(n + 1) is positive definite (psd) when H^(-1)_n is. Assuming our 
initial guess of H0 is psd, it follows by induction each inverse Hessian 
estimate is as well. Since we can choose any H^(-1)_0 we want, including the 
identity matrix, this is easy to ensure."

I appreciate other opinions on this to make sure I am understanding things 
correctly. Seems like LBFGS will work fine even without regularization. Have we 
seen this problem in practice? cc [~dbtsai] [~WeichenXu123]


was (Author: sethah):
Restating some of what was said on github:

_Concern is that for softmax regression without regularization, the Hessian 
becomes singular and Newton methods can run into problems. Excerpt from this 
[link|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression]: "Thus, the 
minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus 
gradient descent will not run into local optima problems. But the Hessian is 
singular/non-invertible, which causes a straightforward implementation of 
Newton's method to run into numerical problems.)"_

I looked into this. It is true that for softmax regression the Hessian is 
Symmetric positive _semidefinite_, not symmetric positive definite. There is a 
good-enough proof of such [here|http://qwone.com/~jason/writing/convexLR.pdf]. 
Still consider the quote from the resources mentioned above "... which causes a 
*straightforward* implementation of Newton's method to run into numerical 
problems." It's true the lack of positive definiteness can be a problem for 
*naive* Newton methods, but LBFGS is not a straightforward implementation - it 
does not use the Hessian directly, but it uses an approximation to the Hessian. 
In fact, there is an abundance of resources showing that as long as the 
initial Hessian approximation is symmetric positive definite, then the 
subsequent recursive updates are also symmetric positive definite. From one 
resource: 

"H(-1)_(n + 1) is positive definite (psd) when H^(-1)_n is. Assuming our 
initial guess of H0 is psd, it follows by induction each inverse Hessian 
estimate is as well. Since we can choose any H^(-1)_0 we want, including the 
identity matrix, this is easy to ensure."

I appreciate other opinions on this to make sure I am understanding things 
correctly. Seems like LBFGS will work fine even without regularization. cc 
[~dbtsai] [~WeichenXu123]

> Investigate numerical instability for MLOR without regularization
> -
>
> Key: SPARK-17201
> URL: https://issues.apache.org/jira/browse/SPARK-17201
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> As mentioned 
> [here|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression], when no 
> regularization is applied in Softmax regression, second order Newton solvers 
> may run into numerical instability problems. We should investigate this in 
> practice and find a solution, possibly by implementing pivoting when no 
> regularization is applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Commented] (SPARK-17201) Investigate numerical instability for MLOR without regularization

2016-08-23 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433754#comment-15433754
 ] 

Seth Hendrickson commented on SPARK-17201:
--

Restating some of what was said on github:

_Concern is that for softmax regression without regularization, the Hessian 
becomes singular and Newton methods can run into problems. Excerpt from this 
[link|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression]: "Thus, the 
minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus 
gradient descent will not run into local optima problems. But the Hessian is 
singular/non-invertible, which causes a straightforward implementation of 
Newton's method to run into numerical problems.)"_

I looked into this. It is true that for softmax regression the Hessian is 
Symmetric positive _semidefinite_, not symmetric positive definite. There is a 
good-enough proof of such [here|http://qwone.com/~jason/writing/convexLR.pdf]. 
Still consider the quote from the resources mentioned above "... which causes a 
*straightforward* implementation of Newton's method to run into numerical 
problems." It's true the lack of positive definiteness can be a problem for 
*naive* Newton methods, but LBFGS is not a straightforward implementation - it 
does not use the Hessian directly, but it uses an approximation to the Hessian. 
In fact, there is an abundance of resources showing that as long as the 
initial Hessian approximation is symmetric positive definite, then the 
subsequent recursive updates are also symmetric positive definite. From one 
resource: 

"H(-1)_(n + 1) is positive definite (psd) when H^(-1)_n is. Assuming our 
initial guess of H0 is psd, it follows by induction each inverse Hessian 
estimate is as well. Since we can choose any H^(-1)_0 we want, including the 
identity matrix, this is easy to ensure."

I appreciate other opinions on this to make sure I am understanding things 
correctly. Seems like LBFGS will work fine even without regularization. cc 
[~dbtsai] [~WeichenXu123]
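
To make the point concrete, here is a small toy sketch (not MLOR itself) using Breeze's LBFGS, the optimizer MLlib builds on: the objective f(x) = (x1 + x2 - 1)^2 has a PSD but singular Hessian and a non-unique minimizer, yet LBFGS converges without trouble. The object and settings below are illustrative only.
{code}
// Toy sketch: a convex objective with a singular (rank-1) Hessian and a
// non-unique minimizer, mirroring the over-parameterized softmax case.
// Breeze's LBFGS still converges because its inverse-Hessian approximation
// stays symmetric positive definite.
import breeze.linalg.{DenseVector, sum}
import breeze.optimize.{DiffFunction, LBFGS}

object SingularHessianToy {
  def main(args: Array[String]): Unit = {
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val r = sum(x) - 1.0                          // residual x1 + x2 - 1
        (r * r, DenseVector.fill(x.length)(2.0 * r))  // value and gradient
      }
    }
    val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 10, tolerance = 1e-9)
    val xStar = lbfgs.minimize(f, DenseVector.zeros[Double](2))
    println(s"solution = $xStar, objective = ${f.valueAt(xStar)}") // objective ~ 0
  }
}
{code}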

> Investigate numerical instability for MLOR without regularization
> -
>
> Key: SPARK-17201
> URL: https://issues.apache.org/jira/browse/SPARK-17201
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> As mentioned 
> [here|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression], when no 
> regularization is applied in Softmax regression, second order Newton solvers 
> may run into numerical instability problems. We should investigate this in 
> practice and find a solution, possibly by implementing pivoting when no 
> regularization is applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17200) Automate building and testing on Windows

2016-08-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433739#comment-15433739
 ] 

Hyukjin Kwon commented on SPARK-17200:
--

Thank you so much!

> Automate building and testing on Windows 
> -
>
> Key: SPARK-17200
> URL: https://issues.apache.org/jira/browse/SPARK-17200
> Project: Spark
>  Issue Type: Test
>  Components: Build, Project Infra
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> It seems there are no automated tests on Windows (I am not sure whether this 
> is being done manually before each release).
> Judging from this comment, 
> https://github.com/apache/spark/pull/14743#issuecomment-241473794, it seems 
> we have Windows infrastructure in the AMPLab Jenkins cluster.
> This seems quite important because, as far as I know, we currently have to 
> manually test and verify patches related to Windows-specific problems.
> For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794
> I was thinking of a combination of Travis CI and Docker with a Windows image. 
> Although this might not be merged, I will try to give this a shot (at 
> least for SparkR) anyway (just to verify some PRs I linked above).
> I would appreciate hearing any thoughts about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17099) Incorrect result when HAVING clause is added to group by query

2016-08-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17099:
---
Priority: Blocker  (was: Critical)

> Incorrect result when HAVING clause is added to group by query
> --
>
> Key: SPARK-17099
> URL: https://issues.apache.org/jira/browse/SPARK-17099
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Random query generation uncovered the following query which returns incorrect 
> results when run on Spark SQL. This wasn't the original query uncovered by 
> the generator, since I performed a bit of minimization to try to make it more 
> understandable.
> With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
>   Seq(
> (-769, -244),
> (-800, -409),
> (940, 86),
> (-507, 304),
> (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
>   (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
>  ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
>   ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
>  COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, 
> t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four 
> rows. However, if I omit the {{HAVING}} clause I see that the group's rows 
> are being incorrectly filtered by the {{HAVING}} clause:
> {code}
> +--------------------------------------+---------------------------------------+
> | sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2)  |
> +--------------------------------------+---------------------------------------+
> | -507                                 | -1014                                 |
> | 940                                  | 1880                                  |
> | -769                                 | -1538                                 |
> | -367                                 | -734                                  |
> | -800                                 | -1600                                 |
> +--------------------------------------+---------------------------------------+
> {code}
> Based on this, the output after adding the {{HAVING}} should contain four 
> rows, not zero.
> I'm not sure how to further shrink this in a straightforward way, so I'm 
> opening this bug to get help in triaging further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation

2016-08-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17120:
---
Target Version/s: 2.0.1, 2.1.0  (was: 2.1.0)

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
> Key: SPARK-17120
> URL: https://issues.apache.org/jira/browse/SPARK-17120
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>   *
>   FROM (
>   SELECT
>   COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>   FROM table_3 t1
>   LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but 
> nevertheless the number of rows produced should equal the number of rows in 
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer 
> {{where}} should retain all rows, so the overall result of this query should 
> contain a single row with the value '97'.
> Instead, the current Spark master (as of 
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking 
> at {{explain}}, it appears that the logical plan is optimizing to 
> {{LocalRelation }}, so Spark doesn't even run the query. My suspicion 
> is that there's a bug in constraint propagation or filter pushdown.
> This issue doesn't seem to affect Spark 2.0, so I think it's a regression in 
> master. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17194) When emitting SQL for string literals Spark should use single quotes, not double

2016-08-23 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17194.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> When emitting SQL for string literals Spark should use single quotes, not 
> double
> 
>
> Key: SPARK-17194
> URL: https://issues.apache.org/jira/browse/SPARK-17194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> When Spark emits SQL for a string literal, it should wrap the string in 
> single quotes, not double quotes. Databases which adhere more strictly to the 
> ANSI SQL standards, such as Postgres, allow only single-quotes to be used for 
> denoting string literals (see http://stackoverflow.com/a/1992331/590203).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public

2016-08-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433492#comment-15433492
 ] 

Shivaram Venkataraman commented on SPARK-16581:
---

[~yinxusen] I created a PR for this as we are trying to get the CRAN release 
out and it will be good to have this in it. Apologies if this resulted in 
duplicate work.

[~felixcheung] Could you comment on the PR what kind of redesign you have in 
mind for the S4 class etc. ?

> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it will be good to expose some of the R -> JVM functions 
> we have. 
> As a part of this we could also rename, reformat the functions to make them 
> more user friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public

2016-08-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433479#comment-15433479
 ] 

Apache Spark commented on SPARK-16581:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/14775

> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it will be good to expose some of the R -> JVM functions 
> we have. 
> As a part of this we could also rename, reformat the functions to make them 
> more user friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16581) Making JVM backend calling functions public

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16581:


Assignee: (was: Apache Spark)

> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it will be good to expose some of the R -> JVM functions 
> we have. 
> As a part of this we could also rename, reformat the functions to make them 
> more user friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16581) Making JVM backend calling functions public

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16581:


Assignee: Apache Spark

> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it will be good to expose some of the R -> JVM functions 
> we have. 
> As a part of this we could also rename, reformat the functions to make them 
> more user friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16508) Fix documentation warnings found by R CMD check

2016-08-23 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-16508.
---
   Resolution: Fixed
 Assignee: Junyang Qian
Fix Version/s: 2.1.0
   2.0.1

Resolved by https://github.com/apache/spark/pull/14705 and 
https://github.com/apache/spark/pull/14734

> Fix documentation warnings found by R CMD check
> ---
>
> Key: SPARK-16508
> URL: https://issues.apache.org/jira/browse/SPARK-16508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Junyang Qian
> Fix For: 2.0.1, 2.1.0
>
>
> A full list of warnings after the fixes in SPARK-16507 is at 
> https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption

2016-08-23 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433408#comment-15433408
 ] 

Michael Allman commented on SPARK-17204:


[~rxin] I'll give it a try. Thanks for the heads up. I missed that Jira/PR.

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to data 
> corruption
> -
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively. We've tried off-heap storage 
> with replication factor 2 and have always received exceptions on the executor 
> side very shortly after starting the job. For example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 

[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption

2016-08-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433394#comment-15433394
 ] 

Reynold Xin commented on SPARK-17204:
-

Does this problem still exist on today's master/branch-2.0? 

SPARK-16550 was merged. It might be fixed already.
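
For reference, a rough sketch of how one could re-test this on current master/branch-2.0; the off-heap size, data, and app name below are illustrative placeholders, not the original workload:
{code}
// Hypothetical re-test sketch: persist an RDD off-heap with replication 2.
// Assumes off-heap memory is enabled; the settings are illustrative only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("OffHeapReplication2Check")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "1g")
  .getOrCreate()

// OFF_HEAP with replication 2, built via the StorageLevel factory.
val offHeapRep2 = StorageLevel(useDisk = true, useMemory = true, useOffHeap = true,
  deserialized = false, replication = 2)

val rdd = spark.sparkContext.parallelize(1 to 1000000, 100).map(i => (i % 100, i.toLong))
rdd.persist(offHeapRep2)
println(rdd.values.sum())  // force evaluation; watch executor logs for Kryo errors
{code}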

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to data 
> corruption
> -
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively. We've tried off-heap storage 
> with replication factor 2 and have always received exceptions on the executor 
> side very shortly after starting the job. For example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> 

[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2016-08-23 Thread yeshwanth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1541#comment-1541
 ] 

yeshwanth commented on SPARK-5928:
--

I ran into this issue in a production job:


org.apache.spark.shuffle.FetchFailedException: Too large frame: 4323231670
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:97)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Too large frame: 4323231670
at 
org.spark-project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
at 
org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:82)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
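
The number in that error is telling: 4323231670 bytes is just over 4 GB, well past the signed 32-bit limit of 2147483647 bytes. As a rough illustration (a hypothetical sketch, not Spark's actual TransportFrameDecoder code), the failure amounts to a guard like this:
{code}
// Hypothetical sketch of a 2 GB frame-size guard; not Spark's actual code.
object FrameSizeCheck {
  val MaxFrameBytes: Long = Int.MaxValue.toLong  // 2147483647 (~2 GB)

  def check(frameSize: Long): Unit =
    require(frameSize <= MaxFrameBytes, s"Too large frame: $frameSize")

  def main(args: Array[String]): Unit =
    check(4323231670L)  // the frame size from the stack trace above; throws
}
{code}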



> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I 

[jira] [Comment Edited] (SPARK-14560) Cooperative Memory Management for Spillables

2016-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433210#comment-15433210
 ] 

Sean Owen edited comment on SPARK-14560 at 8/23/16 5:11 PM:


I have a few somewhat-specific additional data points:

More memory didn't seem to help. A job that ran comfortably with tens of 
gigabytes total with Java serialization would fail even with almost a terabyte 
of memory available. The memory fraction was at the default of 0.75, or up to 
0.9. I don't think we tried less, on the theory that the shuffle memory ought 
to be tracked as part of the 'storage' memory?

But the same thing happened with the legacy memory manager.

Unhelpfully, the heap appeared full of byte[] and String.

The shuffle involved user classes that were reasonably complex: nested objects 
involving case classes, third-party library classes, etc. None of them were 
registered with Kryo. I tried registering most of them, on the theory that this 
was causing some in-memory serialized representation to become huge. It didn't 
seem to help, but I still wonder if there's a lead there. When Kryo doesn't 
know about a class it serializes its class name first, but not the class names 
of everything in the graph (right?) so it can only make so much difference. 
Java serialization does the same.

For the record, it's just this Spark app that reproduces it:
https://github.com/sryza/aas/blob/1st-edition/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala

I have not tried on Spark 2, only 1.6 (CDH 5.8 flavor).


was (Author: srowen):
I have a few somewhat-specific additional data points:

More memory didn't seem to help. A job that ran comfortably with tens of 
gigabytes total with Java serialization would fail even with almost a terabyte 
of memory available. The memory fraction was at the default of 0.75, or up to 
0.9. I don't think we tried less, on the theory that the shuffle memory ought 
to be tracked as part of the 'storage' memory?

But the same thing happened with the legacy memory manager.

Unhelpfully, the heap appeared full of byte[] and String.

The shuffle involved user classes that were reasonably complex: nested objects 
involving case classes, third-party library classes, etc. None of them were 
registered with Kryo. I tried registering most of them, on the theory that this 
was causing some in-memory serialized representation to become huge. It didn't 
seem to help, but I still wonder if there's a lead there. When Kryo doesn't 
know about a class it serializes its class name first, but not the class names 
of everything in the graph (right?) so it can only make so much difference. 
Java serialization does the same.

For the record, it's just this Spark app that reproduces it:
https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala

I have not tried on Spark 2, only 1.6 (CDH 5.8 flavor).

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}} s used by the old RDD api still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> 

[jira] [Commented] (SPARK-17200) Automate building and testing on Windows

2016-08-23 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433209#comment-15433209
 ] 

Dongjoon Hyun commented on SPARK-17200:
---

Hi, [~hyukjin.kwon].

FYI, for Windows CI, there is AppVeyor ( https://www.appveyor.com/ ), which is 
similar to Travis CI.

Some Apache projects use Travis CI / Windows CI / Jenkins CI in parallel, as in 
the following:

https://github.com/apache/reef/pull/1099

> Automate building and testing on Windows 
> -
>
> Key: SPARK-17200
> URL: https://issues.apache.org/jira/browse/SPARK-17200
> Project: Spark
>  Issue Type: Test
>  Components: Build, Project Infra
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> It seems there are no automated tests on Windows (I am not sure whether this 
> is being done manually before each release).
> Judging from this comment, 
> https://github.com/apache/spark/pull/14743#issuecomment-241473794, it seems 
> we have Windows infrastructure in the AMPLab Jenkins cluster.
> This seems quite important because, as far as I know, we currently have to 
> manually test and verify patches related to Windows-specific problems.
> For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794
> I was thinking of a combination of Travis CI and Docker with a Windows image. 
> Although this might not be merged, I will try to give this a shot (at 
> least for SparkR) anyway (just to verify some PRs I linked above).
> I would appreciate hearing any thoughts about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2016-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433210#comment-15433210
 ] 

Sean Owen commented on SPARK-14560:
---

I have a few somewhat-specific additional data points:

More memory didn't seem to help. A job that ran comfortably with tens of 
gigabytes total with Java serialization would fail even with almost a terabyte 
of memory available. The memory fraction was at the default of 0.75, or up to 
0.9. I don't think we tried less, on the theory that the shuffle memory ought 
to be tracked as part of the 'storage' memory?

But the same thing happened with the legacy memory manager.

Unhelpfully, the heap appeared full of byte[] and String.

The shuffle involved user classes that were reasonably complex: nested objects 
involving case classes, third-party library classes, etc. None of them were 
registered with Kryo. I tried registering most of them, on the theory that this 
was causing some in-memory serialized representation to become huge. It didn't 
seem to help, but I still wonder if there's a lead there. When Kryo doesn't 
know about a class it serializes its class name first, but not the class names 
of everything in the graph (right?) so it can only make so much difference. 
Java serialization does the same.

For the record, it's just this Spark app that reproduces it:
https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala

I have not tried on Spark 2, only 1.6 (CDH 5.8 flavor).
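
For anyone hitting the same thing, a minimal sketch of what "registering them with Kryo" looks like; the case classes below are illustrative placeholders, not the app's actual types:
{code}
// Sketch of Kryo class registration; Trip and TripStats are placeholder types.
import org.apache.spark.SparkConf

case class Trip(license: String, pickupMillis: Long, dropoffMillis: Long)
case class TripStats(count: Long, totalDurationHours: Double)

val conf = new SparkConf()
  .setAppName("KryoRegistrationSketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")  // fail fast on unregistered classes
  .registerKryoClasses(Array(classOf[Trip], classOf[TripStats]))
{code}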

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}} s used by the old RDD api still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  Eg., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all available 15 MB.  Since the 15 
> MB of memory is sufficient, it will not spill, and will continue holding on 
> to all available memory.  But this leaves *no* memory available for the 
> shuffle-write side.  Since the 

[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2016-08-23 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433200#comment-15433200
 ] 

Davies Liu commented on SPARK-14560:


Even with SPARK-4452, we still can't say that the OOM problem is totally fixed, 
because the memory used by Java objects (in the RDDs) can't be measured or 
predicted exactly; in the case that they use more memory than we thought, it 
will still OOM. The Java serializer may use less memory than Kryo, so it helped 
in this case. Also, there is memory that is not tracked by the memory manager, 
for example the buffers used by the shuffle reader; those can grow beyond the 
capacity we reserve for everything else, which could be another cause of OOM in 
large-scale jobs. The workaround could be decreasing the memory fraction 
(spark.memory.fraction) to leave more memory for all that other stuff. Have you 
tried that?
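
A sketch of what that workaround looks like, assuming the Spark 1.6+ unified 
memory manager (the values are purely illustrative, not recommendations):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Shrink the fraction managed for execution+storage so more headroom is left
  // for memory the manager does not track (shuffle-read buffers, user objects, ...).
  .set("spark.memory.fraction", "0.5")
  // Optionally shrink the portion of that fraction protected for storage.
  .set("spark.memory.storageFraction", "0.3")
{code}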

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}} s used by the old RDD api still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  Eg., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all available 15 MB.  Since the 15 
> MB of memory is sufficient, it will not spill, and will continue holding on 
> to all available memory.  But this leaves *no* memory available for the 
> shuffle-write side.  Since the shuffle-write side cannot request the 
> shuffle-read side to free up memory, this leads to an OOM.
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as 
> well, so RDDs can benefit from the cooperative memory management introduced 
> by SPARK-10342.
> Note that an additional improvement would be for the shuffle-read side to 
> simple release unused memory, without spilling, in case that would leave 
> enough memory, and only spill if that was inadequate.  However that can come 
> as a later improvement.
> *Workaround*:  You can set 
> {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to 
> occur every {{N}} 

[jira] [Resolved] (SPARK-13286) JDBC driver doesn't report full exception

2016-08-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13286.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14722
[https://github.com/apache/spark/pull/14722]

> JDBC driver doesn't report full exception
> -
>
> Key: SPARK-13286
> URL: https://issues.apache.org/jira/browse/SPARK-13286
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Adrian Bridgett
>Assignee: Davies Liu
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> Testing some failure scenarios (inserting data into postgresql where there is 
> a schema mismatch) , there is an exception thrown (fine so far) however it 
> doesn't report the actual SQL error.  It refers to a getNextException call 
> but this is beyond my non-existant Java skills to deal with correctly.  
> Supporting this would help users to see the SQL error quickly and resolve the 
> underlying problem.
> {noformat}
> Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO core 
> VALUES('5fdf5...',) was aborted.  Call getNextException to see the cause.
>   at 
> org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl$1.handleError(QueryExecutorImpl.java:457)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1887)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405)
>   at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:248)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
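
For anyone hitting this before upgrading, a hedged sketch of how the chained 
SQL errors can be surfaced manually with plain JDBC (this is not the fix from 
the PR; it only shows where getNextException fits):

{code}
import java.sql.{BatchUpdateException, SQLException, Statement}

// Execute a JDBC batch and, on failure, print every chained SQL error
// before rethrowing, so the root cause is not hidden.
def executeBatchVerbose(statement: Statement): Array[Int] = {
  try {
    statement.executeBatch()
  } catch {
    case e: BatchUpdateException =>
      var next: SQLException = e.getNextException
      while (next != null) {
        println(s"Underlying SQL error: ${next.getMessage}")
        next = next.getNextException
      }
      throw e
  }
}
{code}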



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption

2016-08-23 Thread Michael Allman (JIRA)
Michael Allman created SPARK-17204:
--

 Summary: Spark 2.0 off heap RDD persistence with replication 
factor 2 leads to data corruption
 Key: SPARK-17204
 URL: https://issues.apache.org/jira/browse/SPARK-17204
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Michael Allman


We use the OFF_HEAP storage level extensively. We've tried off-heap storage 
with replication factor 2 and have always received exceptions on the executor 
side very shortly after starting the job. For example:

{code}
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

or

{code}
java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at 
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at 

[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption

2016-08-23 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17204:
---
Description: 
We use the OFF_HEAP storage level extensively. We've tried off-heap storage 
with replication factor 2 and have always received exceptions on the executor 
side very shortly after starting the job. For example:

{code}
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

or

{code}
java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at 
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 

[jira] [Assigned] (SPARK-17203) data source options should always be case insensitive

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17203:


Assignee: Wenchen Fan  (was: Apache Spark)

> data source options should always be case insensitive
> -
>
> Key: SPARK-17203
> URL: https://issues.apache.org/jira/browse/SPARK-17203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17203) data source options should always be case insensitive

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17203:


Assignee: Apache Spark  (was: Wenchen Fan)

> data source options should always be case insensitive
> -
>
> Key: SPARK-17203
> URL: https://issues.apache.org/jira/browse/SPARK-17203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17203) data source options should always be case insensitive

2016-08-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433005#comment-15433005
 ] 

Apache Spark commented on SPARK-17203:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/14773
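
For illustration only (the option name and path below are made-up examples, not 
taken from the ticket or the PR): the goal is that option keys differing only 
in case behave identically, e.g.

{code}
// Assuming a SparkSession named `spark`, as in spark-shell.
// Both reads should be equivalent once options are case-insensitive.
val a = spark.read.format("csv").option("header", "true").load("/tmp/people.csv")
val b = spark.read.format("csv").option("HEADER", "true").load("/tmp/people.csv")
{code}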

> data source options should always be case insensitive
> -
>
> Key: SPARK-17203
> URL: https://issues.apache.org/jira/browse/SPARK-17203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17202) "Pipeline guide" link is broken in MLlib Guide main page

2016-08-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17202.
---
Resolution: Duplicate

Yep, though this was already fixed in master. The next release docs would 
contain the fix.

> "Pipeline guide" link is broken in MLlib Guide main page
> 
>
> Key: SPARK-17202
> URL: https://issues.apache.org/jira/browse/SPARK-17202
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, MLlib
>Affects Versions: 2.0.0
>Reporter: Vitalii Kotliarenko
>Priority: Trivial
>
> Steps to reproduce:
> 1) Check http://spark.apache.org/docs/latest/ml-guide.html 
> 2) Link in sentence "See the Pipelines guide for details" is broken, it 
> points to https://spark.apache.org/docs/latest/ml-pipeline.md
> Expected result: "Pipeline guide" link should point to 
> https://spark.apache.org/docs/latest/ml-pipeline.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17203) data source options should always be case insensitive

2016-08-23 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-17203:
---

 Summary: data source options should always be case insensitive
 Key: SPARK-17203
 URL: https://issues.apache.org/jira/browse/SPARK-17203
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17202) "Pipeline guide" link is broken in MLlib Guide main page

2016-08-23 Thread Vitalii Kotliarenko (JIRA)
Vitalii Kotliarenko created SPARK-17202:
---

 Summary: "Pipeline guide" link is broken in MLlib Guide main page
 Key: SPARK-17202
 URL: https://issues.apache.org/jira/browse/SPARK-17202
 Project: Spark
  Issue Type: Bug
  Components: Documentation, MLlib
Affects Versions: 2.0.0
Reporter: Vitalii Kotliarenko
Priority: Trivial


Steps to reproduce:
1) Check http://spark.apache.org/docs/latest/ml-guide.html 
2) Link in sentence "See the Pipelines guide for details" is broken, it points 
to https://spark.apache.org/docs/latest/ml-pipeline.md

Expected result: "Pipeline guide" link should point to 
https://spark.apache.org/docs/latest/ml-pipeline.html






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17201) Investigate numerical instability for MLOR without regularization

2016-08-23 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17201:


 Summary: Investigate numerical instability for MLOR without 
regularization
 Key: SPARK-17201
 URL: https://issues.apache.org/jira/browse/SPARK-17201
 Project: Spark
  Issue Type: Sub-task
Reporter: Seth Hendrickson


As mentioned 
[here|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression], when no 
regularization is applied in Softmax regression, second order Newton solvers 
may run into numerical instability problems. We should investigate this in 
practice and find a solution, possibly by implementing pivoting when no 
regularization is applied.
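
For context, the usual explanation (a standard property of softmax, summarized 
here rather than quoted from the linked page): without regularization the 
softmax parameterization is not identifiable, because shifting every 
coefficient vector by the same amount leaves the probabilities unchanged:

{noformat}
P(y = k | x) = exp(w_k^T x) / sum_j exp(w_j^T x)
             = exp((w_k - c)^T x) / sum_j exp((w_j - c)^T x)   for any vector c
{noformat}

So the optimum is a whole subspace and the Hessian is singular along that 
direction, which is what trips up second-order solvers; pivoting (or pinning 
one class's coefficients to zero) removes the redundancy, and an L2 penalty 
does the same implicitly.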



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2016-08-23 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432883#comment-15432883
 ] 

Imran Rashid commented on SPARK-14560:
--

One minor clarification -- SPARK-4452 is not included in Spark 1.6, but Sean 
was running a version with that fix backported.  So if you see this problem in 
Spark 1.6, consider backporting SPARK-4452 first and then try switching to 
Java serialization.
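
For concreteness, this is roughly how switching the serializer, plus the 
force-spill workaround from the description below, would look in 
spark-defaults.conf (the threshold value is purely illustrative, and the 
property is undocumented with no compatibility guarantees, as the description 
warns):

{noformat}
spark.serializer                                    org.apache.spark.serializer.JavaSerializer
spark.shuffle.spill.numElementsForceSpillThreshold  5000000
{noformat}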

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}} s used by the old RDD api still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  Eg., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all available 15 MB.  Since the 15 
> MB of memory is sufficient, it will not spill, and will continue holding on 
> to all available memory.  But this leaves *no* memory available for the 
> shuffle-write side.  Since the shuffle-write side cannot request the 
> shuffle-read side to free up memory, this leads to an OOM.
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as 
> well, so RDDs can benefit from the cooperative memory management introduced 
> by SPARK-10342.
> Note that an additional improvement would be for the shuffle-read side to 
> simple release unused memory, without spilling, in case that would leave 
> enough memory, and only spill if that was inadequate.  However that can come 
> as a later improvement.
> *Workaround*:  You can set 
> {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to 
> occur every {{N}} elements, thus preventing the shuffle-read side from ever 
> grabbing all of the available memory.  However, this requires careful tuning 
> of {{N}} to specific workloads: too big, and you will still get an OOM; too 
> small, and there will be so much spilling that performance will suffer 
> drastically.  Furthermore, this workaround uses an *undocumented* 
> configuration with *no compatibility guarantees* for future versions of 

[jira] [Commented] (SPARK-17126) Errors setting driver classpath in spark-defaults.conf on Windows 7

2016-08-23 Thread Ozioma Ihekwoaba (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432767#comment-15432767
 ] 

Ozioma Ihekwoaba commented on SPARK-17126:
--

It works; all my jars get listed in the Spark Web UI.
I think from Java 6 upwards you can use the wildcard option to specify all the 
jars in a classpath directory.
I get your point; my scenario was a spark-shell tutorial session for Spark SQL 
using a custom Hive instance.
I needed a way to add the MySQL connector jar to the classpath for the Hive 
metastore, and also other jars like the Spark CSV jar.
It works like a charm on Linux, but failed repeatedly on Windows.
Just curious: do you know of any company running production Spark clusters on 
Windows?
Because it appears Spark is not built for Windows, and all the examples assume 
a Linux setting.
Thing is, lots of up-and-coming young devs are totally flummoxed by the Linux 
command line, and since they use Windows by default, Windows should be 
supported at a minimum... as a dev platform.
You know, like the sbin folder scripts all being bash scripts.

OK, that was a subtle rant; maybe I should adapt the scripts myself to run on 
Windows.
Thanks for the awesome work!

> Errors setting driver classpath in spark-defaults.conf on Windows 7
> ---
>
> Key: SPARK-17126
> URL: https://issues.apache.org/jira/browse/SPARK-17126
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.1
> Environment: Windows 7
>Reporter: Ozioma Ihekwoaba
>
> I am having issues starting up Spark shell with a local hive-site.xml on 
> Windows 7.
> I have a local Hive 2.1.0 instance on Windows using a MySQL metastore.
> The Hive instance is working fine.
> I copied over the hive-site.xml to my local instance of Spark 1.6.1 conf 
> folder and also copied over mysql-connector-java-5.1.25-bin.jar to the lib 
> folder.
> I was expecting Spark to pick up jar files in the lib folder automatically, 
> but found out Spark expects a spark.driver.extraClassPath and 
> spark.executor.extraClassPath settings to resolve jars.
> Thing is this has failed on Windows for me with a 
> DataStoreDriverNotFoundException saying com.mysql.jdbc.Driver could not be 
> found.
> Here are some of the different file paths I've tried:
> C:/hadoop/spark/v161/lib/mysql-connector-java-5.1.25-bin.jar;C:/hadoop/spark/v161/lib/commons-csv-1.4.jar;C:/hadoop/spark/v161/lib/spark-csv_2.11-1.4.0.jar
> ".;C:\hadoop\spark\v161\lib\*"
> NONE has worked so far.
> Please, what is the correct way to set driver classpaths on Windows?
> Also, what is the correct file path format on Windows?
> I have it working fine on Linux but my current engagement requires me to run 
> Spark on a Windows box.
> Is there a way for Spark to automatically resolve jars from the lib folder in 
> all modes?
> Thanks.
> Ozzy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17126) Errors setting driver classpath in spark-defaults.conf on Windows 7

2016-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432715#comment-15432715
 ] 

Sean Owen commented on SPARK-17126:
---

Hm, I am not sure that "*" works on any JVM. Maybe I'm missing a reason it 
works for the env variable. But you would in general not specify it this way, 
which could be the problem. You would also not in general set app jar 
dependencies this way, but rather build them into your app.

> Errors setting driver classpath in spark-defaults.conf on Windows 7
> ---
>
> Key: SPARK-17126
> URL: https://issues.apache.org/jira/browse/SPARK-17126
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.1
> Environment: Windows 7
>Reporter: Ozioma Ihekwoaba
>
> I am having issues starting up Spark shell with a local hive-site.xml on 
> Windows 7.
> I have a local Hive 2.1.0 instance on Windows using a MySQL metastore.
> The Hive instance is working fine.
> I copied over the hive-site.xml to my local instance of Spark 1.6.1 conf 
> folder and also copied over mysql-connector-java-5.1.25-bin.jar to the lib 
> folder.
> I was expecting Spark to pick up jar files in the lib folder automatically, 
> but found out Spark expects a spark.driver.extraClassPath and 
> spark.executor.extraClassPath settings to resolve jars.
> Thing is this has failed on Windows for me with a 
> DataStoreDriverNotFoundException saying com.mysql.jdbc.Driver could not be 
> found.
> Here are some of the different file paths I've tried:
> C:/hadoop/spark/v161/lib/mysql-connector-java-5.1.25-bin.jar;C:/hadoop/spark/v161/lib/commons-csv-1.4.jar;C:/hadoop/spark/v161/lib/spark-csv_2.11-1.4.0.jar
> ".;C:\hadoop\spark\v161\lib\*"
> NONE has worked so far.
> Please, what is the correct way to set driver classpaths on Windows?
> Also, what is the correct file path format on Windows?
> I have it working fine on Linux but my current engagement requires me to run 
> Spark on a Windows box.
> Is there a way for Spark to automatically resolve jars from the lib folder in 
> all modes?
> Thanks.
> Ozzy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17126) Errors setting driver classpath in spark-defaults.conf on Windows 7

2016-08-23 Thread Ozioma Ihekwoaba (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432707#comment-15432707
 ] 

Ozioma Ihekwoaba edited comment on SPARK-17126 at 8/23/16 12:32 PM:


Hi Sean,
Thanks for the update.
What I meant was the driver path and executor path entries make it to the Web 
UI.
Meaning the values I set for the driver classpath and executor classpath are 
read by Spark during startup.
However, the jars I specified in the 2 paths are not on the classpath entries 
list in the Web UI.
They are also not loaded by Spark during startup.
For example, the Spark CSV jar and other associated jars are not loaded.
On Linux, the driver path jars and executor path jars are successfully added to 
the Spark classpath,
IN ADDITION to being listed in the Spark Web UI environment tab.
On Windows, the jars in the folder do not get listed in the Spark Web UI.

I finally found a solution to this on Windows: I simply set SPARK_CLASSPATH. 
That was it.
In summary, this worked when set in spark-env.cmd:
set SPARK_CLASSPATH=C://hadoop//spark//v162//lib//*

But none of these worked when set in spark-defaults.conf:
spark.driver.extraClassPath  C:\\hadoop\\spark\\v162\\lib\\*
spark.driver.extraClassPath  C://hadoop//spark//v162//lib//*
spark.driver.extraClassPath  C:\\hadoop\\spark\\v162\\lib\\mysql-connector-java-5.1.25-bin.jar;
spark.driver.extraClassPath  C:\\hadoop\\spark\\v162\\lib\\mysql-connector-java-5.1.25-bin.jar
spark.driver.extraClassPath  file:/C:/hadoop/spark/v162/lib/*jar;
spark.driver.extraClassPath  file:///C:/hadoop/spark/v162/lib/mysql-connector-java-5.1.25-bin.jar;

What I needed was a way to add all necessary jars to the classpath during 
startup; I found the command-line syntax for adding packages and driver jars 
too cumbersome.

Still wondering why just dropping jars in the lib folder (pre 2.0 versions) 
does not suffice as a default
folder to resolve jars.

Thanks,
Ozzy


was (Author: ozioma):
Hi Sean,
Thanks for the update.
What I meant was the driver path and executor path entries make it to the Web 
UI.
Meaning the values I set for the driver classpath and executor classpath are 
read by Spark during startup.
However, the jars I specified in the 2 paths are not on the classpath entries 
list in the Web UI.
They are also not loaded by Spark during startup.
For example, the Spark CSV jar and other associated jars are not loaded.
On Linux, the driver path jars and executor path jars are successfully added to 
the Spark classpath,
IN ADDITION to being listed in the Spark Web UI environment tab.
On Windows, the jars in the folder do not get listed in the Spark Web UI.

I finally found a solution to this on Windows, I simply set SPARK_CLASSPATH. 
That was it.
In summary, this worked when set in spark-env.cmd:
set SPARK_CLASSPATH=C://hadoop//spark//v162//lib//*

But none of these did not work when set in spark-defaults.conf:
spark.driver.extraClassPath  C:\\hadoop\\spark\\v162\\lib\\*
spark.driver.extraClassPath  C://hadoop//spark//v162//lib//*
spark.driver.extraClassPath  C:\\hadoop\\spark\\v162\\lib\\mysql-connector-java-5.1.25-bin.jar;
spark.driver.extraClassPath  C:\\hadoop\\spark\\v162\\lib\\mysql-connector-java-5.1.25-bin.jar
spark.driver.extraClassPath  file:/C:/hadoop/spark/v162/lib/*jar;
spark.driver.extraClassPath  file:///C:/hadoop/spark/v162/lib/mysql-connector-java-5.1.25-bin.jar;

What I needed was add all necessary jars to the classpath during startup, I 
found the commandline syntax for 
adding packages and driver jars too cumbersome.

Still wondering why just dropping jars in the lib folder (pre 2.0 versions) 
does not suffice as a default
folder to resolve jars.

Thanks,
Ozzy

> Errors setting driver classpath in spark-defaults.conf on Windows 7
> ---
>
> Key: SPARK-17126
> URL: https://issues.apache.org/jira/browse/SPARK-17126
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.1
> Environment: Windows 7
>Reporter: Ozioma Ihekwoaba
>
> I am having issues starting up Spark shell with a local hive-site.xml on 
> Windows 7.
> I have a local Hive 2.1.0 instance on Windows using a MySQL metastore.
> The Hive instance is working fine.
> I copied over the hive-site.xml to my local instance of Spark 1.6.1 conf 
> folder and also copied over mysql-connector-java-5.1.25-bin.jar to the lib 
> folder.
> I was expecting Spark to pick up jar files in the lib folder automatically, 
> but found out Spark expects a spark.driver.extraClassPath and 
> spark.executor.extraClassPath settings to resolve jars.
> Thing is this has failed on Windows for me with a 
> DataStoreDriverNotFoundException saying com.mysql.jdbc.Driver could not be 
> found.
> Here are some 

[jira] [Commented] (SPARK-17126) Errors setting driver classpath in spark-defaults.conf on Windows 7

2016-08-23 Thread Ozioma Ihekwoaba (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432707#comment-15432707
 ] 

Ozioma Ihekwoaba commented on SPARK-17126:
--

Hi Sean,
Thanks for the update.
What I meant was the driver path and executor path entries make it to the Web 
UI.
Meaning the values I set for the driver classpath and executor classpath are 
read by Spark during startup.
However, the jars I specified in the 2 paths are not on the classpath entries 
list in the Web UI.
They are also not loaded by Spark during startup.
For example, the Spark CSV jar and other associated jars are not loaded.
On Linux, the driver path jars and executor path jars are successfully added to 
the Spark classpath,
IN ADDITION to being listed in the Spark Web UI environment tab.
On Windows, the jars in the folder do not get listed in the Spark Web UI.

I finally found a solution to this on Windows: I simply set SPARK_CLASSPATH. 
That was it.
In summary, this worked when set in spark-env.cmd:
set SPARK_CLASSPATH=C://hadoop//spark//v162//lib//*

But none of these worked when set in spark-defaults.conf:
spark.driver.extraClassPath  C:\\hadoop\\spark\\v162\\lib\\*
spark.driver.extraClassPath  C://hadoop//spark//v162//lib//*
spark.driver.extraClassPath  C:\\hadoop\\spark\\v162\\lib\\mysql-connector-java-5.1.25-bin.jar;
spark.driver.extraClassPath  C:\\hadoop\\spark\\v162\\lib\\mysql-connector-java-5.1.25-bin.jar
spark.driver.extraClassPath  file:/C:/hadoop/spark/v162/lib/*jar;
spark.driver.extraClassPath  file:///C:/hadoop/spark/v162/lib/mysql-connector-java-5.1.25-bin.jar;

What I needed was a way to add all necessary jars to the classpath during 
startup; I found the command-line syntax for adding packages and driver jars 
too cumbersome.

Still wondering why just dropping jars in the lib folder (pre 2.0 versions) 
does not suffice as a default
folder to resolve jars.

Thanks,
Ozzy

> Errors setting driver classpath in spark-defaults.conf on Windows 7
> ---
>
> Key: SPARK-17126
> URL: https://issues.apache.org/jira/browse/SPARK-17126
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.1
> Environment: Windows 7
>Reporter: Ozioma Ihekwoaba
>
> I am having issues starting up Spark shell with a local hive-site.xml on 
> Windows 7.
> I have a local Hive 2.1.0 instance on Windows using a MySQL metastore.
> The Hive instance is working fine.
> I copied over the hive-site.xml to my local instance of Spark 1.6.1 conf 
> folder and also copied over mysql-connector-java-5.1.25-bin.jar to the lib 
> folder.
> I was expecting Spark to pick up jar files in the lib folder automatically, 
> but found out Spark expects a spark.driver.extraClassPath and 
> spark.executor.extraClassPath settings to resolve jars.
> Thing is this has failed on Windows for me with a 
> DataStoreDriverNotFoundException saying com.mysql.jdbc.Driver could not be 
> found.
> Here are some of the different file paths I've tried:
> C:/hadoop/spark/v161/lib/mysql-connector-java-5.1.25-bin.jar;C:/hadoop/spark/v161/lib/commons-csv-1.4.jar;C:/hadoop/spark/v161/lib/spark-csv_2.11-1.4.0.jar
> ".;C:\hadoop\spark\v161\lib\*"
> NONE has worked so far.
> Please, what is the correct way to set driver classpaths on Windows?
> Also, what is the correct file path format on Windows?
> I have it working fine on Linux but my current engagement requires me to run 
> Spark on a Windows box.
> Is there a way for Spark to automatically resolve jars from the lib folder in 
> all modes?
> Thanks.
> Ozzy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17126) Errors setting driver classpath in spark-defaults.conf on Windows 7

2016-08-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17126.
---
Resolution: Not A Problem

You said, "Yes the entries make it to the classpath, I checked up on the web 
UI." So I said I understood that you have successfully set the classpath to the 
desired value. That much is clear right?

One other thing I forgot to note is that Java does not support directories of 
JARs on the classpath. That could be the problem.

I was thinking that your problem is referring to local files on remote 
machines, but I'm not sure that's the issue. Normally, you build an "uber JAR" 
with your app and all dependencies and you do not set the classpath like this. 
That is another possible solution.

At the moment this looks like a question about using Spark, rather than a 
problem. That should take place on user@ really.
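
For the record, a sketch of the more conventional route on Windows, using the 
standard launcher flags rather than classpath properties (the paths are the 
ones from the report, used here only as placeholders; note that {{--jars}} 
takes a comma-separated list):

{noformat}
bin\spark-shell ^
  --driver-class-path C:\hadoop\spark\v161\lib\mysql-connector-java-5.1.25-bin.jar ^
  --jars C:\hadoop\spark\v161\lib\mysql-connector-java-5.1.25-bin.jar,C:\hadoop\spark\v161\lib\spark-csv_2.11-1.4.0.jar
{noformat}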

> Errors setting driver classpath in spark-defaults.conf on Windows 7
> ---
>
> Key: SPARK-17126
> URL: https://issues.apache.org/jira/browse/SPARK-17126
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.1
> Environment: Windows 7
>Reporter: Ozioma Ihekwoaba
>
> I am having issues starting up Spark shell with a local hive-site.xml on 
> Windows 7.
> I have a local Hive 2.1.0 instance on Windows using a MySQL metastore.
> The Hive instance is working fine.
> I copied over the hive-site.xml to my local instance of Spark 1.6.1 conf 
> folder and also copied over mysql-connector-java-5.1.25-bin.jar to the lib 
> folder.
> I was expecting Spark to pick up jar files in the lib folder automatically, 
> but found out Spark expects a spark.driver.extraClassPath and 
> spark.executor.extraClassPath settings to resolve jars.
> Thing is this has failed on Windows for me with a 
> DataStoreDriverNotFoundException saying com.mysql.jdbc.Driver could not be 
> found.
> Here are some of the different file paths I've tried:
> C:/hadoop/spark/v161/lib/mysql-connector-java-5.1.25-bin.jar;C:/hadoop/spark/v161/lib/commons-csv-1.4.jar;C:/hadoop/spark/v161/lib/spark-csv_2.11-1.4.0.jar
> ".;C:\hadoop\spark\v161\lib\*"
> NONE has worked so far.
> Please, what is the correct way to set driver classpaths on Windows?
> Also, what is the correct file path format on Windows?
> I have it working fine on Linux but my current engagement requires me to run 
> Spark on a Windows box.
> Is there a way for Spark to automatically resolve jars from the lib folder in 
> all modes?
> Thanks.
> Ozzy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17095) Latex and Scala doc do not play nicely

2016-08-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17095:
--
Assignee: Jagadeesan A S
Priority: Trivial  (was: Minor)

The net change here was slightly different: to fix up a few instances where 
LaTeX was being rendered as code but not cases involving "}}}"

> Latex and Scala doc do not play nicely
> --
>
> Key: SPARK-17095
> URL: https://issues.apache.org/jira/browse/SPARK-17095
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Seth Hendrickson
>Assignee: Jagadeesan A S
>Priority: Trivial
>  Labels: starter
> Fix For: 2.1.0
>
>
> In Latex, it is common to find "}}}" when closing several expressions at 
> once. [SPARK-16822|https://issues.apache.org/jira/browse/SPARK-16822] added 
> Mathjax to render Latex equations in scaladoc. However, when scala doc sees 
> "}}}" or "{{{" it treats it as a special character for code block. This 
> results in some very strange output.
> A poor workaround is to use "}}\,}" in latex which inserts a small 
> whitespace. This is not ideal, and we can hopefully find a better solution. 
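
A contrived illustration of the collision (not taken from the patch): scaladoc 
treats a literal run of three closing braces as the end of a {{{ ... }}} wiki 
code block, so LaTeX with deeply nested braces breaks the rendering.

{noformat}
/**
 * The gradient uses
 *   $\frac{a}{\exp{\frac{b}{c}}}$
 * The expression above ends in "}}}", which scaladoc interprets as
 * closing a {{{ ... }}} code block rather than as LaTeX braces.
 */
{noformat}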



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17095) Latex and Scala doc do not play nicely

2016-08-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17095.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14688
[https://github.com/apache/spark/pull/14688]

> Latex and Scala doc do not play nicely
> --
>
> Key: SPARK-17095
> URL: https://issues.apache.org/jira/browse/SPARK-17095
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Seth Hendrickson
>Priority: Minor
>  Labels: starter
> Fix For: 2.1.0
>
>
> In Latex, it is common to find "}}}" when closing several expressions at 
> once. [SPARK-16822|https://issues.apache.org/jira/browse/SPARK-16822] added 
> Mathjax to render Latex equations in scaladoc. However, when scala doc sees 
> "}}}" or "{{{" it treats it as a special character for code block. This 
> results in some very strange output.
> A poor workaround is to use "}}\,}" in latex which inserts a small 
> whitespace. This is not ideal, and we can hopefully find a better solution. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17199) Use CatalystConf.resolver for case-sensitivity comparison

2016-08-23 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17199.
---
   Resolution: Fixed
 Assignee: Jacek Laskowski
Fix Version/s: 2.1.0

> Use CatalystConf.resolver for case-sensitivity comparison
> -
>
> Key: SPARK-17199
> URL: https://issues.apache.org/jira/browse/SPARK-17199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Laskowski
>Priority: Trivial
> Fix For: 2.1.0
>
>
> {{CatalystConf.resolver}} does the branching per {{caseSensitiveAnalysis}}. 
> There's no need to repeat the code across the codebase.
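
A minimal sketch of the idea, for readers outside the analyzer code (names are 
written out here for clarity and may not match the internals exactly):

{code}
// A Resolver decides whether two attribute names refer to the same thing.
type Resolver = (String, String) => Boolean

val caseSensitiveResolution: Resolver = (a, b) => a == b
val caseInsensitiveResolution: Resolver = (a, b) => a.equalsIgnoreCase(b)

// The single branching point; callers just ask the conf for `resolver`
// instead of re-checking caseSensitiveAnalysis everywhere.
def resolver(caseSensitiveAnalysis: Boolean): Resolver =
  if (caseSensitiveAnalysis) caseSensitiveResolution else caseInsensitiveResolution
{code}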



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator

2016-08-23 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432463#comment-15432463
 ] 

Vincent edited comment on SPARK-17055 at 8/23/16 10:34 AM:
---

Sorry for the late reply. Yes, I just learned they intend to rename it to 
GroupKFold, or something like that. Personally I think it's fine to keep it 
the way it is, though it could still be a bit confusing when someone first 
uses it before understanding the idea behind it.

As for applications, take face recognition as an example: features are, say, 
eyes, nose, lips, etc., and training data are collected from a number of 
different people. This method can create subject-independent folds, so we can 
train the model on features from a certain group of people and keep the data 
from the remaining people for validation. It improves the generalization 
ability of the model and avoids over-fitting.

It's a useful method, available in sklearn, and caret is currently working on 
adding this feature as well.


was (Author: vincexie):
sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, 
or something like that. Though personally I think it's fine to keep the way it 
is, though, it could be still kinda confusing when someone first uses it before 
understanding the idea behinds it.

As for application, take face recognition as an example. features are, say, 
eyes, nose, lips etc. training data are obtained from a number of different 
person, this method can create subject independent folds, so we can train the 
model with features from certain group of people and take the data from the 
rest of group of people for validation. it will enhance the generic ability of 
the model and avoid over-fitting.

it's a useful method, seen in sklearn, and currently caret is on the way trying 
to add this feature.

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream packages like Sklearn already supports such cross validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)
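
A rough sketch of group-wise folds on a DataFrame, assuming Spark 2.0 (this is 
not Spark's CrossValidator API; column names are illustrative): every row of a 
given group lands in exactly one fold, so no group is split between training 
and validation.

{code}
import org.apache.spark.sql.{DataFrame, functions => F}

def groupKFold(df: DataFrame, groupCol: String, k: Int): Seq[(DataFrame, DataFrame)] = {
  // Hash each group id into one of k folds; all rows of a group share a fold.
  val withFold = df.withColumn("fold", F.pmod(F.hash(F.col(groupCol)), F.lit(k)))
  (0 until k).map { i =>
    val training   = withFold.filter(F.col("fold") =!= i).drop("fold")
    val validation = withFold.filter(F.col("fold") === i).drop("fold")
    (training, validation)
  }
}
{code}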



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17200) Automate building and testing on Windows

2016-08-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432521#comment-15432521
 ] 

Hyukjin Kwon commented on SPARK-17200:
--

Maybe I will try to do this; I guess it will involve a lot of messing around 
with flaky builds.
I have tried this combination before, but the combination itself is really 
flaky. So I am thinking this would not be merged into the codebase, but I will 
probably use it at least for SparkR (if I get it working).

I will try this anyway, but please leave any thoughts if you have a better 
idea.


> Automate building and testing on Windows 
> -
>
> Key: SPARK-17200
> URL: https://issues.apache.org/jira/browse/SPARK-17200
> Project: Spark
>  Issue Type: Test
>  Components: Build, Project Infra
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> It seems there is no automated tests on Windows (I am not sure this is being 
> done manually before each release).
> Assuming from this comment, 
> https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems 
> we have Windows infrastructure in the AMPLab Jenkins cluster.
> It seems pretty much important because as far as I know we should manually 
> test and verify some patches related with Windows-specific problem.
> For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794
> I was thinking a combination with Travis CI and Docker with Windows image. 
> Although this might not be merged, I will try to give a shot with this (at 
> least for SparkR) anyway (just to verify some PRs I just linked above).
> I would appreciate it if I can hear any thoughts about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17200) Automate building and testing on Windows

2016-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432511#comment-15432511
 ] 

Sean Owen commented on SPARK-17200:
---

Without a doubt that would be beneficial. The question is how much work it 
takes to set up and maintain, and that I don't know. 

While I think the goal is to make Spark run on Windows if possible, I'm not 
sure how well the dev setup / tests support Windows.

> Automate building and testing on Windows 
> -
>
> Key: SPARK-17200
> URL: https://issues.apache.org/jira/browse/SPARK-17200
> Project: Spark
>  Issue Type: Test
>  Components: Build, Project Infra
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> It seems there is no automated tests on Windows (I am not sure this is being 
> done manually before each release).
> Assuming from this comment, 
> https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems 
> we have Windows infrastructure in the AMPLab Jenkins cluster.
> It seems pretty much important because as far as I know we should manually 
> test and verify some patches related with Windows-specific problem.
> For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794
> I was thinking a combination with Travis CI and Docker with Windows image. 
> Although this might not be merged, I will try to give a shot with this (at 
> least for SparkR) anyway (just to verify some PRs I just linked above).
> I would appreciate it if I can hear any thoughts about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17200) Automate building and testing on Windows

2016-08-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432508#comment-15432508
 ] 

Hyukjin Kwon commented on SPARK-17200:
--

cc [~felixcheung] and [~shivaram] who might be interested in this ticket (from 
the linked PR above).

> Automate building and testing on Windows 
> -
>
> Key: SPARK-17200
> URL: https://issues.apache.org/jira/browse/SPARK-17200
> Project: Spark
>  Issue Type: Test
>  Components: Build, Project Infra
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> It seems there is no automated tests on Windows (I am not sure this is being 
> done manually before each release).
> Assuming from this comment, 
> https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems 
> we have Windows infrastructure in the AMPLab Jenkins cluster.
> It seems pretty much important because as far as I know we should manually 
> test and verify some patches related with Windows-specific problem.
> For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794
> I was thinking a combination with Travis CI and Docker with Windows image. 
> Although this might not be merged, I will try to give a shot with this (at 
> least for SparkR) anyway (just to verify some PRs I just linked above).
> I would appreciate it if I can hear any thoughts about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17200) Automate building and testing on Windows

2016-08-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432504#comment-15432504
 ] 

Hyukjin Kwon commented on SPARK-17200:
--

Could I please ask your opinion [~srowen]? I know you are an expert in this 
area.

> Automate building and testing on Windows 
> -
>
> Key: SPARK-17200
> URL: https://issues.apache.org/jira/browse/SPARK-17200
> Project: Spark
>  Issue Type: Test
>  Components: Build, Project Infra
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> It seems there is no automated tests on Windows (I am not sure this is being 
> done manually before each release).
> Assuming from this comment, 
> https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems 
> we have Windows infrastructure in the AMPLab Jenkins cluster.
> It seems pretty much important because as far as I know we should manually 
> test and verify some patches related with Windows-specific problem.
> For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794
> I was thinking a combination with Travis CI and Docker with Windows image. 
> Although this might not be merged, I will try to give a shot with this (at 
> least for SparkR) anyway (just to verify some PRs I just linked above).
> I would appreciate it if I can hear any thoughts about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss

2016-08-23 Thread Amit Baghel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432503#comment-15432503
 ] 

Amit Baghel commented on SPARK-17174:
-

Please go ahead and submit a PR. Thanks.

> Provide support for Timestamp type Column in add_months function to return 
> HH:mm:ss
> ---
>
> Key: SPARK-17174
> URL: https://issues.apache.org/jira/browse/SPARK-17174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Amit Baghel
>Priority: Minor
>
> add_months function currently supports Date types. If Column is Timestamp 
> type then it adds month to date but it doesn't return timestamp part 
> (HH:mm:ss). See the code below.
> {code}
> import java.util.Calendar
> val now = Calendar.getInstance().getTime()
> val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new 
> java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS")
> df.withColumn("NewDateWithTS", add_months(df("DateWithTS"),1)).show
> {code}
> Above code gives following response. See the HH:mm:ss is missing from 
> NewDateWithTS column.
> {code}
> +---++-+
> | ID|  DateWithTS|NewDateWithTS|
> +---++-+
> |  0|2016-01-21 09:38:...|   2016-02-21|
> |  1|2016-02-21 09:38:...|   2016-03-21|
> |  2|2016-03-21 09:38:...|   2016-04-21|
> |  3|2016-04-21 09:38:...|   2016-05-21|
> +---++-+
> {code}
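
For illustration, a minimal workaround sketch (reusing the {{df}} built above, 
and assuming the interval arithmetic supported by the Spark 2.0 SQL parser): 
adding a calendar interval keeps the time component, whereas {{add_months}} 
returns a {{DateType}}.

{code}
import org.apache.spark.sql.functions.expr

// Workaround sketch (not the proposed API change): timestamp + interval keeps HH:mm:ss
val withTs = df.withColumn("NewDateWithTS", expr("DateWithTS + interval 1 month"))
withTs.show(false)
{code}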



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17200) Automate building and testing on Windows

2016-08-23 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-17200:
-
Description: 
It seems there are no automated tests on Windows (I am not sure whether this is 
done manually before each release).

Judging from this comment, 
https://github.com/apache/spark/pull/14743#issuecomment-241473794, it seems we 
have Windows infrastructure in the AMPLab Jenkins cluster.

This seems quite important because, as far as I know, we currently have to test 
and verify patches related to Windows-specific problems manually.

For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794

I was thinking of a combination of Travis CI and Docker with a Windows image. 
Although this might not be merged, I will give it a shot (at least for SparkR) 
anyway (just to verify some PRs like the one linked above).

I would appreciate any thoughts about this.



  was:
It seems there are no automated tests on Windows (I am not sure whether this is 
done manually before each release).

Judging from this comment, 
https://github.com/apache/spark/pull/14743#issuecomment-241473794, it seems we 
have Windows infrastructure in the AMPLab Jenkins cluster.

This seems quite important because, as far as I know, we currently have to test 
and verify patches related to Windows-specific problems manually.

For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794

I was thinking of a combination of Travis CI and Docker with a Windows image. 
Although this might not be merged, I will give it a shot anyway (just to verify 
some PRs like the one linked above).

I would appreciate any thoughts about this.




> Automate building and testing on Windows 
> -
>
> Key: SPARK-17200
> URL: https://issues.apache.org/jira/browse/SPARK-17200
> Project: Spark
>  Issue Type: Test
>  Components: Build, Project Infra
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> It seems there is no automated tests on Windows (I am not sure this is being 
> done manually before each release).
> Assuming from this comment, 
> https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems 
> we have Windows infrastructure in the AMPLab Jenkins cluster.
> It seems pretty much important because as far as I know we should manually 
> test and verify some patches related with Windows-specific problem.
> For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794
> I was thinking a combination with Travis CI and Docker with Windows image. 
> Although this might not be merged, I will try to give a shot with this (at 
> least for SparkR) anyway (just to verify some PRs I just linked above).
> I would appreciate it if I can hear any thoughts about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17200) Automate building and testing on Windows

2016-08-23 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-17200:


 Summary: Automate building and testing on Windows 
 Key: SPARK-17200
 URL: https://issues.apache.org/jira/browse/SPARK-17200
 Project: Spark
  Issue Type: Test
  Components: Build, Project Infra
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon


It seems there are no automated tests on Windows (I am not sure whether this is 
done manually before each release).

Judging from this comment, 
https://github.com/apache/spark/pull/14743#issuecomment-241473794, it seems we 
have Windows infrastructure in the AMPLab Jenkins cluster.

This seems quite important because, as far as I know, we currently have to test 
and verify patches related to Windows-specific problems manually.

For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794

I was thinking of a combination of Travis CI and Docker with a Windows image. 
Although this might not be merged, I will give it a shot anyway (just to verify 
some PRs like the one linked above).

I would appreciate any thoughts about this.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator

2016-08-23 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432463#comment-15432463
 ] 

Vincent edited comment on SPARK-17055 at 8/23/16 9:18 AM:
--

Sorry for the late reply. Yes, I just learned they intend to rename it to 
GroupKFold, or something like that. Personally I think it's fine to keep it the 
way it is, though it can still be confusing when someone first uses it before 
understanding the idea behind it.

As for applications, take face recognition as an example. The features are, 
say, eyes, nose, lips, etc., and the training data are obtained from a number 
of different people. This method can create subject-independent folds, so we 
can train the model with features from one group of people and hold out the 
data from the remaining people for validation. It improves the generalization 
ability of the model and avoids over-fitting.

It's a useful method, available in sklearn, and caret is currently working on 
adding this feature as well.


was (Author: vincexie):
sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, 
or something like that. Though personally I think it's fine to keep the way it 
is, though, it could be still kinda confusing when someone first uses it before 
understanding the idea behinds it.

As for application, take face recognition as an example. features are, say, 
eyes, nose, lips etc. training data are obtained from a number of different 
person, this method can create subject independent folds, so we can train the 
model with features from certain group of people and take the data from the 
rest of group of people for validation. it will enhance the generic ability of 
the model and avoid over-fitting.

it's a useful method, seen in sklearn, and currently caret is on the way add 
this feature.

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream packages like Sklearn already supports such cross validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12072) python dataframe ._jdf.schema().json() breaks on large metadata dataframes

2016-08-23 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432470#comment-15432470
 ] 

holdenk commented on SPARK-12072:
-

My guess is that it is probably not directly related - the OOM there seems to 
be happening inside the JVM, and this issue probably wouldn't be the cause of 
that.

> python dataframe ._jdf.schema().json() breaks on large metadata dataframes
> --
>
> Key: SPARK-12072
> URL: https://issues.apache.org/jira/browse/SPARK-12072
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Rares Mirica
>
> When a dataframe contains a column with a large number of values in ml_attr, 
> schema evaluation will routinely fail on getting the schema as json, this 
> will, in turn, cause a bunch of problems with, eg: calling udfs on the schema 
> because calling columns relies on 
> _parse_datatype_json_string(self._jdf.schema().json())
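
For illustration, a rough Scala-side sketch of the effect (the sizes are made 
up): metadata carrying one entry per distinct value makes the schema's JSON 
representation huge, and that JSON string is exactly what the Py4J call above 
has to transfer.

{code}
import org.apache.spark.sql.types._

// Hypothetical example: a column whose metadata carries 100k string values
val meta = new MetadataBuilder()
  .putStringArray("vals", (0 until 100000).map(i => s"v$i").toArray)
  .build()
val schema = StructType(Seq(StructField("big", DoubleType, nullable = true, meta)))
println(schema.json.length)  // grows roughly linearly with the number of metadata values
{code}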



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9004) Add s3 bytes read/written metrics

2016-08-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432462#comment-15432462
 ] 

Steve Loughran commented on SPARK-9004:
---

If you know the filesystem, you can get summary stats from 
{{FileSystem.getStatistics()}}; they would have to be collected across all the 
executors.

Note that these counters are per-JVM; they are not isolated to individual jobs.
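
A rough sketch of that idea (not a complete aggregation; task placement and 
counter scoping are only illustrative):

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

// Read the per-JVM Hadoop FileSystem counters inside a task. The counters are
// cumulative for that executor JVM, not scoped to a single Spark job.
val sampled = sc.parallelize(Seq(1), 1).mapPartitions { _ =>
  FileSystem.getAllStatistics.asScala.iterator.map { s =>
    (s.getScheme, s.getBytesRead, s.getBytesWritten)
  }
}.collect()
sampled.foreach(println)
{code}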

> Add s3 bytes read/written metrics
> -
>
> Key: SPARK-9004
> URL: https://issues.apache.org/jira/browse/SPARK-9004
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Abhishek Modi
>Priority: Minor
>
> s3 read/write metrics can be pretty useful in finding the total aggregate 
> data processed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-23 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432463#comment-15432463
 ] 

Vincent commented on SPARK-17055:
-

Sorry for the late reply. Yes, I just learned they intend to rename it to 
GroupKFold, or something like that. Personally I think it's fine to keep it the 
way it is, though it can still be confusing when someone first uses it before 
understanding the idea behind it.

As for applications, take face recognition as an example. The features are, 
say, eyes, nose, lips, etc., and the training data are obtained from a number 
of different people. This method can create subject-independent folds, so we 
can train the model with features from one group of people and hold out the 
data from the remaining people for validation. It improves the generalization 
ability of the model and avoids over-fitting.

It's a useful method, available in sklearn, and caret is currently working on 
adding this feature as well.

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream packages like Sklearn already supports such cross validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12072) python dataframe ._jdf.schema().json() breaks on large metadata dataframes

2016-08-23 Thread Ben Teeuwen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432459#comment-15432459
 ] 

Ben Teeuwen commented on SPARK-12072:
-

[~holdenk] we haven't been able to test the patch above (yet). Workarounds have 
been created using non-DataFrame operations. But recently I seem to have hit a 
wall related to the above. Regarding the discussion I've started on the Spark 
'user' mailing list, topic "OOM with StringIndexer, 800m rows & 56m distinct 
value column": is that related to this ticket? Do you think your patch 
addresses it?

> python dataframe ._jdf.schema().json() breaks on large metadata dataframes
> --
>
> Key: SPARK-12072
> URL: https://issues.apache.org/jira/browse/SPARK-12072
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Rares Mirica
>
> When a dataframe contains a column with a large number of values in ml_attr, 
> schema evaluation will routinely fail on getting the schema as json, this 
> will, in turn, cause a bunch of problems with, eg: calling udfs on the schema 
> because calling columns relies on 
> _parse_datatype_json_string(self._jdf.schema().json())



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10297) When save data to a data source table, we should bound the size of a saved file

2016-08-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432435#comment-15432435
 ] 

Steve Loughran commented on SPARK-10297:


FWIW, S3 and the S3A connector don't have that size limit any more.

> When save data to a data source table, we should bound the size of a saved 
> file
> ---
>
> Key: SPARK-10297
> URL: https://issues.apache.org/jira/browse/SPARK-10297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> When we save a table to a data source table, it is possible that a writer is 
> responsible to write out a larger number of rows, which can make the 
> generated file very large and cause job failed if the underlying storage 
> system has a limit of max file size (e.g. S3's limit is 5GB). We should bound 
> the size of a file generated by a writer and create new writers for the same 
> partition if necessary. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15965) No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1

2016-08-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432432#comment-15432432
 ] 

Steve Loughran commented on SPARK-15965:


This is being fixed, with tests, in my work on SPARK-7481; the manual 
workaround is:

Spark 2: 

# Get the same Hadoop version that your Spark version is built against
# Add hadoop-aws and everything matching amazon-*.jar into the jars subdirectory

Spark 1.6+:

This needs my patch and a rebuild of the Spark assembly. However, once that 
patch is in, trying to use the assembly without the AWS JARs will stop Spark 
from starting, unless you move up to Hadoop 2.7.3.
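
For reference, a minimal sketch assuming the matching hadoop-aws and 
aws-java-sdk JARs are already on the driver and executor classpath as described 
above; the property names are the Hadoop 2.7.x s3a ones, and the bucket/path 
are placeholders.

{code}
// Credentials are read from the environment here only for illustration
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
val lines = sc.textFile("s3a://some-bucket/access/2016/02/sample.log.gz")
println(lines.count())
{code}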

> No FileSystem for scheme: s3n or s3a  spark-2.0.0 and spark-1.6.1
> -
>
> Key: SPARK-15965
> URL: https://issues.apache.org/jira/browse/SPARK-15965
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.6.1
> Environment: Debian GNU/Linux 8
> java version "1.7.0_79"
>Reporter: thauvin damien
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> The spark programming-guide explain that Spark can create distributed 
> datasets on Amazon S3 . 
> But since the pre-buid "Hadoop 2.6" the S3 access doesn't work with s3n or 
> s3a. 
> sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", "XXXZZZHHH")
> sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", 
> "xxx")
> val 
> lines=sc.textFile("s3a://poc-XXX/access/2016/02/20160201202001_xxx.log.gz")
> java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
> org.apache.hadoop.fs.s3a.S3AFileSystem not found
> Any version of spark : spark-1.3.1 ; spark-1.6.1 even spark-2.0.0 with 
> hadoop.7.2 .
> I understand this is an Hadoop Issue (SPARK-7442)  but can you make some 
> documentation to explain what jar we need to add and where ? ( for standalone 
> installation) .
> "hadoop-aws-x.x.x.jar and aws-java-sdk-x.x.x.jar is enough ? 
> What env variable we need to set and what file we need to modifiy .
> Is it "$CLASSPATH "or a variable in "spark-defaults.conf" with variable 
> "spark.driver.extraClassPath" and "spark.executor.extraClassPath"
> But Still Works with spark-1.6.1 pre build with hadoop2.4 
> Thanks 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432427#comment-15432427
 ] 

Sean Owen commented on SPARK-17055:
---

From a comment in the PR, I get it. This is not actually about labels, but 
about some arbitrary attribute or function of each example. The purpose is to 
group examples into train/test such that examples with the same attribute 
value always go into the same data set. So maybe you want all examples for one 
customer ID to go into train, or all into test, but not split across both.

This needs a different name I think, because 'label' has a specific and 
different meaning, and even scikit says they want to rename it. It's coherent, 
but I still don't know how useful it is. It would need to be reconstructed for 
Spark ML.
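
To make the grouping idea concrete, a hypothetical sketch (not an existing 
Spark API): hash the grouping column into a fold id so that a group never 
straddles training and validation.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{abs, col, hash}

// Every row with the same value in groupCol lands in the same fold.
def groupKFoldSplits(df: DataFrame, groupCol: String, k: Int): Seq[(DataFrame, DataFrame)] = {
  val withFold = df.withColumn("fold", abs(hash(col(groupCol))) % k)
  (0 until k).map { i =>
    (withFold.filter(col("fold") =!= i).drop("fold"),  // training set
     withFold.filter(col("fold") === i).drop("fold"))  // validation set
  }
}
{code}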

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples in k groups of samples. But in cases when data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from training data and put them into validation 
> fold, i.e. we want to ensure that the same label is not in both testing and 
> training sets.
> Mainstream packages like Sklearn already supports such cross validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16363) Spark-submit doesn't work with IAM Roles

2016-08-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432421#comment-15432421
 ] 

Steve Loughran commented on SPARK-16363:


S3A uses {{com.amazonaws.auth.InstanceProfileCredentialsProvider}} to talk to 
the Amazon EC2 Instance Metadata Service. Switch to S3A and Hadoop 2.7+ and you 
should be able to do this.

That said, I do want to make some changes to how Spark propagates env vars, as 
(a) it ignores the AWS_SESSION env var and (b) it stamps on any existing 
id/secret. That's not going to help.



> Spark-submit doesn't work with IAM Roles
> 
>
> Key: SPARK-16363
> URL: https://issues.apache.org/jira/browse/SPARK-16363
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
> Environment: Spark Stand-Alone with EC2 instances configured with IAM 
> Roles. 
>Reporter: Ashic Mahtab
>
> When running Spark Stand-alone in EC2 boxes, 
> spark-submit --master spark://master-ip:7077 --class Foo 
> --deploy-mode cluster --verbose s3://bucket/dir/foo/jar
> fails to find the jar even if AWS IAM roles are configured to allow the EC2 
> boxes (that are running Spark master, and workers) access to the file in S3. 
> The exception is provided below. It's asking us to set keys, etc. when the 
> boxes are configured via IAM roles. 
> 16/07/04 11:44:09 ERROR ClientEndpoint: Exception from cluster was: 
> java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
> must be specified as the username or password (respectively) of a s3 URL, or 
> by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
> (respectively).
> java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
> must be specified as the username or password (respectively) of a s3 URL, or 
> by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
> (respectively).
> at 
> org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
> at 
> org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
> at com.sun.proxy.$Proxy5.initialize(Unknown Source)
> at 
> org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:77)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
> at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1686)
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:598)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:395)
> at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
> at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2016-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432406#comment-15432406
 ] 

Sean Owen commented on SPARK-14560:
---

Even with the fix for SPARK-4452, I observed a problem like what's described 
here: running out of memory in the shuffle read phase rather inexplicably, 
which was worked around with numElementsForceSpillThreshold.

It turned out that enabling Java serialization caused the problem to go away 
entirely. This was in Spark 1.6. No idea why, but leaving this as a note for 
anyone who may find it, or if we later connect the dots elsewhere to an 
underlying problem.
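
For reference, a sketch of the two workarounds mentioned above (the threshold 
value is illustrative and needs per-workload tuning, as the ticket notes):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000")
{code}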

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}} s used by the old RDD api still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  Eg., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all available 15 MB.  Since the 15 
> MB of memory is sufficient, it will not spill, and will continue holding on 
> to all available memory.  But this leaves *no* memory available for the 
> shuffle-write side.  Since the shuffle-write side cannot request the 
> shuffle-read side to free up memory, this leads to an OOM.
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as 
> well, so RDDs can benefit from the cooperative memory management introduced 
> by SPARK-10342.
> Note that an additional improvement would be for the shuffle-read side to 
> simple release unused memory, without spilling, in case that would leave 
> enough memory, and only spill if that was inadequate.  However that can come 
> as a later improvement.
> *Workaround*:  You can set 
> {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to 
> occur every {{N}} elements, thus preventing the shuffle-read side from ever 
> grabbing all of the available memory.  However, this requires careful tuning 
> of {{N}} to specific workloads: too big, and you will still get an OOM; too 
> small, and 

[jira] [Commented] (SPARK-17199) Use CatalystConf.resolver for case-sensitivity comparison

2016-08-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432363#comment-15432363
 ] 

Apache Spark commented on SPARK-17199:
--

User 'jaceklaskowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/14771

> Use CatalystConf.resolver for case-sensitivity comparison
> -
>
> Key: SPARK-17199
> URL: https://issues.apache.org/jira/browse/SPARK-17199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> {{CatalystConf.resolver}} does the branching per {{caseSensitiveAnalysis}}. 
> There's no need to repeat the code across the codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17199) Use CatalystConf.resolver for case-sensitivity comparison

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17199:


Assignee: (was: Apache Spark)

> Use CatalystConf.resolver for case-sensitivity comparison
> -
>
> Key: SPARK-17199
> URL: https://issues.apache.org/jira/browse/SPARK-17199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> {{CatalystConf.resolver}} does the branching per {{caseSensitiveAnalysis}}. 
> There's no need to repeat the code across the codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17199) Use CatalystConf.resolver for case-sensitivity comparison

2016-08-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17199:


Assignee: Apache Spark

> Use CatalystConf.resolver for case-sensitivity comparison
> -
>
> Key: SPARK-17199
> URL: https://issues.apache.org/jira/browse/SPARK-17199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Trivial
>
> {{CatalystConf.resolver}} does the branching per {{caseSensitiveAnalysis}}. 
> There's no need to repeat the code across the codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17199) Use CatalystConf.resolver for case-sensitivity comparison

2016-08-23 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-17199:
---

 Summary: Use CatalystConf.resolver for case-sensitivity comparison
 Key: SPARK-17199
 URL: https://issues.apache.org/jira/browse/SPARK-17199
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1, 2.1.0
Reporter: Jacek Laskowski
Priority: Trivial


{{CatalystConf.resolver}} does the branching per {{caseSensitiveAnalysis}}. 
There's no need to repeat the code across the codebase.
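
For context, a self-contained paraphrase of the branching that 
{{CatalystConf.resolver}} centralizes (in Catalyst, {{Resolver}} is just a 
{{(String, String) => Boolean}}):

{code}
type Resolver = (String, String) => Boolean

// Sketch: pick the comparison once, instead of repeating the if/else at every call site.
def resolverFor(caseSensitiveAnalysis: Boolean): Resolver =
  if (caseSensitiveAnalysis) (a, b) => a == b
  else (a, b) => a.equalsIgnoreCase(b)

// resolverFor(caseSensitiveAnalysis = false)("ColA", "cola")  // true
{code}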



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-23 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432314#comment-15432314
 ] 

Jason Moore commented on SPARK-17195:
-

That's right.

The JDBC API has ResultSetMetaData.isNullable returning:

* ResultSetMetaData.columnNoNulls (= 0) which means the column does not allow 
NULL values
* ResultSetMetaData.columnNullable (= 1) which means the column allows NULL 
values
* ResultSetMetaData.columnNullableUnknown (= 2) which means the nullability of 
a column's values is unknown

In Spark we take this result and do as you've described: if the column is not 
reported as non-null, it is treated as nullable. See the first link in the 
ticket description above.
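
Spelled out, that mapping is essentially (hypothetical helper name):

{code}
import java.sql.ResultSetMetaData

// Anything other than columnNoNulls (both "nullable" and "unknown") becomes nullable = true.
def fieldIsNullable(md: ResultSetMetaData, columnIndex: Int): Boolean =
  md.isNullable(columnIndex) != ResultSetMetaData.columnNoNulls
{code}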

> Dealing with JDBC column nullability when it is not reliable
> 
>
> Key: SPARK-17195
> URL: https://issues.apache.org/jira/browse/SPARK-17195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jason Moore
>
> Starting with Spark 2.0.0, the column "nullable" property is important to 
> have correct for the code generation to work properly.  Marking the column as 
> nullable = false used to (<2.0.0) allow null values to be operated on, but 
> now this will result in:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {noformat}
> I'm all for the change towards a more ridged behavior (enforcing correct 
> input).  But the problem I'm facing now is that when I used JDBC to read from 
> a Teradata server, the column nullability is often not correct (particularly 
> when sub-queries are involved).
> This is the line in question:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
> I'm trying to work out what would be the way forward for me on this.  I know 
> that it's really the fault of the Teradata database server not returning the 
> correct schema, but I'll need to make Spark itself or my application 
> resilient to this behavior.
> One of the Teradata JDBC Driver tech leads has told me that "when the 
> rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
> string, then the other metadata values may not be completely accurate" - so 
> one option could be to treat the nullability (at least) the same way as the 
> "unknown" case (as nullable = true).  For reference, see the rest of our 
> discussion here: 
> http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
> Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432299#comment-15432299
 ] 

Sean Owen commented on SPARK-17195:
---

Got it. Is there really value in a third state? If something is not non null 
then it is nullable.

> Dealing with JDBC column nullability when it is not reliable
> 
>
> Key: SPARK-17195
> URL: https://issues.apache.org/jira/browse/SPARK-17195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jason Moore
>
> Starting with Spark 2.0.0, the column "nullable" property is important to 
> have correct for the code generation to work properly.  Marking the column as 
> nullable = false used to (<2.0.0) allow null values to be operated on, but 
> now this will result in:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {noformat}
> I'm all for the change towards a more ridged behavior (enforcing correct 
> input).  But the problem I'm facing now is that when I used JDBC to read from 
> a Teradata server, the column nullability is often not correct (particularly 
> when sub-queries are involved).
> This is the line in question:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
> I'm trying to work out what would be the way forward for me on this.  I know 
> that it's really the fault of the Teradata database server not returning the 
> correct schema, but I'll need to make Spark itself or my application 
> resilient to this behavior.
> One of the Teradata JDBC Driver tech leads has told me that "when the 
> rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
> string, then the other metadata values may not be completely accurate" - so 
> one option could be to treat the nullability (at least) the same way as the 
> "unknown" case (as nullable = true).  For reference, see the rest of our 
> discussion here: 
> http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
> Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-23 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432285#comment-15432285
 ] 

Jason Moore commented on SPARK-17195:
-

Correct, currently Spark doesn't allow us to override what the driver 
determines as the schema to apply.  It wasn't that I was able to control this 
in the past, but that everything worked fine regardless of the value of 
StructField.nullable.  But now in 2.0, it appears to me that the new code 
generation will break when processing a field (in my case a String) that is 
expected to be non-null but is actually null.

It's definitely not Spark's fault (I hope I've made that clear enough); the 
blame lies either with the JDBC driver or further downstream in the database 
server itself. Over on the Teradata forums (link in the description above) I'm 
raising with their team the possibility of marking the column as 
"columnNullableUnknown", which Spark will then map to nullable=true. We'll see 
if they can manage to do that, and then I hope my problem will be solved.

> Dealing with JDBC column nullability when it is not reliable
> 
>
> Key: SPARK-17195
> URL: https://issues.apache.org/jira/browse/SPARK-17195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jason Moore
>
> Starting with Spark 2.0.0, the column "nullable" property is important to 
> have correct for the code generation to work properly.  Marking the column as 
> nullable = false used to (<2.0.0) allow null values to be operated on, but 
> now this will result in:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {noformat}
> I'm all for the change towards a more ridged behavior (enforcing correct 
> input).  But the problem I'm facing now is that when I used JDBC to read from 
> a Teradata server, the column nullability is often not correct (particularly 
> when sub-queries are involved).
> This is the line in question:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
> I'm trying to work out what would be the way forward for me on this.  I know 
> that it's really the fault of the Teradata database server not returning the 
> correct schema, but I'll need to make Spark itself or my application 
> resilient to this behavior.
> One of the Teradata JDBC Driver tech leads has told me that "when the 
> rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
> string, then the other metadata values may not be completely accurate" - so 
> one option could be to treat the nullability (at least) the same way as the 
> "unknown" case (as nullable = true).  For reference, see the rest of our 
> discussion here: 
> http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
> Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17082) Replace ByteBuffer with ChunkedByteBuffer

2016-08-23 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432284#comment-15432284
 ] 

Guoqiang Li commented on SPARK-17082:
-

OK

> Replace ByteBuffer with ChunkedByteBuffer
> -
>
> Key: SPARK-17082
> URL: https://issues.apache.org/jira/browse/SPARK-17082
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Guoqiang Li
>
> The size of ByteBuffers can not be greater than 2G, should be replaced by 
> ChunkedByteBuffer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17184) Replace ByteBuf with InputStream

2016-08-23 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432279#comment-15432279
 ] 

Guoqiang Li commented on SPARK-17184:
-

ok

> Replace ByteBuf with InputStream
> 
>
> Key: SPARK-17184
> URL: https://issues.apache.org/jira/browse/SPARK-17184
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> The size of ByteBuf can not be greater than 2G, should be replaced by 
> InputStream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16659) use Maven project to submit spark application via yarn-client

2016-08-23 Thread Jagadeesan A S (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadeesan A S closed SPARK-16659.
--
Resolution: Not A Problem

> use Maven project to submit spark application via yarn-client
> -
>
> Key: SPARK-16659
> URL: https://issues.apache.org/jira/browse/SPARK-16659
> Project: Spark
>  Issue Type: Question
>Reporter: Jack Jiang
>  Labels: newbie
>
> i want to use spark sql to execute hive sql in my maven project,here is the 
> main code:
>   System.setProperty("hadoop.home.dir",
>   "D:\\hadoop-common-2.2.0-bin-master");
>   SparkConf sparkConf = new SparkConf()
>   .setAppName("test").setMaster("yarn-client");
>   // .set("hive.metastore.uris", "thrift://172.30.115.59:9083");
>   SparkContext ctx = new SparkContext(sparkConf);
>   // ctx.addJar("lib/hive-hbase-handler-0.14.0.2.2.6.0-2800.jar");
>   HiveContext sqlContext = new 
> org.apache.spark.sql.hive.HiveContext(ctx);
>   String[] tables = sqlContext.tableNames();
>   for (String tablename : tables) {
>   System.out.println("tablename : " + tablename);
>   }
> when i run it,it comes to a error:
> 10:16:17,496  INFO Client:59 - 
>client token: N/A
>diagnostics: Application application_1468409747983_0280 failed 2 times 
> due to AM Container for appattempt_1468409747983_0280_02 exited with  
> exitCode: -1000
> For more detailed output, check application tracking 
> page:http://hadoop003.icccuat.com:8088/proxy/application_1468409747983_0280/Then,
>  click on links to logs of each attempt.
> Diagnostics: File 
> file:/C:/Users/uatxj990267/AppData/Local/Temp/spark-8874c486-893d-4ac3-a088-48e4cdb484e1/__spark_conf__9007071161920501082.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/C:/Users/uatxj990267/AppData/Local/Temp/spark-8874c486-893d-4ac3-a088-48e4cdb484e1/__spark_conf__9007071161920501082.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:608)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:821)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:598)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:414)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Failing this attempt. Failing the application.
>ApplicationMaster host: N/A
>ApplicationMaster RPC port: -1
>queue: default
>start time: 1469067373412
>final status: FAILED
>tracking URL: 
> http://hadoop003.icccuat.com:8088/cluster/app/application_1468409747983_0280
>user: uatxj990267
> 10:16:17,496 ERROR SparkContext:96 - Error initializing SparkContext.
> org.apache.spark.SparkException: Yarn application has already ended! It might 
> have been killed or unable to launch application master.
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:123)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:523)
>   at com.huateng.test.SparkSqlDemo.main(SparkSqlDemo.java:33)
> but when i change this code setMaster("yarn-client") to 
> setMaster(local[2]),it's OK?what's wrong with it ?can anyone help me?



