[jira] [Updated] (SPARK-16621) Generate stable SQLs in SQLBuilder

2016-07-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16621:
---
Assignee: Dongjoon Hyun

> Generate stable SQLs in SQLBuilder
> --
>
> Key: SPARK-16621
> URL: https://issues.apache.org/jira/browse/SPARK-16621
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> Currently, the generated SQL statements contain unstable IDs for generated
> attributes. Stable generated SQL makes the queries easier to understand and
> test. This issue makes SQL generation stable as follows:
> * Provide unique IDs for generated subqueries, `gen_subquery_xxx`.
> * Provide unique and stable IDs for generated attributes, `gen_attr_xxx`.
> **Before**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS 
> gen_subquery_0
> {code}
> **After**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16621) Generate stable SQLs in SQLBuilder

2016-07-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16621.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14257
[https://github.com/apache/spark/pull/14257]

> Generate stable SQLs in SQLBuilder
> --
>
> Key: SPARK-16621
> URL: https://issues.apache.org/jira/browse/SPARK-16621
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> Currently, the generated SQL statements contain unstable IDs for generated
> attributes. Stable generated SQL makes the queries easier to understand and
> test. This issue makes SQL generation stable as follows:
> * Provide unique IDs for generated subqueries, `gen_subquery_xxx`.
> * Provide unique and stable IDs for generated attributes, `gen_attr_xxx`.
> **Before**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS 
> gen_subquery_0
> {code}
> **After**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> {code}






[jira] [Commented] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause

2016-07-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395063#comment-15395063
 ] 

Yin Huai commented on SPARK-16748:
--

It seems we should not wrap the NPE in a TreeNodeException.
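A hypothetical sketch (not Spark's actual fix; the class and helper names below are illustrative stand-ins, not Catalyst's API) of what "not wrapping" could mean: walk the cause chain of the wrapper exception so the user-facing NPE surfaces directly.

```scala
// Illustrative stand-in for Catalyst's TreeNodeException.
class TreeNodeException(msg: String, cause: Throwable)
  extends Exception(msg, cause)

// Recursively unwrap wrapper exceptions until the first real cause.
def rootCause(e: Throwable): Throwable = e match {
  case t: TreeNodeException if t.getCause != null => rootCause(t.getCause)
  case other => other
}

val npe = new NullPointerException("UDF called on null input")
val wrapped = new TreeNodeException("execute, tree: ...", npe)
assert(rootCause(wrapped) eq npe) // the NPE surfaces, not the wrapper
```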

> Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY 
> clause
> ---
>
> Key: SPARK-16748
> URL: https://issues.apache.org/jira/browse/SPARK-16748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> {code}
> import org.apache.spark.sql.functions._
> val myUDF = udf((c: String) => s"""${c.take(5)}""")
> spark.sql("SELECT cast(null as string) as 
> a").select(myUDF($"a").as("b")).orderBy($"b").collect
> {code}
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange rangepartitioning(b#345 ASC, 200)
> +- *Project [UDF(null) AS b#345]
>+- Scan OneRowRelation[]
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163)
> {code}






[jira] [Updated] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause

2016-07-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16748:
-
Description: 
{code}
import org.apache.spark.sql.functions._
val myUDF = udf((c: String) => s"""${c.take(5)}""")
spark.sql("SELECT cast(null as string) as 
a").select(myUDF($"a").as("b")).orderBy($"b").collect
{code}

{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange rangepartitioning(b#345 ASC, 200)
+- *Project [UDF(null) AS b#345]
   +- Scan OneRowRelation[]

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272)
at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163)
{code}

  was:
{code}
import org.apache.spark.sql.functions._
val myUDF = udf((c: String) => s"""${c.take(5)}""")
spark.sql("SELECT cast(null as string) as 
a").select(myUDF($"a").as("b")).orderBy($"b").collect
{code}

{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange rangepartitioning(b#345 ASC, 200)
+- *Project [UDF(null) AS b#345]
   +- Scan OneRowRelation[]

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 

[jira] [Created] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause

2016-07-26 Thread Yin Huai (JIRA)
Yin Huai created SPARK-16748:


 Summary: Errors thrown by UDFs cause TreeNodeException when the 
query has an ORDER BY clause
 Key: SPARK-16748
 URL: https://issues.apache.org/jira/browse/SPARK-16748
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai


{code}
import org.apache.spark.sql.functions._
val myUDF = udf((c: String) => s"""${c.take(5)}""")
spark.sql("SELECT cast(null as string) as 
a").select(myUDF($"a").as("b")).orderBy($"b").collect
{code}

{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange rangepartitioning(b#345 ASC, 200)
+- *Project [UDF(null) AS b#345]
   +- Scan OneRowRelation[]

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272)
at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 16.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 16.0 (TID 862, 10.0.250.152): java.lang.NullPointerException
at 
scala.collection.immutable.StringOps$.slice$extension(StringOps.scala:42)
at scala.collection.immutable.StringOps.slice(StringOps.scala:31)
at 
scala.collection.IndexedSeqOptimized$class.take(IndexedSeqOptimized.scala:132)
at scala.collection.immutable.StringOps.take(StringOps.scala:31)
at 
line2663e8ad718d4a0f8a897d026b83f86035.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:47)
at 
line2663e8ad718d4a0f8a897d026b83f86035.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:47)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
at 
org.apache.spark.RangePartitioner$$anonfun$9.apply(Partitioner.scala:261)
at 

[jira] [Updated] (SPARK-16734) Make sure examples in all language bindings are consistent

2016-07-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16734:
-
Target Version/s: 2.0.1, 2.1.0  (was: 2.0.0, 2.0.1, 2.1.0)

> Make sure examples in all language bindings are consistent
> --
>
> Key: SPARK-16734
> URL: https://issues.apache.org/jira/browse/SPARK-16734
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>







[jira] [Closed] (SPARK-16747) MQTT as a streaming source for SQL Streaming.

2016-07-26 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma closed SPARK-16747.
---
Resolution: Won't Fix

> MQTT as a streaming source for SQL Streaming.
> -
>
> Key: SPARK-16747
> URL: https://issues.apache.org/jira/browse/SPARK-16747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Streaming
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>







[jira] [Comment Edited] (SPARK-16747) MQTT as a streaming source for SQL Streaming.

2016-07-26 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395046#comment-15395046
 ] 

Prashant Sharma edited comment on SPARK-16747 at 7/27/16 5:05 AM:
--

This was created by mistake.


was (Author: prashant_):
This created by mistake.

> MQTT as a streaming source for SQL Streaming.
> -
>
> Key: SPARK-16747
> URL: https://issues.apache.org/jira/browse/SPARK-16747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Streaming
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>







[jira] [Commented] (SPARK-16747) MQTT as a streaming source for SQL Streaming.

2016-07-26 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395046#comment-15395046
 ] 

Prashant Sharma commented on SPARK-16747:
-

This created by mistake.

> MQTT as a streaming source for SQL Streaming.
> -
>
> Key: SPARK-16747
> URL: https://issues.apache.org/jira/browse/SPARK-16747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Streaming
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>







[jira] [Created] (SPARK-16747) MQTT as a streaming source for SQL Streaming.

2016-07-26 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-16747:
---

 Summary: MQTT as a streaming source for SQL Streaming.
 Key: SPARK-16747
 URL: https://issues.apache.org/jira/browse/SPARK-16747
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Streaming
Reporter: Prashant Sharma
Assignee: Prashant Sharma









[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian

2016-07-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395018#comment-15395018
 ] 

Apache Spark commented on SPARK-15194:
--

User 'praveendareddy21' has created a pull request for this issue:
https://github.com/apache/spark/pull/14375

> Add Python ML API for MultivariateGaussian
> --
>
> Key: SPARK-15194
> URL: https://issues.apache.org/jira/browse/SPARK-15194
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> We have a PySpark API for the MLlib version but not the ML version. This 
> would allow Python's `GaussianMixture` to more closely match the Scala API.






[jira] [Updated] (SPARK-16746) Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-26 Thread Hongyao Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hongyao Zhao updated SPARK-16746:
-
Description: 
I wrote a Spark Streaming program which consumes 1000 messages from one topic
of Kafka, does some transformation, and writes the result back to another
topic, but I found only 988 messages in the second topic. I checked the logs
and confirmed that all messages were received by the receivers. However, I
found an HDFS write timeout message printed by class BatchedWriteAheadLog.

I checked out the source code and found code like this:
  
{code:borderStyle=solid}
/** Add received block. This event will get written to the write ahead log 
(if enabled). */ 
  def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = { 
try { 
  val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo)) 
  if (writeResult) { 
synchronized { 
  getReceivedBlockQueue(receivedBlockInfo.streamId) += 
receivedBlockInfo 
} 
logDebug(s"Stream ${receivedBlockInfo.streamId} received " + 
  s"block ${receivedBlockInfo.blockStoreResult.blockId}") 
  } else { 
logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} 
receiving " + 
  s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write 
Ahead Log.") 
  } 
  writeResult 
} catch { 
  case NonFatal(e) => 
logError(s"Error adding block $receivedBlockInfo", e) 
false 
} 
  } 
{code}

It seems that ReceiverTracker tries to write the block info to HDFS, but the
write operation times out. This causes the writeToLog function to return
false, so the line "getReceivedBlockQueue(receivedBlockInfo.streamId) +=
receivedBlockInfo" is skipped and the block info is lost.

The Spark version I use is 1.6.1, and I did not turn on
spark.streaming.receiver.writeAheadLog.enable.

I want to know whether or not this is intended behaviour.
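The failure path can be reduced to a minimal, self-contained sketch (names simplified from the snippet; writeToLog is stubbed here to simulate the HDFS timeout returning false, which is an assumption for illustration):

```scala
import scala.collection.mutable

case class ReceivedBlockInfo(streamId: Int, blockId: String)

val queues = mutable.Map.empty[Int, mutable.Queue[ReceivedBlockInfo]]
def getReceivedBlockQueue(id: Int): mutable.Queue[ReceivedBlockInfo] =
  queues.getOrElseUpdate(id, mutable.Queue.empty)

// Stub: pretend the write-ahead-log write to HDFS timed out.
def writeToLog(info: ReceivedBlockInfo): Boolean = false

def addBlock(info: ReceivedBlockInfo): Boolean = {
  val writeResult = writeToLog(info)
  if (writeResult) {
    getReceivedBlockQueue(info.streamId) += info
  }
  writeResult // false is returned to the caller, but nothing retries the block
}

addBlock(ReceivedBlockInfo(0, "input-0-1"))
// The receiver stored the block, yet it never reaches the queue that feeds
// batch allocation, so its records are silently dropped.
assert(getReceivedBlockQueue(0).isEmpty)
```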

  was:
I wrote a Spark Streaming program which consumes 1000 messages from one topic
of Kafka, does some transformation, and writes the result back to another
topic, but I found only 988 messages in the second topic. I checked the logs
and confirmed that all messages were received by the receivers. However, I
found an HDFS write timeout message printed by class BatchedWriteAheadLog.

I checked out the source code and found code like this:
  
{code:title=code|borderStyle=solid}
/** Add received block. This event will get written to the write ahead log 
(if enabled). */ 
  def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = { 
try { 
  val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo)) 
  if (writeResult) { 
synchronized { 
  getReceivedBlockQueue(receivedBlockInfo.streamId) += 
receivedBlockInfo 
} 
logDebug(s"Stream ${receivedBlockInfo.streamId} received " + 
  s"block ${receivedBlockInfo.blockStoreResult.blockId}") 
  } else { 
logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} 
receiving " + 
  s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write 
Ahead Log.") 
  } 
  writeResult 
} catch { 
  case NonFatal(e) => 
logError(s"Error adding block $receivedBlockInfo", e) 
false 
} 
  } 
{code}

It seems that ReceiverTracker tries to write the block info to HDFS, but the
write operation times out. This causes the writeToLog function to return
false, so the line "getReceivedBlockQueue(receivedBlockInfo.streamId) +=
receivedBlockInfo" is skipped and the block info is lost.

The Spark version I use is 1.6.1, and I did not turn on
spark.streaming.receiver.writeAheadLog.enable.

I want to know whether or not this is intended behaviour.


> Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs 
> timeout
> ---
>
> Key: SPARK-16746
> URL: https://issues.apache.org/jira/browse/SPARK-16746
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Hongyao Zhao
>Priority: Minor
>
> I wrote a Spark Streaming program which consumes 1000 messages from one
> topic of Kafka, does some transformation, and writes the result back to
> another topic, but I found only 988 messages in the second topic. I checked
> the logs and confirmed that all messages were received by the receivers.
> However, I found an HDFS write timeout message printed by class
> BatchedWriteAheadLog.
> 
> I checked out the source code and found code like this:
>   
> {code:borderStyle=solid}
> /** Add received block. This event will get written to the write ahead 
> log (if enabled). */ 
>   def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = { 
> try { 
>   

[jira] [Updated] (SPARK-16746) Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-26 Thread Hongyao Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hongyao Zhao updated SPARK-16746:
-
Description: 
I wrote a Spark Streaming program which consumes 1000 messages from one topic
of Kafka, does some transformation, and writes the result back to another
topic, but I found only 988 messages in the second topic. I checked the logs
and confirmed that all messages were received by the receivers. However, I
found an HDFS write timeout message printed by class BatchedWriteAheadLog.

I checked out the source code and found code like this:
  
{code:title=code|borderStyle=solid}
/** Add received block. This event will get written to the write ahead log 
(if enabled). */ 
  def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = { 
try { 
  val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo)) 
  if (writeResult) { 
synchronized { 
  getReceivedBlockQueue(receivedBlockInfo.streamId) += 
receivedBlockInfo 
} 
logDebug(s"Stream ${receivedBlockInfo.streamId} received " + 
  s"block ${receivedBlockInfo.blockStoreResult.blockId}") 
  } else { 
logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} 
receiving " + 
  s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write 
Ahead Log.") 
  } 
  writeResult 
} catch { 
  case NonFatal(e) => 
logError(s"Error adding block $receivedBlockInfo", e) 
false 
} 
  } 
{code}

It seems that ReceiverTracker tries to write the block info to HDFS, but the
write operation times out. This causes the writeToLog function to return
false, so the line "getReceivedBlockQueue(receivedBlockInfo.streamId) +=
receivedBlockInfo" is skipped and the block info is lost.

The Spark version I use is 1.6.1, and I did not turn on
spark.streaming.receiver.writeAheadLog.enable.

I want to know whether or not this is intended behaviour.

  was:
I wrote a Spark Streaming program which consumes 1000 messages from one topic
of Kafka, does some transformation, and writes the result back to another
topic, but I found only 988 messages in the second topic. I checked the logs
and confirmed that all messages were received by the receivers. However, I
found an HDFS write timeout message printed by class BatchedWriteAheadLog.

I checked out the source code and found code like this:
  
{code:title=Bar.scala|borderStyle=solid}
/** Add received block. This event will get written to the write ahead log 
(if enabled). */ 
  def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = { 
try { 
  val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo)) 
  if (writeResult) { 
synchronized { 
  getReceivedBlockQueue(receivedBlockInfo.streamId) += 
receivedBlockInfo 
} 
logDebug(s"Stream ${receivedBlockInfo.streamId} received " + 
  s"block ${receivedBlockInfo.blockStoreResult.blockId}") 
  } else { 
logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} 
receiving " + 
  s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write 
Ahead Log.") 
  } 
  writeResult 
} catch { 
  case NonFatal(e) => 
logError(s"Error adding block $receivedBlockInfo", e) 
false 
} 
  } 
{code}

It seems that ReceiverTracker tries to write the block info to HDFS, but the
write operation times out. This causes the writeToLog function to return
false, so the line "getReceivedBlockQueue(receivedBlockInfo.streamId) +=
receivedBlockInfo" is skipped and the block info is lost.

The Spark version I use is 1.6.1, and I did not turn on
spark.streaming.receiver.writeAheadLog.enable.

I want to know whether or not this is intended behaviour.


> Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs 
> timeout
> ---
>
> Key: SPARK-16746
> URL: https://issues.apache.org/jira/browse/SPARK-16746
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Hongyao Zhao
>Priority: Minor
>
> I wrote a Spark Streaming program which consumes 1000 messages from one
> topic of Kafka, does some transformation, and writes the result back to
> another topic, but I found only 988 messages in the second topic. I checked
> the logs and confirmed that all messages were received by the receivers.
> However, I found an HDFS write timeout message printed by class
> BatchedWriteAheadLog.
> 
> I checked out the source code and found code like this:
>   
> {code:title=code|borderStyle=solid}
> /** Add received block. This event will get written to the write ahead 
> log (if enabled). */ 
>   def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean 

[jira] [Updated] (SPARK-16746) Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-26 Thread Hongyao Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hongyao Zhao updated SPARK-16746:
-
Description: 
I wrote a Spark Streaming program which consumes 1000 messages from one topic
of Kafka, does some transformation, and writes the result back to another
topic, but I found only 988 messages in the second topic. I checked the logs
and confirmed that all messages were received by the receivers. However, I
found an HDFS write timeout message printed by class BatchedWriteAheadLog.

I checked out the source code and found code like this:
  
{code:title=ReceivedBlockTracker.scala|borderStyle=solid}
  /** Add received block. This event will get written to the write ahead log (if enabled). */
  def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
    try {
      val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
      if (writeResult) {
        synchronized {
          getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
        }
        logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
          s"block ${receivedBlockInfo.blockStoreResult.blockId}")
      } else {
        logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
          s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
      }
      writeResult
    } catch {
      case NonFatal(e) =>
        logError(s"Error adding block $receivedBlockInfo", e)
        false
    }
  }
{code}

It seems that ReceiverTracker tries to write the block info to HDFS, but the
write operation times out. This causes writeToLog to return false, so the line
"getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo" is
skipped and the block info is lost.

The Spark version I use is 1.6.1, and I did not turn on
spark.streaming.receiver.writeAheadLog.enable.

I want to know whether this is intended behaviour.
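To make the reported failure mode concrete, here is a minimal, self-contained Scala sketch. These are not Spark's actual classes; `WalDropDemo`, `BlockInfo`, and the `walWriteSucceeds` flag are hypothetical stand-ins for ReceivedBlockTracker's block queue and the WAL write, with `walWriteSucceeds = false` simulating an HDFS write timeout:

```scala
import scala.collection.mutable

// Hypothetical model of the flow described above: a block is enqueued only
// when the write-ahead-log write reports success, so a timed-out write
// silently drops the block.
object WalDropDemo {
  case class BlockInfo(streamId: Int, blockId: String)

  val queue = mutable.Queue[BlockInfo]()

  // Stand-in for BatchedWriteAheadLog.write; `succeeded = false`
  // simulates an HDFS write timeout.
  def writeToLog(info: BlockInfo, succeeded: Boolean): Boolean = succeeded

  def addBlock(info: BlockInfo, walWriteSucceeds: Boolean): Boolean = {
    val writeResult = writeToLog(info, walWriteSucceeds)
    if (writeResult) {
      queue += info // on a timeout this line is skipped and the block is lost
    }
    writeResult
  }
}
```

With `walWriteSucceeds = false` the method returns false and the queue is left unchanged, which matches the observed loss of 12 out of 1000 messages.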


> Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs 
> timeout
> ---
>
> Key: SPARK-16746
> URL: https://issues.apache.org/jira/browse/SPARK-16746
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Hongyao Zhao
>Priority: Minor
>
> I wrote a Spark Streaming program which consumes 1000 messages from one topic
> of Kafka, does some transformation, and writes the result back to another
> topic, but I only found 988 messages in the second topic. I checked the logs
> and confirmed that all messages were received by the receivers, but I found an
> HDFS write timeout message printed by the BatchedWriteAheadLog class.
> 
> I checked out the source code and found code like this:
>   
> {code:title=Bar.scala|borderStyle=solid}
> /** Add received block. This event will get written to the write ahead 
> log (if enabled). */ 
>   def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = { 
> try { 
>

[jira] [Created] (SPARK-16746) Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-26 Thread Hongyao Zhao (JIRA)
Hongyao Zhao created SPARK-16746:


 Summary: Spark streaming lost data when ReceiverTracker writes 
Blockinfo to hdfs timeout
 Key: SPARK-16746
 URL: https://issues.apache.org/jira/browse/SPARK-16746
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.6.1
Reporter: Hongyao Zhao
Priority: Minor


I wrote a Spark Streaming program which consumes 1000 messages from one topic of
Kafka, does some transformation, and writes the result back to another topic,
but I only found 988 messages in the second topic. I checked the logs and
confirmed that all messages were received by the receivers, but I found an HDFS
write timeout message printed by the BatchedWriteAheadLog class.

I checked out the source code and found code like this:
  
{code}
  /** Add received block. This event will get written to the write ahead log (if enabled). */
  def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
    try {
      val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
      if (writeResult) {
        synchronized {
          getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
        }
        logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
          s"block ${receivedBlockInfo.blockStoreResult.blockId}")
      } else {
        logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
          s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
      }
      writeResult
    } catch {
      case NonFatal(e) =>
        logError(s"Error adding block $receivedBlockInfo", e)
        false
    }
  }
{code}

It seems that ReceiverTracker tries to write the block info to HDFS, but the
write operation times out. This causes writeToLog to return false, so the line
"getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo" is
skipped and the block info is lost.

The Spark version I use is 1.6.1, and I did not turn on
spark.streaming.receiver.writeAheadLog.enable.

I want to know whether this is intended behaviour.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR

2016-07-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395003#comment-15395003
 ] 

Yanbo Liang commented on SPARK-16446:
-

Will send PR soon.

> Gaussian Mixture Model wrapper in SparkR
> 
>
> Key: SPARK-16446
> URL: https://issues.apache.org/jira/browse/SPARK-16446
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Follow instructions in SPARK-16442 and implement Gaussian Mixture Model 
> wrapper in SparkR.






[jira] [Created] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)

2016-07-26 Thread Joe Chong (JIRA)
Joe Chong created SPARK-16745:
-

 Summary: Spark job completed however have to wait for 13 mins 
(data size is small)
 Key: SPARK-16745
 URL: https://issues.apache.org/jira/browse/SPARK-16745
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.6.1
 Environment: Mac OS X Yosemite, Terminal, MacBook Air Late 2014
Reporter: Joe Chong
Priority: Minor


I submitted a job in the Scala spark-shell to show a DataFrame. The data size is
about 43K. The job was successful in the end, but took more than 13 minutes to
complete. Upon checking the log, there were multiple "Failed to check existence
of class" exceptions with a java.net.ConnectException message indicating a
timeout when trying to connect to port 52067, the REPL class server port that
Spark set up. Please assist in troubleshooting. Thanks.

Started Spark in standalone mode

$ spark-shell --driver-memory 5g --master local[*]
16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong
16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: joechong
16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(joechong); users 
with modify permissions: Set(joechong)
16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server
16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/07/26 21:05:30 INFO server.AbstractConnector: Started 
SocketConnector@0.0.0.0:52067
16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class 
server' on port 52067.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1
16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong
16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: joechong
16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(joechong); users 
with modify permissions: Set(joechong)
16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' 
on port 52072.
16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/07/26 21:05:35 INFO Remoting: Starting remoting
16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074]
16/07/26 21:05:35 INFO util.Utils: Successfully started service 
'sparkDriverActorSystem' on port 52074.
16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker
16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster
16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at 
/private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6
16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity 
3.4 GB
16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/07/26 21:05:36 INFO server.AbstractConnector: Started 
SelectChannelConnector@0.0.0.0:4040
16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on 
port 4040.
16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at http://10.199.29.218:4040
16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host 
localhost
16/07/26 21:05:36 INFO executor.Executor: Using REPL class URI: 
http://10.199.29.218:52067
16/07/26 21:05:36 INFO util.Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 52075.
16/07/26 21:05:36 INFO netty.NettyBlockTransferService: Server created on 52075
16/07/26 21:05:36 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/07/26 21:05:36 INFO storage.BlockManagerMasterEndpoint: Registering block 
manager localhost:52075 with 3.4 GB RAM, BlockManagerId(driver, localhost, 
52075)
16/07/26 21:05:36 INFO storage.BlockManagerMaster: Registered BlockManager
16/07/26 21:05:36 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.
16/07/26 21:05:37 INFO hive.HiveContext: Initializing execution hive, version 
1.2.1
16/07/26 21:05:37 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0
16/07/26 21:05:37 INFO client.ClientWrapper: Loaded 
org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0
16/07/26 21:05:38 INFO metastore.HiveMetaStore: 0: Opening 

[jira] [Updated] (SPARK-16666) Kryo encoder for custom complex classes

2016-07-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16666:
---
Description: 
I'm trying to create a Dataset with some geo data using Spark and Esri. If
`Foo` has only the `Point` field, it works, but if I add other fields besides
the `Point`, I get an ArrayIndexOutOfBoundsException.
{code:scala}
import com.esri.core.geometry.Point
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object Main {

  case class Foo(position: Point, name: String)

  object MyEncoders {
    implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]

    implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import MyEncoders.{FooEncoder, PointEncoder}
    import sqlContext.implicits._
    Seq(new Foo(new Point(0, 0), "bar")).toDS.show
  }
}
{code}
{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
  at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
  at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
  at org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69)
  at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
  at Main$.main(Main.scala:24)
  at Main.main(Main.scala)
{noformat}

[jira] [Resolved] (SPARK-16524) Add RowBatch and RowBasedHashMapGenerator

2016-07-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16524.
-
   Resolution: Fixed
 Assignee: Qifan Pu
Fix Version/s: 2.1.0

> Add RowBatch and RowBasedHashMapGenerator
> -
>
> Key: SPARK-16524
> URL: https://issues.apache.org/jira/browse/SPARK-16524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Qifan Pu
>Assignee: Qifan Pu
> Fix For: 2.1.0
>
>
> This JIRA adds the implementations for `RowBatch` and 
> `RowBasedHashMapGenerator`. 






[jira] [Comment Edited] (SPARK-16725) Migrate Guava to 16+?

2016-07-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394875#comment-15394875
 ] 

Marcelo Vanzin edited comment on SPARK-16725 at 7/27/16 1:04 AM:
-

There is a clear guideline for when you need a library that is different from 
what Spark provides: shade the library in your applications.

Is it optimal? No.

Does it cause jar bloat? Yes.

Does it suck? Totally.

*But it works.* For everybody. Right now. Without having to wait for Spark's 
dependencies to fix their own dependency mess, or for Spark to fix its own.

Some people can use the "userClassPathFirst" options to achieve similar things 
without shading, but that doesn't work for everyone.
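As a concrete illustration of the shading approach described above (a sketch, not a definitive recipe: it assumes an sbt build with the sbt-assembly plugin, and the relocation target `myapp.shaded.guava` is a made-up package name), an application can bundle its own Guava and rename its packages so it cannot clash with the Guava 14 that Spark puts on the classpath:

```scala
// build.sbt (sketch, assuming the sbt-assembly plugin is enabled):
// relocate the application's Guava classes into a private namespace
// so they never collide with Spark's Guava 14.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1").inAll
)
```

Maven users can achieve the same effect with the maven-shade-plugin's `<relocation>` configuration.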


was (Author: vanzin):
There is a clear guideline for when you need a library that is different from 
what Spark provides: shade the library in your applications.

Is it optimal? No.

Does it cause jar bloat? Yes.

Does it suck? Totally.

*But it works.* For everybody. Right now. Without having to wait for Spark's 
downstream dependencies to fix their own dependency mess, or for Spark to fix 
its own.

Some people can use the "userClassPathFirst" options to achieve similar things 
without shading, but that doesn't work for everyone.

> Migrate Guava to 16+?
> -
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14. However, the
> Spark Cassandra driver asserts on Guava version 16 and above.
> It would be great to update the Guava dependency to version 16+.
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> index f72c7de..abddafe 100644
> --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
> +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom}
>  import java.security.cert.X509Certificate
>  import javax.net.ssl._
>  
> -import com.google.common.hash.HashCodes
> +import com.google.common.hash.HashCode
>  import com.google.common.io.Files
>  import org.apache.hadoop.io.Text
>  
> @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
>  val secret = new Array[Byte](length)
>  rnd.nextBytes(secret)
>  
> -val cookie = HashCodes.fromBytes(secret).toString()
> +val cookie = HashCode.fromBytes(secret).toString()
>  SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, 
> cookie)
>  cookie
>} else {
> diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
> b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> index af50a6d..02545ae 100644
> --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
> +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> @@ -72,7 +72,7 @@ class SparkEnv (
>  
>// A general, soft-reference map for metadata needed during HadoopRDD 
> split computation
>// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats).
> -  private[spark] val hadoopJobMetadata = new 
> MapMaker().softValues().makeMap[String, Any]()
> +  private[spark] val hadoopJobMetadata = new 
> MapMaker().weakValues().makeMap[String, Any]()
>  
>private[spark] var driverTmpDir: Option[String] = None
>  
> diff --git a/pom.xml b/pom.xml
> index d064cb5..7c3e036 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -368,8 +368,7 @@
>
>  com.google.guava
>  guava
> -14.0.1
> -provided
> +19.0
>
>
>






[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?

2016-07-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394875#comment-15394875
 ] 

Marcelo Vanzin commented on SPARK-16725:


There is a clear guideline for when you need a library that is different from 
what Spark provides: shade the library in your applications.

Is it optimal? No.

Does it cause jar bloat? Yes.

Does it suck? Totally.

*But it works.* For everybody. Right now. Without having to wait for Spark's 
downstream dependencies to fix their own dependency mess, or for Spark to fix 
its own.

Some people can use the "userClassPathFirst" options to achieve similar things 
without shading, but that doesn't work for everyone.

> Migrate Guava to 16+?
> -
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14, but the 
> Spark Cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+.
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> index f72c7de..abddafe 100644
> --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
> +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom}
>  import java.security.cert.X509Certificate
>  import javax.net.ssl._
>  
> -import com.google.common.hash.HashCodes
> +import com.google.common.hash.HashCode
>  import com.google.common.io.Files
>  import org.apache.hadoop.io.Text
>  
> @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
>  val secret = new Array[Byte](length)
>  rnd.nextBytes(secret)
>  
> -val cookie = HashCodes.fromBytes(secret).toString()
> +val cookie = HashCode.fromBytes(secret).toString()
>  SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, 
> cookie)
>  cookie
>} else {
> diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
> b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> index af50a6d..02545ae 100644
> --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
> +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> @@ -72,7 +72,7 @@ class SparkEnv (
>  
>// A general, soft-reference map for metadata needed during HadoopRDD 
> split computation
>// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats).
> -  private[spark] val hadoopJobMetadata = new 
> MapMaker().softValues().makeMap[String, Any]()
> +  private[spark] val hadoopJobMetadata = new 
> MapMaker().weakValues().makeMap[String, Any]()
>  
>private[spark] var driverTmpDir: Option[String] = None
>  
> diff --git a/pom.xml b/pom.xml
> index d064cb5..7c3e036 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -368,8 +368,7 @@
>
>  com.google.guava
>  guava
> -14.0.1
> -provided
> +19.0
>
>
>
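The `softValues()` to `weakValues()` change in the diff above has no direct Python equivalent, but the weak-value semantics it switches to can be sketched with the standard library's `WeakValueDictionary`: an entry disappears as soon as its value has no remaining strong reference (soft values, by contrast, are reclaimed only under memory pressure). A minimal sketch, not Spark code:

```python
import gc
import weakref

class JobConf:
    """Stand-in for cached Hadoop job metadata."""
    def __init__(self, name):
        self.name = name

cache = weakref.WeakValueDictionary()

conf = JobConf("job-1")
cache["job-1"] = conf
assert cache.get("job-1") is conf  # value alive while strongly referenced

del conf      # drop the last strong reference
gc.collect()  # with weak values, the entry is now collectable
assert cache.get("job-1") is None  # entry vanished from the cache
```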






[jira] [Commented] (SPARK-16735) Fail to create a map containing a decimal type with literals having different inferred precisions and scales

2016-07-26 Thread Liang Ke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394864#comment-15394864
 ] 

Liang Ke commented on SPARK-16735:
--

https://github.com/apache/spark/pull/14374

> Fail to create a map containing a decimal type with literals having 
> different inferred precisions and scales
> -
>
> Key: SPARK-16735
> URL: https://issues.apache.org/jira/browse/SPARK-16735
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liang Ke
>
> In Spark 2.0, we will parse float literals as decimals. However, it 
> introduces a side-effect, which is described below.
> spark-sql> select map(0.1,0.01, 0.2,0.033);
> Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS 
> DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due 
> to data type mismatch: The given values of function map should all be the 
> same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7
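The usual widening rule for combining two decimal types (keep enough integral digits and enough fractional digits for both) would resolve the pair above to decimal(3,3). A hedged sketch of that rule, not Spark's actual implementation:

```python
def widen_decimal(p1, s1, p2, s2):
    """Combine two decimal(p, s) types: keep the larger integral-digit
    count (p - s) and the larger scale (s) of the two inputs."""
    integral = max(p1 - s1, p2 - s2)
    scale = max(s1, s2)
    return (integral + scale, scale)

# decimal(2,2) vs decimal(3,3) -> decimal(3,3)
print(widen_decimal(2, 2, 3, 3))  # -> (3, 3)
```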






[jira] [Commented] (SPARK-16715) Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"

2016-07-26 Thread Liang Ke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394860#comment-15394860
 ] 

Liang Ke commented on SPARK-16715:
--

Sorry, I mistook the JIRA id.

> Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic 
> equals and hash"
> 
>
> Key: SPARK-16715
> URL: https://issues.apache.org/jira/browse/SPARK-16715
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.1, 2.1.0
>
>
> SubexpressionEliminationSuite."Semantic equals and hash" assumes the default 
> AttributeReference's exprId won't be "ExprId(1)". However, that depends on 
> when this test runs; it may happen to be "ExprId(1)".
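The flakiness pattern described above is generic: ids drawn from a shared global counter depend on how many ids earlier tests already allocated. A hedged Python sketch of the pattern (not the Scala suite itself):

```python
import itertools

_counter = itertools.count(1)

def fresh_id():
    """Globally increasing id, analogous to a fresh ExprId."""
    return next(_counter)

# Asserting a specific id value is order-dependent:
first = fresh_id()
assert first == 1   # passes only because nothing allocated an id earlier

# The robust form compares against ids the test itself holds,
# never against a hard-coded counter position:
second = fresh_id()
assert second != first
```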






[jira] [Commented] (SPARK-16744) spark.yarn.appMasterEnv handling assumes values should be appended

2016-07-26 Thread Kevin Grealish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394812#comment-15394812
 ] 

Kevin Grealish commented on SPARK-16744:


Issue discovered while doing a narrower fix for SPARK-16110.

> spark.yarn.appMasterEnv handling assumes values should be appended
> --
>
> Key: SPARK-16744
> URL: https://issues.apache.org/jira/browse/SPARK-16744
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
> Environment: YARN
>Reporter: Kevin Grealish
>
> When processing environment variables set via the conf setting 
> spark.yarn.appMasterEnv.*, an existing value causes the new value to be 
> appended, as if it were a CLASSPATH-like value. This does not work for most 
> environment variables, for example PYSPARK_PYTHON.
> Related:
> https://issues.apache.org/jira/browse/MAPREDUCE-6491
> https://issues.apache.org/jira/browse/SPARK-16110
> https://github.com/apache/spark/pull/13824
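A hedged sketch of the two merge behaviors at issue: path-like variables can legitimately be appended with a separator, but most variables (for example PYSPARK_PYTHON) must replace any existing value outright. This is illustrative only, not Spark's YARN code:

```python
import os

def set_env(env, key, value, path_like=False):
    """Merge a new value into an environment dict."""
    if path_like and key in env:
        # CLASSPATH-style: append with the platform separator
        env[key] = env[key] + os.pathsep + value
    else:
        # Plain variables: the new value must replace the old one
        env[key] = value
    return env

env = {"CLASSPATH": "/a.jar", "PYSPARK_PYTHON": "/usr/bin/python2"}
set_env(env, "CLASSPATH", "/b.jar", path_like=True)
set_env(env, "PYSPARK_PYTHON", "/opt/py3/bin/python")
print(env["CLASSPATH"])       # /a.jar + separator + /b.jar
print(env["PYSPARK_PYTHON"])  # replaced, not appended
```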






[jira] [Created] (SPARK-16744) spark.yarn.appMasterEnv handling assumes values should be appended

2016-07-26 Thread Kevin Grealish (JIRA)
Kevin Grealish created SPARK-16744:
--

 Summary: spark.yarn.appMasterEnv handling assumes values should be 
appended
 Key: SPARK-16744
 URL: https://issues.apache.org/jira/browse/SPARK-16744
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.0.0, 1.6.2, 1.6.1, 1.6.0
 Environment: YARN
Reporter: Kevin Grealish


When processing environment variables set via the conf setting 
spark.yarn.appMasterEnv.*, an existing value causes the new value to be 
appended, as if it were a CLASSPATH-like value. This does not work for most 
environment variables, for example PYSPARK_PYTHON.

Related:
https://issues.apache.org/jira/browse/MAPREDUCE-6491
https://issues.apache.org/jira/browse/SPARK-16110
https://github.com/apache/spark/pull/13824






[jira] [Assigned] (SPARK-16735) Fail to create a map containing a decimal type with literals having different inferred precisions and scales

2016-07-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16735:


Assignee: Apache Spark

> Fail to create a map containing a decimal type with literals having 
> different inferred precisions and scales
> -
>
> Key: SPARK-16735
> URL: https://issues.apache.org/jira/browse/SPARK-16735
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liang Ke
>Assignee: Apache Spark
>
> In Spark 2.0, we will parse float literals as decimals. However, it 
> introduces a side-effect, which is described below.
> spark-sql> select map(0.1,0.01, 0.2,0.033);
> Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS 
> DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due 
> to data type mismatch: The given values of function map should all be the 
> same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7






[jira] [Assigned] (SPARK-16735) Fail to create a map containing a decimal type with literals having different inferred precisions and scales

2016-07-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16735:


Assignee: (was: Apache Spark)

> Fail to create a map containing a decimal type with literals having 
> different inferred precisions and scales
> -
>
> Key: SPARK-16735
> URL: https://issues.apache.org/jira/browse/SPARK-16735
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liang Ke
>
> In Spark 2.0, we will parse float literals as decimals. However, it 
> introduces a side-effect, which is described below.
> spark-sql> select map(0.1,0.01, 0.2,0.033);
> Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS 
> DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due 
> to data type mismatch: The given values of function map should all be the 
> same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7






[jira] [Commented] (SPARK-16715) Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"

2016-07-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394783#comment-15394783
 ] 

Apache Spark commented on SPARK-16715:
--

User 'biglobster' has created a pull request for this issue:
https://github.com/apache/spark/pull/14374

> Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic 
> equals and hash"
> 
>
> Key: SPARK-16715
> URL: https://issues.apache.org/jira/browse/SPARK-16715
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.1, 2.1.0
>
>
> SubexpressionEliminationSuite."Semantic equals and hash" assumes the default 
> AttributeReference's exprId won't be "ExprId(1)". However, that depends on 
> when this test runs; it may happen to be "ExprId(1)".






[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?

2016-07-26 Thread Min Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394768#comment-15394768
 ] 

Min Wei commented on SPARK-16725:
-

It is still worth using a newer version of Guava. It looks like the upgrade to 
Guava 16+ is postponed until at least Hadoop 3.0: 
   https://issues.apache.org/jira/browse/HADOOP-11319

Hopefully the Guava devs will be more disciplined about API compatibility; 
there seem to be quite a few Guava-related JIRAs. 

> Migrate Guava to 16+?
> -
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14, but the 
> Spark Cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+.
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> index f72c7de..abddafe 100644
> --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
> +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom}
>  import java.security.cert.X509Certificate
>  import javax.net.ssl._
>  
> -import com.google.common.hash.HashCodes
> +import com.google.common.hash.HashCode
>  import com.google.common.io.Files
>  import org.apache.hadoop.io.Text
>  
> @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
>  val secret = new Array[Byte](length)
>  rnd.nextBytes(secret)
>  
> -val cookie = HashCodes.fromBytes(secret).toString()
> +val cookie = HashCode.fromBytes(secret).toString()
>  SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, 
> cookie)
>  cookie
>} else {
> diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
> b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> index af50a6d..02545ae 100644
> --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
> +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> @@ -72,7 +72,7 @@ class SparkEnv (
>  
>// A general, soft-reference map for metadata needed during HadoopRDD 
> split computation
>// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats).
> -  private[spark] val hadoopJobMetadata = new 
> MapMaker().softValues().makeMap[String, Any]()
> +  private[spark] val hadoopJobMetadata = new 
> MapMaker().weakValues().makeMap[String, Any]()
>  
>private[spark] var driverTmpDir: Option[String] = None
>  
> diff --git a/pom.xml b/pom.xml
> index d064cb5..7c3e036 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -368,8 +368,7 @@
>
>  com.google.guava
>  guava
> -14.0.1
> -provided
> +19.0
>
>
>






[jira] [Updated] (SPARK-16621) Generate stable SQLs in SQLBuilder

2016-07-26 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16621:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-16576

> Generate stable SQLs in SQLBuilder
> --
>
> Key: SPARK-16621
> URL: https://issues.apache.org/jira/browse/SPARK-16621
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> Currently, the generated SQL contains unstable IDs for generated attributes.
> Stable generated SQL makes the queries easier to understand and test. This 
> issue provides stable SQL generation as follows.
> * Provide unique ids for generated subqueries, `gen_subquery_xxx`.
> * Provide unique and stable ids for generated attributes, `gen_attr_xxx`.
> **Before**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS 
> gen_subquery_0
> {code}
> **After**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> {code}
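The effect of the change above can be sketched as a normalization pass that renumbers generated attribute ids in order of first appearance, so two generations of the same query compare equal. This is illustrative only, not the SQLBuilder implementation:

```python
import re

def stabilize(sql):
    """Renumber gen_attr_N ids in order of first appearance."""
    mapping = {}
    def repl(match):
        old = match.group(0)
        if old not in mapping:
            mapping[old] = "gen_attr_%d" % len(mapping)
        return mapping[old]
    return re.sub(r"gen_attr_\d+", repl, sql)

a = "SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0"
b = "SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0"
assert stabilize(a) == stabilize(b)  # both normalize to gen_attr_0
```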






[jira] [Commented] (SPARK-15232) Add subquery SQL building tests to LogicalPlanToSQLSuite

2016-07-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394738#comment-15394738
 ] 

Dongjoon Hyun commented on SPARK-15232:
---

Hi, [~hvanhovell].
May I work on this issue?

> Add subquery SQL building tests to LogicalPlanToSQLSuite
> 
>
> Key: SPARK-15232
> URL: https://issues.apache.org/jira/browse/SPARK-15232
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Priority: Minor
>
> We currently test subquery SQL building using the {{HiveCompatibilitySuite}}. 
> This is not desirable, since SQL building is actually part of sql/core and 
> because we are slowly reducing our dependency on Hive.






[jira] [Issue Comment Deleted] (SPARK-16735) Fail to create a map containing a decimal type with literals having different inferred precisions and scales

2016-07-26 Thread Liang Ke (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Ke updated SPARK-16735:
-
Comment: was deleted

(was: hi, if some one can help me to push my patch to github ? 
thx a lot : ))

> Fail to create a map containing a decimal type with literals having 
> different inferred precisions and scales
> -
>
> Key: SPARK-16735
> URL: https://issues.apache.org/jira/browse/SPARK-16735
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liang Ke
>
> In Spark 2.0, we will parse float literals as decimals. However, it 
> introduces a side-effect, which is described below.
> spark-sql> select map(0.1,0.01, 0.2,0.033);
> Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS 
> DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due 
> to data type mismatch: The given values of function map should all be the 
> same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7






[jira] [Updated] (SPARK-16735) Fail to create a map containing a decimal type with literals having different inferred precisions and scales

2016-07-26 Thread Liang Ke (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Ke updated SPARK-16735:
-
Attachment: (was: SPARK-16735.patch)

> Fail to create a map containing a decimal type with literals having 
> different inferred precisions and scales
> -
>
> Key: SPARK-16735
> URL: https://issues.apache.org/jira/browse/SPARK-16735
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liang Ke
>
> In Spark 2.0, we will parse float literals as decimals. However, it 
> introduces a side-effect, which is described below.
> spark-sql> select map(0.1,0.01, 0.2,0.033);
> Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS 
> DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due 
> to data type mismatch: The given values of function map should all be the 
> same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7






[jira] [Comment Edited] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR

2016-07-26 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394654#comment-15394654
 ] 

Xin Ren edited comment on SPARK-16445 at 7/26/16 10:17 PM:
---

I'm still working on it; hopefully by the end of this weekend I can submit a PR :)

I just have a quick question: which parameters should be passed from the R 
command?

For fit() of wrapper class, there are many parameters 
https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-ccb8590441998a896d1b74ca605b56efR62
{code}
  def fit(
  formula: String,
  data: DataFrame,
  blockSize: Int,
  layers: Array[Int],
  initialWeights: Vector,
  solver: String,
  seed: Long,
  maxIter: Int,
  tol: Double,
  stepSize: Double
 ): MultilayerPerceptronClassifierWrapper = {
{code}


And for the R part, should I pass all the parameters from the R command? 
https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-7ede1519b4a56647801b51af33c2dd18R461

I find in the example 
(http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier),
 only the parameters below are being set; the rest just use default values.

{code}
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)
{code}


was (Author: iamshrek):
I'm still working on it; hopefully by the end of this weekend I can submit a PR :)

I just have a quick question: which parameters should be passed from the R 
command?

For fit() of wrapper class, there are many parameters 
https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-ccb8590441998a896d1b74ca605b56efR62
{code}
  def fit(
  formula: String,
  data: DataFrame,
  blockSize: Int,
  layers: Array[Int],
  initialWeights: Vector,
  solver: String,
  seed: Long,
  maxIter: Int,
  tol: Double,
  stepSize: Double
 ): MultilayerPerceptronClassifierWrapper = {
{code}


And for the R part, should I pass all the parameters from the R command? 
https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-7ede1519b4a56647801b51af33c2dd18R461
I find in the example 
(http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier),
 only the parameters below are being set; the rest just use default values.

{code}
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)
{code}

> Multilayer Perceptron Classifier wrapper in SparkR
> --
>
> Key: SPARK-16445
> URL: https://issues.apache.org/jira/browse/SPARK-16445
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xin Ren
>
> Follow instructions in SPARK-16442 and implement multilayer perceptron 
> classifier wrapper in SparkR.






[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR

2016-07-26 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394654#comment-15394654
 ] 

Xin Ren commented on SPARK-16445:
-

I'm still working on it; hopefully by the end of this weekend I can submit a PR :)

I just have a quick question: which parameters should be passed from the R 
command?

For fit() of wrapper class, there are many parameters 
https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-ccb8590441998a896d1b74ca605b56efR62
{code}
  def fit(
  formula: String,
  data: DataFrame,
  blockSize: Int,
  layers: Array[Int],
  initialWeights: Vector,
  solver: String,
  seed: Long,
  maxIter: Int,
  tol: Double,
  stepSize: Double
 ): MultilayerPerceptronClassifierWrapper = {
{code}


And for the R part, should I pass all the parameters from the R command? 
https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-7ede1519b4a56647801b51af33c2dd18R461
I find in the example 
(http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier),
 only the parameters below are being set; the rest just use default values.

{code}
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)
{code}
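The linked example sets only a few of the available parameters and leaves the rest at their defaults; in a wrapper, that is naturally modeled with keyword defaults. A hedged Python sketch of that design question (names and default values are illustrative, not the SparkR API):

```python
def fit_mlp(formula, data, layers, block_size=128, seed=None,
            max_iter=100, tol=1e-6, step_size=0.03, solver="l-bfgs"):
    """Illustrative wrapper: callers pass only the parameters they
    care about; everything else falls back to a default."""
    return {
        "formula": formula, "layers": layers, "blockSize": block_size,
        "seed": seed, "maxIter": max_iter, "tol": tol,
        "stepSize": step_size, "solver": solver,
    }

params = fit_mlp("label ~ features", data=None, layers=[4, 5, 4, 3],
                 seed=1234, max_iter=100)
assert params["blockSize"] == 128  # default kept
assert params["seed"] == 1234      # explicitly overridden
```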

> Multilayer Perceptron Classifier wrapper in SparkR
> --
>
> Key: SPARK-16445
> URL: https://issues.apache.org/jira/browse/SPARK-16445
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xin Ren
>
> Follow instructions in SPARK-16442 and implement multilayer perceptron 
> classifier wrapper in SparkR.






[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?

2016-07-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394652#comment-15394652
 ] 

Sean Owen commented on SPARK-16725:
---

Again there are two Guavas here: the unshaded one for Hadoop et al, and the 
shaded one for Spark. The unshaded one is 14. If you update it, other things 
break. 

(In 1.6.x, there are actually unshaded Guava Optional classes in the assembly 
for historical reasons, but I don't know whether that would trigger the warning.)

Spark requires Hadoop classes one way or the other. You certainly have them 
somewhere or nothing will work. You can build Spark to build in Hadoop classes 
or have them provided at runtime. Maybe you need the latter. If the runtime 
doesn't provide the dependency, how can any app work without its dependency 
bundled?

Yes, it's likely that the ideal resolution here is for spark-cassandra to shade 
Guava to be safe, if it's bundling unshaded Guava. I'm not suggesting it's your 
change to make, but, not Spark's either.

Shading means moving the dependency to a different namespace; I'm not sure what 
you mean. It's probably not realistic to get rid of Guava; it's useful, and 
even if Spark did, the problem is that its dependencies may not. I also don't 
think the version is random; versions are as high as they can be without 
breaking compatibility with other commonly-paired dependencies. That is, 
there's a reason it's 14 and not 15.



> Migrate Guava to 16+?
> -
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14, but the 
> Spark Cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+.
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> index f72c7de..abddafe 100644
> --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
> +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom}
>  import java.security.cert.X509Certificate
>  import javax.net.ssl._
>  
> -import com.google.common.hash.HashCodes
> +import com.google.common.hash.HashCode
>  import com.google.common.io.Files
>  import org.apache.hadoop.io.Text
>  
> @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
>  val secret = new Array[Byte](length)
>  rnd.nextBytes(secret)
>  
> -val cookie = HashCodes.fromBytes(secret).toString()
> +val cookie = HashCode.fromBytes(secret).toString()
>  SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, 
> cookie)
>  cookie
>} else {
> diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
> b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> index af50a6d..02545ae 100644
> --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
> +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> @@ -72,7 +72,7 @@ class SparkEnv (
>  
>// A general, soft-reference map for metadata needed during HadoopRDD 
> split computation
>// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats).
> -  private[spark] val hadoopJobMetadata = new 
> MapMaker().softValues().makeMap[String, Any]()
> +  private[spark] val hadoopJobMetadata = new 
> MapMaker().weakValues().makeMap[String, Any]()
>  
>private[spark] var driverTmpDir: Option[String] = None
>  
> diff --git a/pom.xml b/pom.xml
> index d064cb5..7c3e036 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -368,8 +368,7 @@
>
>  com.google.guava
>  guava
> -14.0.1
> -provided
> +19.0
>
>
>






[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Sylvain Zimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394644#comment-15394644
 ] 

Sylvain Zimmer commented on SPARK-16740:


OK! Looks like that would be [~davies] :-)

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
> Interestingly, it only seems to happen when reading Parquet data (I added a 
> {{crash = True}} variable to show it).
> This is a {{left_outer}} example, but it also crashes with a regular 
> {{inner}} join.
> {code}
> import os
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> # Valid Long values (-9223372036854775808 < -5543241376386463808 , 
> 4661454128115150227 < 9223372036854775807)
> data1 = [(4661454128115150227,), (-5543241376386463808,)]
> data2 = [(650460285, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
> # Should print [Row(id2=650460285, id1=None)]
> print result_df.collect()
> {code}
> When run with {{spark-submit}}, the final {{collect()}} call crashes with 
> this:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o61.collectToPython.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> 
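The stack trace above ends inside the broadcast hash join machinery, and the issue title points at {{joins.LongToUnsafeRowMap}}. One plausible explanation (an assumption, not confirmed against the Spark source here) is that a dense map sizes an internal array from the key range, and for the two valid Long keys in the repro that range overflows signed 64-bit arithmetic and wraps negative:

```python
# Hypothetical sketch (assumption): if an array is sized from the join-key
# range (max_key - min_key + 1), the two valid Long keys in the repro
# overflow signed 64-bit arithmetic and the computed size wraps negative.
MAX_LONG = 2**63 - 1

min_key = -5543241376386463808
max_key = 4661454128115150227

span = max_key - min_key + 1   # exact in Python, no overflow here
assert span > MAX_LONG         # wider than any signed 64-bit long can hold

def to_int64(x):
    """Simulate two's-complement wraparound of a 64-bit signed value."""
    return (x + 2**63) % 2**64 - 2**63

assert to_int64(span) < 0      # wraps negative -> NegativeArraySizeException
```

If this range-based sizing is indeed the cause, any pair of keys more than 2^63 apart would trigger the same wraparound, which matches the report that only these extreme Long values crash.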

[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR

2016-07-26 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394642#comment-15394642
 ] 

Xiangrui Meng commented on SPARK-16445:
---

[~iamshrek] Any updates?

> Multilayer Perceptron Classifier wrapper in SparkR
> --
>
> Key: SPARK-16445
> URL: https://issues.apache.org/jira/browse/SPARK-16445
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xin Ren
>
> Follow instructions in SPARK-16442 and implement multilayer perceptron 
> classifier wrapper in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR

2016-07-26 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394639#comment-15394639
 ] 

Xiangrui Meng commented on SPARK-16446:
---

[~yanboliang] Any updates?

> Gaussian Mixture Model wrapper in SparkR
> 
>
> Key: SPARK-16446
> URL: https://issues.apache.org/jira/browse/SPARK-16446
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Follow instructions in SPARK-16442 and implement Gaussian Mixture Model 
> wrapper in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16444) Isotonic Regression wrapper in SparkR

2016-07-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16444:
--
Shepherd: Junyang Qian

> Isotonic Regression wrapper in SparkR
> -
>
> Key: SPARK-16444
> URL: https://issues.apache.org/jira/browse/SPARK-16444
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>Assignee: Miao Wang
>
> Implement Isotonic Regression wrapper and other utils in SparkR.
> {code}
> spark.isotonicRegression(data, formula, ...)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394634#comment-15394634
 ] 

Dongjoon Hyun edited comment on SPARK-16740 at 7/26/16 10:03 PM:
-

You had better look up who made that code with the `git blame` command and 
ask them after passing Jenkins.
That is the fastest way to get reviewed. :)


was (Author: dongjoon):
You had better look up the one you made that with `git blame` command and ask 
him after passing Jenkins.
That is the fastest way to get reviewed. :)

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
> Interestingly, it only seems to happen when reading Parquet data (I added a 
> {{crash = True}} variable to show it)
> This is an {{left_outer}} example, but it also crashes with a regular 
> {{inner}} join.
> {code}
> import os
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> # Valid Long values (-9223372036854775808 < -5543241376386463808 , 
> 4661454128115150227 < 9223372036854775807)
> data1 = [(4661454128115150227,), (-5543241376386463808,)]
> data2 = [(650460285, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
> # Should print [Row(id2=650460285, id1=None)]
> print result_df.collect()
> {code}
> When run with {{spark-submit}}, the final {{collect()}} call crashes with 
> this:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o61.collectToPython.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> 
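Since the trace shows the failure under {{BroadcastHashJoinExec.prepareBroadcast}}, one workaround worth trying (an untested assumption, not a confirmed fix) is to disable automatic broadcast joins so the planner falls back to a non-broadcast join strategy:

```python
# Untested workaround sketch: steer the planner away from the broadcast
# hash join path by disabling auto-broadcast. `sqlc`, `df1`, and `df2`
# are the SQLContext and DataFrames from the repro above.
sqlc.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
```

This only sidesteps the broadcast code path; it does not address the underlying map-sizing bug.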

[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394634#comment-15394634
 ] 

Dongjoon Hyun commented on SPARK-16740:
---

You had better look up who made that code with the `git blame` command and ask 
them after passing Jenkins.
That is the fastest way to get reviewed. :)

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>

[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Sylvain Zimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394628#comment-15394628
 ] 

Sylvain Zimmer commented on SPARK-16740:


Thanks! I just did. Let me know if that's okay.

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>

[jira] [Assigned] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16740:


Assignee: Apache Spark

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Apache Spark
>

[jira] [Assigned] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16740:


Assignee: (was: Apache Spark)

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>

[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394624#comment-15394624
 ] 

Apache Spark commented on SPARK-16740:
--

User 'sylvinus' has created a pull request for this issue:
https://github.com/apache/spark/pull/14373

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>

[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?

2016-07-26 Thread Min Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394622#comment-15394622
 ] 

Min Wei commented on SPARK-16725:
-

I would love to see Spark succeed as a platform, and not get caught up in some 
versioning mess down the road.  

>Spark shades Guava and therefore doesn't leak it.
This does not seem to be true. In my case, I only used spark-shell and 
spark-cassandra in the standalone environment, no Hadoop bits (at least not 
explicitly).

>Spark depends on Hadoop, Hadoop depends on unshaded Guava. 
It does not seem right that Spark has to provide a jar for Hadoop, the lower 
level. 

>shield yourself by shading is pretty good
For context, "I", as the user, am not at "fault" here. I am using Spark and one 
dependent component, Spark-cassandra. It does not seem right that "I", as the 
consumer, have to do any shading. 

On a separate note, it looks like Guava's versioning specifically is a bit 
random. If Spark/Hadoop cannot get rid of the dependency, one option is to copy 
the code under a different namespace, i.e. a more explicit form of shading. 
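
As a sketch of that explicit-relocation idea, the maven-shade-plugin can rewrite Guava's package names at build time (the {{shadedPattern}} below is illustrative; Spark's actual shade configuration may use a different target package):

```xml
<!-- Hypothetical sketch: relocate Guava under a project-private package so
     that user code can bring its own Guava version without conflicts. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.spark_project.guava</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```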

My two cents. 

> Migrate Guava to 16+?
> -
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14. However 
> Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> index f72c7de..abddafe 100644
> --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
> +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom}
>  import java.security.cert.X509Certificate
>  import javax.net.ssl._
>  
> -import com.google.common.hash.HashCodes
> +import com.google.common.hash.HashCode
>  import com.google.common.io.Files
>  import org.apache.hadoop.io.Text
>  
> @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
>  val secret = new Array[Byte](length)
>  rnd.nextBytes(secret)
>  
> -val cookie = HashCodes.fromBytes(secret).toString()
> +val cookie = HashCode.fromBytes(secret).toString()
>  SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, 
> cookie)
>  cookie
>} else {
> diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
> b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> index af50a6d..02545ae 100644
> --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
> +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> @@ -72,7 +72,7 @@ class SparkEnv (
>  
>// A general, soft-reference map for metadata needed during HadoopRDD 
> split computation
>// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats).
> -  private[spark] val hadoopJobMetadata = new 
> MapMaker().softValues().makeMap[String, Any]()
> +  private[spark] val hadoopJobMetadata = new 
> MapMaker().weakValues().makeMap[String, Any]()
>  
>private[spark] var driverTmpDir: Option[String] = None
>  
> diff --git a/pom.xml b/pom.xml
> index d064cb5..7c3e036 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -368,8 +368,7 @@
>
>  com.google.guava
>  guava
> -14.0.1
> -provided
> +19.0
>
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?

2016-07-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394587#comment-15394587
 ] 

Sean Owen commented on SPARK-16725:
---

Thanks for the reminder [~vanzin], I caught up again on the current state here. 

There's a long story here, but, no, Spark shades Guava and therefore doesn't 
leak it. However it used to have to 'leak' the Optional class in Guava before 
2.x.

Spark depends on Hadoop, and Hadoop depends on unshaded Guava, so any assembly 
containing Spark needs Hadoop and some unshaded Guava. That's what the Guava 14 
is about, IIUC. I suspect that in principle Spark could both include Guava 14 
for other dependencies and use shaded Guava 19 internally, but that could be 
tricky. I suppose there just hasn't been a compelling reason.

Guava isn't backwards compatible for more than a few versions, actually. Fair 
enough, they're all major releases. But moving the dependency forward in 
general does break things.

The bad news is that, whatever Spark does, you still face leakage from Hadoop 
or other projects. Hence the advice to shield yourself by shading is pretty 
good, downsides notwithstanding, because it's a problem that crops up in more 
than just Spark.

> Migrate Guava to 16+?
> -
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14. However 
> Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> index f72c7de..abddafe 100644
> --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
> +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom}
>  import java.security.cert.X509Certificate
>  import javax.net.ssl._
>  
> -import com.google.common.hash.HashCodes
> +import com.google.common.hash.HashCode
>  import com.google.common.io.Files
>  import org.apache.hadoop.io.Text
>  
> @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
>  val secret = new Array[Byte](length)
>  rnd.nextBytes(secret)
>  
> -val cookie = HashCodes.fromBytes(secret).toString()
> +val cookie = HashCode.fromBytes(secret).toString()
>  SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, 
> cookie)
>  cookie
>} else {
> diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
> b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> index af50a6d..02545ae 100644
> --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
> +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> @@ -72,7 +72,7 @@ class SparkEnv (
>  
>// A general, soft-reference map for metadata needed during HadoopRDD 
> split computation
>// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats).
> -  private[spark] val hadoopJobMetadata = new 
> MapMaker().softValues().makeMap[String, Any]()
> +  private[spark] val hadoopJobMetadata = new 
> MapMaker().weakValues().makeMap[String, Any]()
>  
>private[spark] var driverTmpDir: Option[String] = None
>  
> diff --git a/pom.xml b/pom.xml
> index d064cb5..7c3e036 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -368,8 +368,7 @@
>
>  com.google.guava
>  guava
> -14.0.1
> -provided
> +19.0
>
>
>






[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?

2016-07-26 Thread Min Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394572#comment-15394572
 ] 

Min Wei commented on SPARK-16725:
-

First, I am not blocked, and given the OSS nature I can hack anything I need to 
make it work. I am purely curious how Spark plans to manage versioning as a 
more general platform. 

As it stands, it looks like Spark is "leaking" the Guava dependency. I just did 
a web search, and it looks like quite a bit of energy has been spent on this: 
   
https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/HnTsWJkI5jo
   https://issues.apache.org/jira/browse/ZEPPELIN-620

My suggestion is that the Spark platform needs to provide guidelines, so that 
spark-cassandra and other pieces built on top can follow them. Otherwise it 
will be painful for upper-stack developers and users to consume the whole 
stack. 

>Spark has to ship a Guava jar because Hadoop needs it 
I don't understand this. I assume Hadoop is a dependency of Spark. Spark uses 
v14 of Guava to shadow v11 in Hadoop? 

>Changing from 14 to 16 will fix your use case, but what about someone who wants 
>a different version? 
As long as the version is moving forward, not backwards. Of course, in this case 
Guava itself could have done a better job of backwards compatibility.

>"shade your custom dependencies" works for everyone, 
Won't this cause code/jar bloat and pain for everyone?


> Migrate Guava to 16+?
> -
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14. However 
> Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> index f72c7de..abddafe 100644
> --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
> +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom}
>  import java.security.cert.X509Certificate
>  import javax.net.ssl._
>  
> -import com.google.common.hash.HashCodes
> +import com.google.common.hash.HashCode
>  import com.google.common.io.Files
>  import org.apache.hadoop.io.Text
>  
> @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
>  val secret = new Array[Byte](length)
>  rnd.nextBytes(secret)
>  
> -val cookie = HashCodes.fromBytes(secret).toString()
> +val cookie = HashCode.fromBytes(secret).toString()
>  SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, 
> cookie)
>  cookie
>} else {
> diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
> b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> index af50a6d..02545ae 100644
> --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
> +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> @@ -72,7 +72,7 @@ class SparkEnv (
>  
>// A general, soft-reference map for metadata needed during HadoopRDD 
> split computation
>// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats).
> -  private[spark] val hadoopJobMetadata = new 
> MapMaker().softValues().makeMap[String, Any]()
> +  private[spark] val hadoopJobMetadata = new 
> MapMaker().weakValues().makeMap[String, Any]()
>  
>private[spark] var driverTmpDir: Option[String] = None
>  
> diff --git a/pom.xml b/pom.xml
> index d064cb5..7c3e036 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -368,8 +368,7 @@
>
>  com.google.guava
>  guava
> -14.0.1
> -provided
> +19.0
>
>
>






[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394555#comment-15394555
 ] 

Dongjoon Hyun commented on SPARK-16740:
---

Maybe you could modify the following line?
{code}
val range = maxKey - minKey
{code}
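
To see why that line is suspect, here is a small sketch (the key values are taken from the repro below; the wrap-around arithmetic mimics a JVM signed 64-bit {{Long}}, which Python integers do not have natively):

```python
# Sketch of the suspected overflow: with one very large positive key and one
# very negative key, `maxKey - minKey` exceeds Long.MaxValue and wraps negative,
# which would explain the NegativeArraySizeException when sizing the map.
MAX_KEY = 4661454128115150227
MIN_KEY = -5543241376386463808

def as_long(x):
    """Wrap an arbitrary Python int to a signed 64-bit value, like a JVM Long."""
    return (x + 2**63) % 2**64 - 2**63

range_ = as_long(MAX_KEY - MIN_KEY)  # `val range = maxKey - minKey` in Scala
print(range_)  # prints -8242048569207937581 -- negative, hence the crash
```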

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
> Interestingly, it only seems to happen when reading Parquet data (I added a 
> {{crash = True}} variable to show it)
> This is a {{left_outer}} example, but it also crashes with a regular 
> {{inner}} join.
> {code}
> import os
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> # Valid Long values (-9223372036854775808 < -5543241376386463808 , 
> 4661454128115150227 < 9223372036854775807)
> data1 = [(4661454128115150227,), (-5543241376386463808,)]
> data2 = [(650460285, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
> # Should print [Row(id2=650460285, id1=None)]
> print result_df.collect()
> {code}
> When run with {{spark-submit}}, the final {{collect()}} call crashes with 
> this:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o61.collectToPython.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
> 

[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394553#comment-15394553
 ] 

Dongjoon Hyun commented on SPARK-16740:
---

Hi, [~sylvinus]. 
It looks like it. Could you make a PR for this issue?

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
> Interestingly, it only seems to happen when reading Parquet data (I added a 
> {{crash = True}} variable to show it)
> This is a {{left_outer}} example, but it also crashes with a regular 
> {{inner}} join.
> {code}
> import os
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> # Valid Long values (-9223372036854775808 < -5543241376386463808 , 
> 4661454128115150227 < 9223372036854775807)
> data1 = [(4661454128115150227,), (-5543241376386463808,)]
> data2 = [(650460285, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
> # Should print [Row(id2=650460285, id1=None)]
> print result_df.collect()
> {code}
> When run with {{spark-submit}}, the final {{collect()}} call crashes with 
> this:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o61.collectToPython.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 

[jira] [Created] (SPARK-16743) converter and access code out of sync: createDataFrame on RDD[Option[C]] fails with MatchError

2016-07-26 Thread Daniel Barclay (JIRA)
Daniel Barclay created SPARK-16743:
--

 Summary: converter and access code out of sync: createDataFrame on 
RDD[Option[C]] fails with MatchError
 Key: SPARK-16743
 URL: https://issues.apache.org/jira/browse/SPARK-16743
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.2, 1.6.1
Reporter: Daniel Barclay


Calling {{SqlContext}}'s {{createDataFrame}} on an RDD of type 
{{RDD\[Option\[SomeUserClass]]}} leads to an internal error.

For example, if the first field of {{SomeUserClass}} is of type {{String}}, 
evaluating the RDD yields a {{MatchError}} referring to an instance of 
{{SomeUserClass}} in  
{{org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl}}
 (which should be passed only certain kinds of representations of 
strings).

The problem seems to be in {{ExistingRDD.scala}}'s 
{{RDDConversions.productToRowRdd(...)}}:

It has a list of converters that reflects the list of members of 
{{SomeUserClass}} (looking past the {{Option}} part of the RDD record type 
{{Option\[SomeUserClass]}}).

However, the data-access code ({{r.productElement\(i)}}) does not seem to look 
past the {{Option}} part correspondingly.  (It does not seem to traverse 
from the {{Some}} instance to the {{SomeUserClass}}.)

Therefore, it ends up passing the instance of {{SomeUserClass}} to the 
converter intended for the first member field of {{SomeUserClass}} (e.g., a 
String converter), yielding an internal error.

(If {{RDD\[Option\[...]]}} doesn't make sense in the first place, it should be 
rejected with an explicit error rather than failing with an internal 
inconsistency.) 
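
The mismatch described above can be illustrated with a small, hypothetical Python analogue (none of these names are Spark's; it only mirrors the structure of the bug):

```python
# Hypothetical analogue: the converter list is built from the *fields* of the
# inner class, but the access code hands over the un-unwrapped wrapper element,
# so a field-level converter receives a whole object and fails.
class SomeUserClass:
    def __init__(self, name):
        self.name = name

record = ("Some", SomeUserClass("a"))   # stands in for Some(SomeUserClass("a"))
converters = [str.upper]                # built for the first field (a string)

# Correct path: unwrap the Option-like wrapper, then convert the field.
ok = converters[0](record[1].name)      # 'A'

# Buggy path: pass the whole SomeUserClass to the string converter.
try:
    converters[0](record[1])
except TypeError as exc:
    print("converter mismatch:", exc)
```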






[jira] [Created] (SPARK-16742) Kerberos support for Spark on Mesos

2016-07-26 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-16742:
---

 Summary: Kerberos support for Spark on Mesos
 Key: SPARK-16742
 URL: https://issues.apache.org/jira/browse/SPARK-16742
 Project: Spark
  Issue Type: New Feature
  Components: Mesos
Reporter: Michael Gummelt


We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
contributing it to Apache Spark soon.






[jira] [Commented] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()

2016-07-26 Thread Zoltan Fedor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394515#comment-15394515
 ] 

Zoltan Fedor commented on SPARK-16741:
--

Most databases allow you to set unique keys and disallow the insert of 
duplicate values on those keys, but unfortunately you usually need special SQL 
in the INSERT (on Spark's side) to ignore those errors silently. And even if 
that worked, there are cases where you simply cannot create a unique key (the 
data itself is not unique).
Hence I think the best option is to add an attribute to jdbc.write() which 
would let the user turn off speculative mode for that jdbc.write() only. 
Basically, that would allow the user to decide whether to override the global 
speculative mode for this one operation and run it without speculation and 
with no duplicates.

> spark.speculation causes duplicate rows in df.write.jdbc()
> --
>
> Key: SPARK-16741
> URL: https://issues.apache.org/jira/browse/SPARK-16741
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: PySpark 1.6.2, Oracle Linux 6.5, Oracle 11.2
>Reporter: Zoltan Fedor
>Priority: Minor
>
> Since a fix added to Spark 1.6.2 we can write string data back into an Oracle 
> database, so I went to try it out and found that rows showed up duplicated in 
> the database table after they got inserted into our Oracle database.
> The code we use is very simple:
> df = sqlContext.sql("SELECT * FROM example_temp_table")
> df.write.jdbc("jdbc:oracle:thin:"+connection_script, "target_table")
> The data in the 'target_table' in the database has twice as many rows as the 
> 'df' dataframe in SparkSQL.
> After some investigation it turns out that this is caused by our 
> spark.speculation setting being set to True.
> As soon as we turned this off, there were no more duplicates generated.
> This somewhat makes sense - spark.speculation causes the map jobs to run in 2 
> copies - resulting in every row being inserted into our Oracle database 
> twice.
> Probably the df.write.jdbc() method does not consider a Spark context running 
> in speculative mode, hence the inserts coming from the speculative tasks also 
> go through - causing every record to be inserted twice.
> This bug is likely independent of the database type (we use Oracle) 
> and of whether PySpark, Scala, or Java is used.






[jira] [Commented] (SPARK-12261) pyspark crash for large dataset

2016-07-26 Thread Shea Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394508#comment-15394508
 ] 

Shea Parkes commented on SPARK-12261:
-

I still can't get this bug to reproduce reliably locally, but I can confirm 
that my proposed bandaid of exhausting the iterator at the end of 
{{takeUpToNumLeft()}} made the error go away entirely in manual testing.  Would 
the Spark maintainers be willing to accept such a band-aid into the main 
codebase, or is this something I'd just need to maintain in our own Spark 
distributions? (I already maintain a couple other modifications related to 
Windows mess.)

A more root cause fix would require more Scala knowledge than I have.  I'd 
likely propose to define a new constant (e.g. 
{{SpecialLengths.PYTHON_WORKER_EXITING}}) and put logic into Scala that would 
make it immediately stop sending new data over the socket...
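
The band-aid described above can be sketched roughly as follows (names are simplified and this is not the exact PySpark source; the idea is that take() stops pulling from the iterator once it has enough rows, which can leave the JVM writing into an abandoned socket, so draining the iterator avoids the connection reset):

```python
# Sketch of the band-aid: instead of breaking out early, keep consuming the
# remaining items so the producer on the other end finishes its writes cleanly.
def take_up_to_num_left(iterator, num_left):
    taken = 0
    for item in iterator:
        if taken < num_left:
            taken += 1
            yield item
        # no early break: the loop exhausts the iterator after num_left items

src = iter(range(10))
result = list(take_up_to_num_left(src, 3))
print(result)      # [0, 1, 2]
print(list(src))   # [] -- the source iterator was fully drained
```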

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text file (over 100MB) via textFile in PySpark; when 
> I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then I ran the same code on a small text file, and this time .take() worked fine.
> How can I solve this problem?






[jira] [Updated] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()

2016-07-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16741:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

Yeah, the problem is that it's a global setting. I wonder if it's possible to 
make your operation idempotent by failing if both inserts happen -- like by 
having some unique key constraint that the second insert would violate.
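
The unique-key idea can be sketched with SQLite as a stand-in database (INSERT OR IGNORE is SQLite syntax; Oracle would need e.g. a MERGE statement -- this only illustrates the idempotency, not Spark's actual JDBC writer):

```python
import sqlite3

# Sketch: a unique key makes the write idempotent, so a speculative duplicate
# insert is rejected instead of doubling the row count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")

# Two "speculative" task attempts insert the same row.
for _ in range(2):
    conn.execute("INSERT OR IGNORE INTO target VALUES (1, 'row')")

count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(count)  # 1 -- the duplicate attempt was rejected by the unique key
```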

> spark.speculation causes duplicate rows in df.write.jdbc()
> --
>
> Key: SPARK-16741
> URL: https://issues.apache.org/jira/browse/SPARK-16741
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: PySpark 1.6.2, Oracle Linux 6.5, Oracle 11.2
>Reporter: Zoltan Fedor
>Priority: Minor
>
> Since a fix added in Spark 1.6.2, we can write string data back into an Oracle 
> database, so I tried it out and found that rows showed up duplicated in the 
> database table after they were inserted.
> The code we use is very simple:
> df = sqlContext.sql("SELECT * FROM example_temp_table")
> df.write.jdbc("jdbc:oracle:thin:"+connection_script, "target_table")
> The data in the 'target_table' in the database has twice as many rows as the 
> 'df' dataframe in SparkSQL.
> After some investigation, it turns out that this is caused by our 
> spark.speculation setting being set to True.
> As soon as we turned it off, no more duplicates were generated.
> This somewhat makes sense - spark.speculation causes map tasks to run in 2 
> copies - resulting in every row being inserted into our Oracle database twice.
> Probably the df.write.jdbc() method does not account for a Spark context 
> running in speculative mode, so the inserts coming from the speculative task 
> copies also get inserted - causing every record to be inserted twice.
> This bug is likely independent of the database type (we use Oracle) and of 
> whether PySpark, Scala, or Java is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()

2016-07-26 Thread Zoltan Fedor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394501#comment-15394501
 ] 

Zoltan Fedor commented on SPARK-16741:
--

Thanks, Sean. I agree that it is debatable whether this is a bug.
My thinking is that since I don't see a scenario where df.write.jdbc() would 
need to run in speculative mode, it should automatically turn speculation off 
when called. Whether df.write.jdbc() not doing so automatically is a bug or a 
missing feature is debatable. I am okay with changing this to a feature 
request.

> spark.speculation causes duplicate rows in df.write.jdbc()
> --
>
> Key: SPARK-16741
> URL: https://issues.apache.org/jira/browse/SPARK-16741
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: PySpark 1.6.2, Oracle Linux 6.5, Oracle 11.2
>Reporter: Zoltan Fedor
>
> Since a fix added in Spark 1.6.2, we can write string data back into an Oracle 
> database, so I tried it out and found that rows showed up duplicated in the 
> database table after they were inserted.
> The code we use is very simple:
> df = sqlContext.sql("SELECT * FROM example_temp_table")
> df.write.jdbc("jdbc:oracle:thin:"+connection_script, "target_table")
> The data in the 'target_table' in the database has twice as many rows as the 
> 'df' dataframe in SparkSQL.
> After some investigation, it turns out that this is caused by our 
> spark.speculation setting being set to True.
> As soon as we turned it off, no more duplicates were generated.
> This somewhat makes sense - spark.speculation causes map tasks to run in 2 
> copies - resulting in every row being inserted into our Oracle database twice.
> Probably the df.write.jdbc() method does not account for a Spark context 
> running in speculative mode, so the inserts coming from the speculative task 
> copies also get inserted - causing every record to be inserted twice.
> This bug is likely independent of the database type (we use Oracle) and of 
> whether PySpark, Scala, or Java is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()

2016-07-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394480#comment-15394480
 ] 

Sean Owen commented on SPARK-16741:
---

Yeah, because the output isn't idempotent, I'm not sure you can use speculation 
here. Although the insert occurs in a transaction, it's possible for it to 
succeed in both tasks before one can be cancelled. I'm not sure it's therefore 
a bug.

> spark.speculation causes duplicate rows in df.write.jdbc()
> --
>
> Key: SPARK-16741
> URL: https://issues.apache.org/jira/browse/SPARK-16741
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: PySpark 1.6.2, Oracle Linux 6.5, Oracle 11.2
>Reporter: Zoltan Fedor
>
> Since a fix added in Spark 1.6.2, we can write string data back into an Oracle 
> database, so I tried it out and found that rows showed up duplicated in the 
> database table after they were inserted.
> The code we use is very simple:
> df = sqlContext.sql("SELECT * FROM example_temp_table")
> df.write.jdbc("jdbc:oracle:thin:"+connection_script, "target_table")
> The data in the 'target_table' in the database has twice as many rows as the 
> 'df' dataframe in SparkSQL.
> After some investigation, it turns out that this is caused by our 
> spark.speculation setting being set to True.
> As soon as we turned it off, no more duplicates were generated.
> This somewhat makes sense - spark.speculation causes map tasks to run in 2 
> copies - resulting in every row being inserted into our Oracle database twice.
> Probably the df.write.jdbc() method does not account for a Spark context 
> running in speculative mode, so the inserts coming from the speculative task 
> copies also get inserted - causing every record to be inserted twice.
> This bug is likely independent of the database type (we use Oracle) and of 
> whether PySpark, Scala, or Java is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2016-07-26 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394485#comment-15394485
 ] 

Alok Singh commented on SPARK-16611:


Hi [~shivaram]

Sorry for the late reply; I was on vacation. Here are the detailed answers:

a) lapply or map, lapplyPartition or mapPartition

We have pre-processing code (i.e. recoding, NA imputation, outlier handling, 
etc.) implemented that differs from what Spark currently provides (SparkR 
doesn't have it yet), and hence we use apply on the data frame. 
lapplyPartition was added to the list because we don't want to re-shuffle data, 
and for efficiency we would prefer lapplyPartition over lapply in certain 
cases.

b) flatMap:
Not a high priority; we can live without it :)

c) RDD: toRDD, getJRDD (i.e. the RDD API)

SystemML works internally on its binary block matrix format, and its API is 
based on the JavaRDD API; our R extension takes a SparkR data frame, extracts 
the RDD, and passes it to the SystemML Java API. Note that Spark discourages 
use of the RDD API, since using data frames enables all the Catalyst optimizer 
features. However, in our case we are sure we will just take the RDD and 
convert it to the binary block matrix format so that SystemML can consume it 
and do the heavy lifting.

d) cleanup.jobj:
SystemML uses the MLContext and MatrixCharacteristics classes, which are 
instantiated in the JVM and whose object references are kept alive in SparkR; 
later, when SystemML has finished its computation, we clean up the objects. We 
achieve this using reference classes in R, whose finalize method registers 
cleanup.jobj once we have created the jobj via newJObject("sysml.class").

In general, I think the goal of a DataFrame-only API is great, but removing 
the RDD APIs 100% would cause many issues for our extension package on top of 
SparkR. Can we continue to keep them private for now (if the decision can't be 
reversed)?

One concern we have with using dapply is that dapplyInternal always broadcasts 
the variables set via useBroadcast. However, in many cases lapply is what one 
needs, as the user knows for sure that they will not be using the broadcast 
vars. Also, lapply falls naturally into the R syntax.

Thanks
Alok


> Expose several hidden DataFrame/RDD functions
> -
>
> Key: SPARK-16611
> URL: https://issues.apache.org/jira/browse/SPARK-16611
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
> cc:
> [~javierluraschi] [~j...@rstudio.com] [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()

2016-07-26 Thread Zoltan Fedor (JIRA)
Zoltan Fedor created SPARK-16741:


 Summary: spark.speculation causes duplicate rows in df.write.jdbc()
 Key: SPARK-16741
 URL: https://issues.apache.org/jira/browse/SPARK-16741
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.2
 Environment: PySpark 1.6.2, Oracle Linux 6.5, Oracle 11.2
Reporter: Zoltan Fedor


Since a fix added in Spark 1.6.2, we can write string data back into an Oracle 
database, so I tried it out and found that rows showed up duplicated in the 
database table after they were inserted.

The code we use is very simple:
df = sqlContext.sql("SELECT * FROM example_temp_table")
df.write.jdbc("jdbc:oracle:thin:"+connection_script, "target_table")

The data in the 'target_table' in the database has twice as many rows as the 
'df' dataframe in SparkSQL.

After some investigation, it turns out that this is caused by our 
spark.speculation setting being set to True.
As soon as we turned it off, no more duplicates were generated.

This somewhat makes sense - spark.speculation causes map tasks to run in 2 
copies - resulting in every row being inserted into our Oracle database twice.
Probably the df.write.jdbc() method does not account for a Spark context 
running in speculative mode, so the inserts coming from the speculative task 
copies also get inserted - causing every record to be inserted twice.

This bug is likely independent of the database type (we use Oracle) and of 
whether PySpark, Scala, or Java is used.
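The duplication mechanism described above can be simulated without Spark at all (a toy sketch; `write_partition` and the variable names are made up and merely stand in for the JDBC insert path): two attempts of the same task both perform a non-idempotent write before either can be cancelled.

```python
# Toy model of speculative execution duplicating a non-idempotent write.
rows = [("r1",), ("r2",)]   # one partition of the dataframe
table = []                  # stands in for the Oracle target table

def write_partition(partition):
    for row in partition:
        table.append(row)   # plain INSERT: not idempotent

write_partition(rows)  # original task attempt
write_partition(rows)  # speculative copy of the same task
print(len(table))      # 4: every input row lands twice
```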



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2016-07-26 Thread Eliano Marques (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394450#comment-15394450
 ] 

Eliano Marques commented on SPARK-8518:
---

Would it be possible to add some information about the coefficients standard 
deviation? 

According to the information 
https://spark.apache.org/docs/latest/ml-classification-regression.html#survival-regression,
 it would be possible to bring to the model outputs the second derivate of the 
gradient function for beta and log teta? 

Bringing this information to the model outputs would enable us to perform the 
standard statistical tests on the parameters, similar to what is done in the 
glm. 

Thanks

> Log-linear models for survival analysis
> ---
>
> Key: SPARK-8518
> URL: https://issues.apache.org/jira/browse/SPARK-8518
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>Priority: Critical
> Fix For: 1.6.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We want to add basic log-linear models for survival analysis. The 
> implementation should match the result from R's survival package 
> (http://cran.r-project.org/web/packages/survival/index.html).
> Design doc from [~yanboliang]: 
> https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Sylvain Zimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394437#comment-15394437
 ] 

Sylvain Zimmer edited comment on SPARK-16740 at 7/26/16 7:54 PM:
-

I'm not a Scala expert but from a quick review of the code, it appears that 
it's easy to overflow the {{range}} variable in {{optimize()}}:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L608

In my example, that would yield
{code}
scala> val range = 4661454128115150227L - (-5543241376386463808L)
range: Long = -8242048569207937581
{code}

Maybe we should add {{range >= 0}} as a condition of doing that optimization?


was (Author: sylvinus):
I'm not a Scala expert but from a quick review of the code, it appears that 
it's easy to overflow the {{range}} variable in {{optimize()}}:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L608

In my example, that would yield
{{ scala> val range = 4661454128115150227L - (-5543241376386463808L)
range: Long = -8242048569207937581
}}

Maybe we should add {{range >= 0}} as a condition of doing that optimization?

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
> Interestingly, it only seems to happen when reading Parquet data (I added a 
> {{crash = True}} variable to show it)
> This is a {{left_outer}} example, but it also crashes with a regular 
> {{inner}} join.
> {code}
> import os
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> # Valid Long values (-9223372036854775808 < -5543241376386463808 , 
> 4661454128115150227 < 9223372036854775807)
> data1 = [(4661454128115150227,), (-5543241376386463808,)]
> data2 = [(650460285, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
> # Should print [Row(id2=650460285, id1=None)]
> print result_df.collect()
> {code}
> When run with {{spark-submit}}, the final {{collect()}} call crashes with 
> this:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o61.collectToPython.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
>   at 
> 

[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Sylvain Zimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394437#comment-15394437
 ] 

Sylvain Zimmer commented on SPARK-16740:


I'm not a Scala expert but from a quick review of the code, it appears that 
it's easy to overflow the {{range}} variable in {{optimize()}}:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L608

In my example, that would yield
{code}
scala> val range = 4661454128115150227L - (-5543241376386463808L)
range: Long = -8242048569207937581
{code}

Maybe we should add {{range >= 0}} as a condition of doing that optimization?
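To see the overflow concretely without a JVM, here is a small sketch that emulates 64-bit signed Long subtraction in Python (the helper name `jvm_long_sub` is made up); it reproduces the negative {{range}} shown above:

```python
# Emulate JVM Long arithmetic: wrap the result to 64 bits, then reinterpret
# it as signed two's complement, the way Scala's Long subtraction silently
# overflows.
def jvm_long_sub(a, b):
    mask = (1 << 64) - 1
    r = (a - b) & mask                              # wrap to 64 bits
    return r - (1 << 64) if r >= (1 << 63) else r   # reinterpret as signed

range_val = jvm_long_sub(4661454128115150227, -5543241376386463808)
print(range_val)  # -8242048569207937581, matching the Scala REPL output
```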

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
> Interestingly, it only seems to happen when reading Parquet data (I added a 
> {{crash = True}} variable to show it)
> This is a {{left_outer}} example, but it also crashes with a regular 
> {{inner}} join.
> {code}
> import os
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> # Valid Long values (-9223372036854775808 < -5543241376386463808 , 
> 4661454128115150227 < 9223372036854775807)
> data1 = [(4661454128115150227,), (-5543241376386463808,)]
> data2 = [(650460285, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
> # Should print [Row(id2=650460285, id1=None)]
> print result_df.collect()
> {code}
> When run with {{spark-submit}}, the final {{collect()}} call crashes with 
> this:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o61.collectToPython.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> 

[jira] [Created] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-07-26 Thread Sylvain Zimmer (JIRA)
Sylvain Zimmer created SPARK-16740:
--

 Summary: joins.LongToUnsafeRowMap crashes with 
NegativeArraySizeException
 Key: SPARK-16740
 URL: https://issues.apache.org/jira/browse/SPARK-16740
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core, SQL
Affects Versions: 2.0.0
Reporter: Sylvain Zimmer


Hello,

Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
Interestingly, it only seems to happen when reading Parquet data (I added a 
{{crash = True}} variable to show it)

This is a {{left_outer}} example, but it also crashes with a regular {{inner}} 
join.

{code}
import os

from pyspark import SparkContext
from pyspark.sql import types as SparkTypes
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

schema1 = SparkTypes.StructType([
SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
])
schema2 = SparkTypes.StructType([
SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
])

# Valid Long values (-9223372036854775808 < -5543241376386463808 , 
4661454128115150227 < 9223372036854775807)
data1 = [(4661454128115150227,), (-5543241376386463808,)]
data2 = [(650460285, )]

df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)

crash = True
if crash:
os.system("rm -rf /tmp/sparkbug")
df1.write.parquet("/tmp/sparkbug/vertex")
df2.write.parquet("/tmp/sparkbug/edge")

df1 = sqlc.read.load("/tmp/sparkbug/vertex")
df2 = sqlc.read.load("/tmp/sparkbug/edge")

result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")

# Should print [Row(id2=650460285, id1=None)]
print result_df.collect()
{code}

When run with {{spark-submit}}, the final {{collect()}} call crashes with this:

{code}
py4j.protocol.Py4JJavaError: An error occurred while calling 
o61.collectToPython.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
at 
org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 

[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?

2016-07-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394329#comment-15394329
 ] 

Marcelo Vanzin commented on SPARK-16725:


Again, if your code depends on a specific version of Guava, you should shade 
Guava instead of forcing the infrastructure to upgrade. Spark has to ship a 
Guava jar because Hadoop needs it - it sucks, and the Hadoop folks want to fix 
that in version 3, but we're stuck with it for now.

Changing from 14 to 16 will fix *your* use case, but what about someone who 
wants a different version? Saying "shade your custom dependencies" works for 
everyone, even though it might not be the optimal answer.

> Migrate Guava to 16+?
> -
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14. However 
> Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> index f72c7de..abddafe 100644
> --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
> +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom}
>  import java.security.cert.X509Certificate
>  import javax.net.ssl._
>  
> -import com.google.common.hash.HashCodes
> +import com.google.common.hash.HashCode
>  import com.google.common.io.Files
>  import org.apache.hadoop.io.Text
>  
> @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
>  val secret = new Array[Byte](length)
>  rnd.nextBytes(secret)
>  
> -val cookie = HashCodes.fromBytes(secret).toString()
> +val cookie = HashCode.fromBytes(secret).toString()
>  SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, 
> cookie)
>  cookie
>} else {
> diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
> b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> index af50a6d..02545ae 100644
> --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
> +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> @@ -72,7 +72,7 @@ class SparkEnv (
>  
>// A general, soft-reference map for metadata needed during HadoopRDD 
> split computation
>// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats).
> -  private[spark] val hadoopJobMetadata = new 
> MapMaker().softValues().makeMap[String, Any]()
> +  private[spark] val hadoopJobMetadata = new 
> MapMaker().weakValues().makeMap[String, Any]()
>  
>private[spark] var driverTmpDir: Option[String] = None
>  
> diff --git a/pom.xml b/pom.xml
> index d064cb5..7c3e036 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -368,8 +368,7 @@
>
>  com.google.guava
>  guava
> -14.0.1
> -provided
> +19.0
>
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?

2016-07-26 Thread Min Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394311#comment-15394311
 ] 

Min Wei commented on SPARK-16725:
-

I am not sure the Hadoop jar is the issue, as I am just using the Spark shell 
for some local testing.

I "fixed" my test after upgrading Guava in the Spark core jar per the diff 
file. Here is a way to reproduce it after building the spark-cassandra 
connector:

   ./bin/spark-shell --jars 
./spark-cassandra-connector-assembly-1.6.0-27-g5760745.jar

The following exception goes away after the "upgrade". I agree that there would 
be other cases where the Hadoop jar could cause problems.

java.lang.IllegalStateException: Detected Guava issue #1635 which indicates 
that a version of Guava less than 16.01 is in use.  This introduces codec 
resolution issues and potentially other incompatibility issues in the driver.  
Please upgrade to Guava 16.01 or later.
  at com.datastax.driver.core.SanityChecks.checkGuava(SanityChecks.java:62)
  at com.datastax.driver.core.SanityChecks.check(SanityChecks.java:36)
  at com.datastax.driver.core.Cluster.<init>(Cluster.java:68)
  at 
com.datastax.spark.connector.cql.DefaultConnectionFactory$.clusterBuilder(CassandraConnectionFactory.scala:37)
  at 
com.datastax.spark.connector.cql.DefaultConnectionFactory$.createCluster(CassandraConnectionFactory.scala:98)
  at 
com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:163)
  at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:157)
  at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:157)
  at 
com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:34)
  at 
com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:60)
  at 
com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:85)
  at 
com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:114)
  at 
com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:127)
  at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:346)
  at 
com.datastax.spark.connector.cql.Schema$.tableFromCassandra(Schema.scala:366)
  at 
com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.tableDef(CassandraTableRowReaderProvider.scala:52)
  at 
com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef$lzycompute(CassandraTableScanRDD.scala:60)
  at 
com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef(CassandraTableScanRDD.scala:60)
  at 
com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.verify(CassandraTableRowReaderProvider.scala:140)
  at 
com.datastax.spark.connector.rdd.CassandraTableScanRDD.verify(CassandraTableScanRDD.scala:60)
  at 
com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:246)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1280)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1275)
  at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:132)
  at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:133)
  at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1315)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1314)
  ... 52 elided



> Migrate Guava to 16+?
> -
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14. However 
> Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> 

[jira] [Updated] (SPARK-16718) gbm-style treeboost

2016-07-26 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-16718:
--
Description: 
As an initial minimal change, we should provide TreeBoost as implemented in 
GBM for L1, L2, and logistic losses: by introducing a new "loss-based" 
impurity, tree leaves in GBTs can have loss-optimal predictions for their 
partition of the data.

The commit should include evidence of an accuracy improvement.


  was:
As an initial minimal change, we should provide TreeBoost as implemented in GBM 
for L1, L2, and logistic losses: by introducing a new "loss-based" impurity, 
tree leafs in GBTs can have loss-optimal predictions for their partition of the 
data.

Commit should have evidence of accuracy improvment



> gbm-style treeboost
> ---
>
> Key: SPARK-16718
> URL: https://issues.apache.org/jira/browse/SPARK-16718
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Assignee: Vladimir Feinberg
>
> As an initial minimal change, we should provide TreeBoost as implemented in 
> GBM for L1, L2, and logistic losses: by introducing a new "loss-based" 
> impurity, tree leaves in GBTs can have loss-optimal predictions for their 
> partition of the data.
> The commit should include evidence of an accuracy improvement.
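
For context, the loss-optimal leaf values that TreeBoost computes (these are the standard formulas from Friedman's GBM paper, not text from this issue):

```latex
% Loss-optimal leaf value for terminal region R_{jm} (Friedman, 1999):
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma\bigr)
% L2 loss:       \gamma_{jm} = \text{mean of the residuals } r_i \text{ in the leaf}
% L1 loss:       \gamma_{jm} = \text{median of the residuals } r_i \text{ in the leaf}
% Logistic loss: one Newton--Raphson step,
%   \gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} r_i}{\sum_{x_i \in R_{jm}} |r_i|\,(2 - |r_i|)}
```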






[jira] [Resolved] (SPARK-15703) Make ListenerBus event queue size configurable

2016-07-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-15703.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Make ListenerBus event queue size configurable
> --
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Assignee: Dhruve Ashar
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, 
> spark-dynamic-executor-allocation.png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 10 tasks but Detail stage page says it completed 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for all jobs pages list that only 89519/10 tasks finished but 
> its completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (10/10)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool






[jira] [Updated] (SPARK-15703) Make ListenerBus event queue size configurable

2016-07-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-15703:
--
Priority: Minor  (was: Critical)

> Make ListenerBus event queue size configurable
> --
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, 
> spark-dynamic-executor-allocation.png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 10 tasks but Detail stage page says it completed 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for all jobs pages list that only 89519/10 tasks finished but 
> its completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (10/10)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool






[jira] [Updated] (SPARK-15703) Make ListenerBus event queue size configurable

2016-07-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-15703:
--
Issue Type: Improvement  (was: Bug)

> Make ListenerBus event queue size configurable
> --
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Fix For: 2.1.0
>
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, 
> spark-dynamic-executor-allocation.png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 10 tasks but Detail stage page says it completed 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for all jobs pages list that only 89519/10 tasks finished but 
> its completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (10/10)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool






[jira] [Commented] (SPARK-15703) Make ListenerBus event queue size configurable

2016-07-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394271#comment-15394271
 ] 

Thomas Graves commented on SPARK-15703:
---

Using this JIRA to make the ListenerBus queue size configurable; we can use the 
other JIRA, SPARK-16441, to fix the synchronization.
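
A minimal sketch of what the resulting knob looks like from user code. The exact key name, `spark.scheduler.listenerbus.eventqueue.size`, is an assumption based on this issue's title and should be checked against the merged change:

```scala
import org.apache.spark.SparkConf

// Enlarge the listener bus event queue so bursty event streams (very large
// stages, dynamic allocation) are less likely to drop events and leave the
// UI metrics incomplete. The key name is assumed from this issue; before
// this change the queue size was a hard-coded constant.
val conf = new SparkConf()
  .setAppName("listener-bus-demo")
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000")
```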

> Make ListenerBus event queue size configurable
> --
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Fix For: 2.1.0
>
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, 
> spark-dynamic-executor-allocation.png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 10 tasks but Detail stage page says it completed 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for all jobs pages list that only 89519/10 tasks finished but 
> its completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (10/10)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool






[jira] [Updated] (SPARK-15703) Make ListenerBus event queue size configurable

2016-07-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-15703:
--
Assignee: Dhruve Ashar

> Make ListenerBus event queue size configurable
> --
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Assignee: Dhruve Ashar
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, 
> spark-dynamic-executor-allocation.png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 10 tasks but Detail stage page says it completed 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for all jobs pages list that only 89519/10 tasks finished but 
> its completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (10/10)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool






[jira] [Updated] (SPARK-15703) Make ListenerBus event queue size configurable

2016-07-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-15703:
--
Summary: Make ListenerBus event queue size configurable  (was: Spark UI 
doesn't show all tasks as completed when it should)

> Make ListenerBus event queue size configurable
> --
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, 
> spark-dynamic-executor-allocation.png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 10 tasks but Detail stage page says it completed 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for all jobs pages list that only 89519/10 tasks finished but 
> its completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (10/10)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool






[jira] [Created] (SPARK-16739) GBTClassifier should be a Classifier, not Predictor

2016-07-26 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16739:
-

 Summary: GBTClassifier should be a Classifier, not Predictor
 Key: SPARK-16739
 URL: https://issues.apache.org/jira/browse/SPARK-16739
 Project: Spark
  Issue Type: Improvement
Reporter: Vladimir Feinberg
Priority: Minor


Should probably wait for SPARK-4240 to be resolved first.






[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?

2016-07-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394214#comment-15394214
 ] 

Marcelo Vanzin commented on SPARK-16725:


Spark depends on Guava 14 and shades it. The Guava you see on the classpath is 
because of Hadoop, and we can't change Hadoop. If the Cassandra connector needs 
Guava 16, it should shade it like Spark does.
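
For connector authors, the shading Marcelo describes can be done with sbt-assembly. A hypothetical build.sbt fragment (assumes the sbt-assembly plugin is enabled; the `shadeguava` prefix is illustrative):

```scala
// build.sbt -- relocate the connector's bundled Guava 16+ classes so they
// cannot collide with the Guava 14 that Hadoop puts on Spark's classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule
    .rename("com.google.common.**" -> "shadeguava.com.google.common.@1")
    .inAll
)
```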

> Migrate Guava to 16+?
> -
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14. However 
> Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> index f72c7de..abddafe 100644
> --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
> +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom}
>  import java.security.cert.X509Certificate
>  import javax.net.ssl._
>  
> -import com.google.common.hash.HashCodes
> +import com.google.common.hash.HashCode
>  import com.google.common.io.Files
>  import org.apache.hadoop.io.Text
>  
> @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
>  val secret = new Array[Byte](length)
>  rnd.nextBytes(secret)
>  
> -val cookie = HashCodes.fromBytes(secret).toString()
> +val cookie = HashCode.fromBytes(secret).toString()
>  SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, 
> cookie)
>  cookie
>} else {
> diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
> b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> index af50a6d..02545ae 100644
> --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
> +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> @@ -72,7 +72,7 @@ class SparkEnv (
>  
>// A general, soft-reference map for metadata needed during HadoopRDD 
> split computation
>// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats).
> -  private[spark] val hadoopJobMetadata = new 
> MapMaker().softValues().makeMap[String, Any]()
> +  private[spark] val hadoopJobMetadata = new 
> MapMaker().weakValues().makeMap[String, Any]()
>  
>private[spark] var driverTmpDir: Option[String] = None
>  
> diff --git a/pom.xml b/pom.xml
> index d064cb5..7c3e036 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -368,8 +368,7 @@
>
>  com.google.guava
>  guava
> -14.0.1
> -provided
> +19.0
>
>
>






[jira] [Assigned] (SPARK-16736) remove redundant FileSystem status checks calls from Spark codebase

2016-07-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16736:


Assignee: (was: Apache Spark)

> remove redundant FileSystem status checks calls from Spark codebase
> ---
>
> Key: SPARK-16736
> URL: https://issues.apache.org/jira/browse/SPARK-16736
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> The Hadoop {{FileSystem.exists()}} and {{FileSystem.isDirectory()}} calls are 
> wrappers around {{FileSystem.getFileStatus()}}; the latter puts load on an 
> HDFS NN and is very, very slow against object stores.
> # If these calls are followed by any getFileStatus() calls, they can be 
> eliminated by careful merging and by pulling the catching of 
> {{FileNotFoundException}} out of the exists() call into the Spark code.
> # Any sequence of exists + delete can be optimised by removing the exists 
> check, relying on {{FileSystem.delete()}} to be a no-op if the destination 
> path is not present. That's a tested requirement of all Hadoop-compatible 
> filesystems and object stores.
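
A sketch of the two rewrites described above (hypothetical helper names; assumes hadoop-client on the classpath):

```scala
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// 1. Merge exists() + getFileStatus() into a single round trip: call
//    getFileStatus() once and treat FileNotFoundException as "absent",
//    instead of probing existence first and then fetching the status.
def statusIfExists(fs: FileSystem, path: Path): Option[FileStatus] =
  try Some(fs.getFileStatus(path))
  catch { case _: FileNotFoundException => None }

// 2. Drop the exists() guard before delete(): delete() is specified to be
//    a no-op (returning false) when the path is missing.
def deleteQuietly(fs: FileSystem, path: Path): Boolean =
  fs.delete(path, /* recursive = */ true)
```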






[jira] [Commented] (SPARK-16736) remove redundant FileSystem status checks calls from Spark codebase

2016-07-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394196#comment-15394196
 ] 

Apache Spark commented on SPARK-16736:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/14371

> remove redundant FileSystem status checks calls from Spark codebase
> ---
>
> Key: SPARK-16736
> URL: https://issues.apache.org/jira/browse/SPARK-16736
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> The Hadoop {{FileSystem.exists()}} and {{FileSystem.isDirectory()}} calls are 
> wrappers around {{FileSystem.getFileStatus()}}; the latter puts load on an 
> HDFS NN and is very, very slow against object stores.
> # If these calls are followed by any getFileStatus() calls, they can be 
> eliminated by careful merging and by pulling the catching of 
> {{FileNotFoundException}} out of the exists() call into the Spark code.
> # Any sequence of exists + delete can be optimised by removing the exists 
> check, relying on {{FileSystem.delete()}} to be a no-op if the destination 
> path is not present. That's a tested requirement of all Hadoop-compatible 
> filesystems and object stores.






[jira] [Assigned] (SPARK-16736) remove redundant FileSystem status checks calls from Spark codebase

2016-07-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16736:


Assignee: Apache Spark

> remove redundant FileSystem status checks calls from Spark codebase
> ---
>
> Key: SPARK-16736
> URL: https://issues.apache.org/jira/browse/SPARK-16736
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Minor
>
> The Hadoop {{FileSystem.exists()}} and {{FileSystem.isDirectory()}} calls are 
> wrappers around {{FileSystem.getFileStatus()}}; the latter puts load on an 
> HDFS NN and is very, very slow against object stores.
> # If these calls are followed by any getFileStatus() calls, they can be 
> eliminated by careful merging and by pulling the catching of 
> {{FileNotFoundException}} out of the exists() call into the Spark code.
> # Any sequence of exists + delete can be optimised by removing the exists 
> check, relying on {{FileSystem.delete()}} to be a no-op if the destination 
> path is not present. That's a tested requirement of all Hadoop-compatible 
> filesystems and object stores.






[jira] [Commented] (SPARK-6948) VectorAssembler should choose dense/sparse for output based on number of zeros

2016-07-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394190#comment-15394190
 ] 

Sean Owen commented on SPARK-6948:
--

I tend to agree; I ran into this just recently in the exact same context. It 
was surprising when a tiny vector was made sparse and was then not usable with 
StandardScaler (with "subtract mean" enabled).

> VectorAssembler should choose dense/sparse for output based on number of zeros
> --
>
> Key: SPARK-6948
> URL: https://issues.apache.org/jira/browse/SPARK-6948
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
> Fix For: 1.4.0
>
>
> Now VectorAssembler only outputs sparse vectors. We should choose 
> dense/sparse format automatically, whichever uses less memory.






[jira] [Commented] (SPARK-6948) VectorAssembler should choose dense/sparse for output based on number of zeros

2016-07-26 Thread Max Moroz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394120#comment-15394120
 ] 

Max Moroz commented on SPARK-6948:
--

Should this be a parameter for `transform` method of the VectorAssembler 
instance (with the current behavior used as default)? For example, I saw people 
struggle with the `SparseVector` output from VectorAssembler when they wanted 
to later apply `StandardScaler` with de-meaning. They will end up with an 
unnecessary roundtrip conversion between `DenseVector` and `SparseVector`.

I *think* a similar extra conversion would occur if they later extract some of 
the features from `VectorAssembler` output, and combine with other features 
that happen to be dense. In that case, if the user knows in advance that 
`DenseVector` would be required, they might choose to go with it from the 
beginning.
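
The "whichever uses less memory" heuristic the issue asks for can be sketched as plain arithmetic. The byte constants below are assumptions for illustration (roughly 8 bytes per dense slot vs. a 4-byte index plus an 8-byte value per stored sparse entry), not Spark's exact accounting:

```scala
// Approximate serialized sizes: a dense vector stores 8 bytes per slot plus
// a small header; a sparse vector stores a 4-byte index and an 8-byte value
// per non-zero plus a larger header. Constants are illustrative assumptions.
def denseBytes(size: Int): Long = 8L * size + 8
def sparseBytes(numNonzeros: Int): Long = 12L * numNonzeros + 20

// Prefer the sparse representation only when it is actually smaller.
def useSparse(size: Int, numNonzeros: Int): Boolean =
  sparseBytes(numNonzeros) < denseBytes(size)
```

Under these assumed constants, a length-4 vector with 3 non-zeros stays dense (useSparse(4, 3) is false), which matches the "tiny vector made sparse" surprise described above.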

> VectorAssembler should choose dense/sparse for output based on number of zeros
> --
>
> Key: SPARK-6948
> URL: https://issues.apache.org/jira/browse/SPARK-6948
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
> Fix For: 1.4.0
>
>
> Now VectorAssembler only outputs sparse vectors. We should choose 
> dense/sparse format automatically, whichever uses less memory.






[jira] [Commented] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-26 Thread Daniel Barclay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394121#comment-15394121
 ] 

Daniel Barclay commented on SPARK-16652:


Not that I can really tell either, but this does not seem to be generated code 
that's invalid in the sense of not following the class file format or failing 
bytecode validation.

It just seems to be invalid in the sense of accessing the wrong memory (e.g., 
something accessed the wrong address, or allocation/deallocation got out of 
sync):

1. Recall that I initially saw "fault occurred in a recent unsafe memory 
access" exceptions (before I trimmed the test case down to its current form).

2. Also, note how changing the string literal on line 63(?) in my test case 
changes things:
- With the current multi-character value, the program crashes.
- With a two-character value, it also crashes.
- With a one-character value, it runs.
- With a zero-character value (empty string), it runs.
- With a null value, it runs.

(When I was getting the "fault ... in ... unsafe" messages, a one-character 
value would crash, while an empty string or null would run.)




> JVM crash from unsafe memory access for Dataset of class with List[Long]
> 
>
> Key: SPARK-16652
> URL: https://issues.apache.org/jira/browse/SPARK-16652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2
> Environment: Scala 2.10.
> JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
> MacOs 10.11.2
>Reporter: Daniel Barclay
> Attachments: UnsafeAccessCrashBugTest.scala
>
>
> Generating and writing out a {{Dataset}} of a class that has a {{List}} (at 
> least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
> crash.
> The crash seems to be related to unsafe memory access especially because 
> earlier code (before I got it reduced to the current bug test case)  reported 
> "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code}}".






[jira] [Commented] (SPARK-16676) Spark jobs stay in pending

2016-07-26 Thread Joe Chong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394094#comment-15394094
 ] 

Joe Chong commented on SPARK-16676:
---

Unfortunately, that's all the logs available in the console, as pasted above. 
Any idea on where else to look? 

> Spark jobs stay in pending
> --
>
> Key: SPARK-16676
> URL: https://issues.apache.org/jira/browse/SPARK-16676
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Shell
>Affects Versions: 1.5.2
> Environment: Mac OS X Yosemite, Terminal, Spark-shell standalone
>Reporter: Joe Chong
> Attachments: Spark UI stays @ pending.png
>
>
> I've been having issues executing certain Scala statements within the 
> Spark shell. The statements come from a tutorial/blog written by 
> Carol McDonald at MapR. 
> The import statements and reading text files into DataFrames are OK. However, 
> when I try df.show(), execution hits a roadblock. Checking the 
> Spark UI, I see that the stage is active, but one of its dependent jobs 
> stays in Pending without any movement. The logs are below. 
> scala> fltCountsql.show()
> 16/07/22 11:40:16 INFO spark.SparkContext: Starting job: show at <console>:46
> 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Registering RDD 31 (show at 
> <console>:46)
> 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Got job 4 (show at 
> <console>:46) with 200 output partitions
> 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Final stage: ResultStage 
> 8 (show at <console>:46)
> 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Parents of final stage: 
> List(ShuffleMapStage 7)
> 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Missing parents: 
> List(ShuffleMapStage 7)
> 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 7 
> (MapPartitionsRDD[31] at show at <console>:46), which has no missing parents
> 16/07/22 11:40:16 INFO storage.MemoryStore: ensureFreeSpace(18128) called 
> with curMem=115755879, maxMem=2778495713
> 16/07/22 11:40:16 INFO storage.MemoryStore: Block broadcast_5 stored as 
> values in memory (estimated size 17.7 KB, free 2.5 GB)
> 16/07/22 11:40:16 INFO storage.MemoryStore: ensureFreeSpace(7527) called with 
> curMem=115774007, maxMem=2778495713
> 16/07/22 11:40:16 INFO storage.MemoryStore: Block broadcast_5_piece0 stored 
> as bytes in memory (estimated size 7.4 KB, free 2.5 GB)
> 16/07/22 11:40:16 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in 
> memory on localhost:61408 (size: 7.4 KB, free: 2.5 GB)
> 16/07/22 11:40:16 INFO spark.SparkContext: Created broadcast 5 from broadcast 
> at DAGScheduler.scala:861
> 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Submitting 2 missing tasks 
> from ShuffleMapStage 7 (MapPartitionsRDD[31] at show at <console>:46)
> 16/07/22 11:40:16 INFO scheduler.TaskSchedulerImpl: Adding task set 7.0 with 
> 2 tasks
> 16/07/22 11:40:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 7.0 (TID 4, localhost, PROCESS_LOCAL, 2156 bytes)
> 16/07/22 11:40:16 INFO executor.Executor: Running task 0.0 in stage 7.0 (TID 
> 4)
> 16/07/22 11:40:16 INFO storage.BlockManager: Found block rdd_2_0 locally
> 16/07/22 11:40:17 INFO executor.Executor: Finished task 0.0 in stage 7.0 (TID 
> 4). 2738 bytes result sent to driver
> 16/07/22 11:40:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
> 7.0 (TID 4) in 920 ms on localhost (1/2)






[jira] [Commented] (SPARK-16725) Migrate Guava to 16+

2016-07-26 Thread Min Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394084#comment-15394084
 ] 

Min Wei commented on SPARK-16725:
-

Thanks for editing the JIRA fields properly. I used the first/default values in 
haste. 

The diff I attached is meant for illustration purposes only; my intent is to 
open a discussion of the options. Shading is another option. 

Personally, I would prefer the upgrade (non-shading) option, since I assume 
Guava won't be the only troublesome jar down the road. Yes, it might open a can 
of worms around version management, but given Spark's status as a general 
platform, it would be a good problem to solve :-) 

My two cents. 
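For context, the shading option mentioned above would look roughly like the
following Maven Shade Plugin relocation, sketched in the style of the pom.xml
diff below. This is illustrative only; the {{shadedPattern}} name and plugin
placement are assumptions, not Spark's actual build configuration.

```xml
<!-- Sketch: relocate Guava classes inside the assembly jar so that a
     connector requiring Guava 16+ (e.g. the Cassandra driver) never sees
     Spark's bundled copy. The shadedPattern below is illustrative. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.spark_project.guava</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

The trade-off is the one discussed in this thread: shading isolates Spark from
Guava version conflicts but complicates the build, while a plain upgrade keeps
the build simple but pins every downstream user to one Guava version.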



> Migrate Guava to 16+
> 
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14. However, the 
> Spark Cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+.
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> index f72c7de..abddafe 100644
> --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
> +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom}
>  import java.security.cert.X509Certificate
>  import javax.net.ssl._
>  
> -import com.google.common.hash.HashCodes
> +import com.google.common.hash.HashCode
>  import com.google.common.io.Files
>  import org.apache.hadoop.io.Text
>  
> @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
>  val secret = new Array[Byte](length)
>  rnd.nextBytes(secret)
>  
> -val cookie = HashCodes.fromBytes(secret).toString()
> +val cookie = HashCode.fromBytes(secret).toString()
>  SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, 
> cookie)
>  cookie
>} else {
> diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
> b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> index af50a6d..02545ae 100644
> --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
> +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> @@ -72,7 +72,7 @@ class SparkEnv (
>  
>// A general, soft-reference map for metadata needed during HadoopRDD 
> split computation
>// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats).
> -  private[spark] val hadoopJobMetadata = new 
> MapMaker().softValues().makeMap[String, Any]()
> +  private[spark] val hadoopJobMetadata = new 
> MapMaker().weakValues().makeMap[String, Any]()
>  
>private[spark] var driverTmpDir: Option[String] = None
>  
> diff --git a/pom.xml b/pom.xml
> index d064cb5..7c3e036 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -368,8 +368,7 @@
>
>  com.google.guava
>  guava
> -14.0.1
> -provided
> +19.0
>
>
>






[jira] [Updated] (SPARK-16725) Migrate Guava to 16+?

2016-07-26 Thread Min Wei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Wei updated SPARK-16725:

Summary: Migrate Guava to 16+?  (was: Migrate Guava to 16+)

> Migrate Guava to 16+?
> -
>
> Key: SPARK-16725
> URL: https://issues.apache.org/jira/browse/SPARK-16725
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Min Wei
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently Spark depends on an old version of Guava, version 14. However, the 
> Spark Cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+.
> diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala 
> b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> index f72c7de..abddafe 100644
> --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala
> +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala
> @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom}
>  import java.security.cert.X509Certificate
>  import javax.net.ssl._
>  
> -import com.google.common.hash.HashCodes
> +import com.google.common.hash.HashCode
>  import com.google.common.io.Files
>  import org.apache.hadoop.io.Text
>  
> @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
>  val secret = new Array[Byte](length)
>  rnd.nextBytes(secret)
>  
> -val cookie = HashCodes.fromBytes(secret).toString()
> +val cookie = HashCode.fromBytes(secret).toString()
>  SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, 
> cookie)
>  cookie
>} else {
> diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala 
> b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> index af50a6d..02545ae 100644
> --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
> +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
> @@ -72,7 +72,7 @@ class SparkEnv (
>  
>// A general, soft-reference map for metadata needed during HadoopRDD 
> split computation
>// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats).
> -  private[spark] val hadoopJobMetadata = new 
> MapMaker().softValues().makeMap[String, Any]()
> +  private[spark] val hadoopJobMetadata = new 
> MapMaker().weakValues().makeMap[String, Any]()
>  
>private[spark] var driverTmpDir: Option[String] = None
>  
> diff --git a/pom.xml b/pom.xml
> index d064cb5..7c3e036 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -368,8 +368,7 @@
>
>  com.google.guava
>  guava
> -14.0.1
> -provided
> +19.0
>
>
>






[jira] [Commented] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-26 Thread Daniel Barclay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394082#comment-15394082
 ] 

Daniel Barclay commented on SPARK-16652:


> This could be reproduced at my side with Spark 1.6.1 and 1.6.2, but not with 
> the master branch.

Yes, I see it working on the master branch too.

> JVM crash from unsafe memory access for Dataset of class with List[Long]
> 
>
> Key: SPARK-16652
> URL: https://issues.apache.org/jira/browse/SPARK-16652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2
> Environment: Scala 2.10.
> JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
> MacOs 10.11.2
>Reporter: Daniel Barclay
> Attachments: UnsafeAccessCrashBugTest.scala
>
>
> Generating and writing out a {{Dataset}} of a class that has a {{List}} (at 
> least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
> crash.
> The crash seems to be related to unsafe memory access, especially because 
> earlier code (before I reduced it to the current bug test case) reported 
> "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code}}".






[jira] [Created] (SPARK-16738) Queryable state for Spark State Store

2016-07-26 Thread Mark Sumner (JIRA)
Mark Sumner created SPARK-16738:
---

 Summary: Queryable state for Spark State Store
 Key: SPARK-16738
 URL: https://issues.apache.org/jira/browse/SPARK-16738
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL, Streaming
Affects Versions: 2.0.0
Reporter: Mark Sumner


Spark 2.0 will introduce the new State Store to allow state management outside 
the RDD model (see SPARK-13809).

This proposal seeks to include a mechanism (in a future release) to expose this 
internal store to external applications for querying.

This would then make it possible to interact with aggregated state without 
needing to synchronously write (and read) to/from an external store.






[jira] [Resolved] (SPARK-15271) Allow force pulling executor docker images

2016-07-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15271.
---
Resolution: Fixed

Issue resolved by pull request 14348
[https://github.com/apache/spark/pull/14348]

> Allow force pulling executor docker images
> --
>
> Key: SPARK-15271
> URL: https://issues.apache.org/jira/browse/SPARK-15271
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.6.1
>Reporter: Philipp Hoffmann
>Assignee: Philipp Hoffmann
> Fix For: 2.1.0
>
>
> By default, Mesos agents will not pull Docker images that are already cached 
> locally.
> Because of this, in order to run the current version of a mutable tag (like 
> {{...:latest}}) from the Docker repository, you have to explicitly tell the 
> Mesos agent to pull the image (force pull). Otherwise the Mesos agent will 
> run an old (cached) version.
> The feature for force-pulling the image was introduced in Mesos 0.22:
> https://github.com/apache/mesos/commit/8682569df528717ff5efb64da26b1b49c39c4efd
> This ticket is about making use of this feature in Spark in order to force 
> Mesos agents to pull the executor's Docker image.






[jira] [Assigned] (SPARK-16713) Limit codegen method size to 8KB

2016-07-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16713:


Assignee: (was: Apache Spark)

> Limit codegen method size to 8KB
> 
>
> Key: SPARK-16713
> URL: https://issues.apache.org/jira/browse/SPARK-16713
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
>
> Ideally, codegen methods should be less than 8KB in bytecode size. Beyond 
> 8KB, the JIT won't compile a method, which can cause performance degradation. 
> We have seen this for queries with a wide schema (30+ fields), where 
> agg_doAggregateWithKeys() can be more than 8KB. This is also a major reason 
> for performance regression when we enable the fast aggregate hashmap (such as 
> when using VectorizedHashMapGenerator.scala).
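The general fix direction for oversized generated methods is to split one huge
function body into many small helpers, each safely under the compiler's size
limit. The toy sketch below illustrates that idea in Python on generated source
strings; it is not Spark's actual Java codegen, and all names in it are
illustrative.

```python
# Toy illustration (not Spark's codegen): emit many statements, but chunk
# them into small helper functions instead of one huge function body, the
# same way codegen can keep each generated method under the JIT's ~8KB limit.

def generate_source(n_stmts, chunk=100):
    helpers = []
    for i in range(0, n_stmts, chunk):
        # Each helper holds at most `chunk` statements.
        body = "\n".join(f"    acc += {j}" for j in range(i, min(i + chunk, n_stmts)))
        helpers.append(f"def helper_{i // chunk}(acc):\n{body}\n    return acc")
    # The top-level function only dispatches to the small helpers.
    calls = "\n".join(f"    acc = helper_{k}(acc)" for k in range(len(helpers)))
    main = f"def compute():\n    acc = 0\n{calls}\n    return acc"
    return "\n\n".join(helpers + [main])

namespace = {}
exec(generate_source(1000), namespace)
print(namespace["compute"]())  # prints 499500 (sum of 0..999)
```

The same 1000 statements in a single function would be one huge body; chunking
keeps every generated method small while preserving the overall computation.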






[jira] [Commented] (SPARK-16713) Limit codegen method size to 8KB

2016-07-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393863#comment-15393863
 ] 

Apache Spark commented on SPARK-16713:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14370

> Limit codegen method size to 8KB
> 
>
> Key: SPARK-16713
> URL: https://issues.apache.org/jira/browse/SPARK-16713
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
>
> Ideally, codegen methods should be less than 8KB in bytecode size. Beyond 
> 8KB, the JIT won't compile a method, which can cause performance degradation. 
> We have seen this for queries with a wide schema (30+ fields), where 
> agg_doAggregateWithKeys() can be more than 8KB. This is also a major reason 
> for performance regression when we enable the fast aggregate hashmap (such as 
> when using VectorizedHashMapGenerator.scala).






[jira] [Assigned] (SPARK-16713) Limit codegen method size to 8KB

2016-07-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16713:


Assignee: Apache Spark

> Limit codegen method size to 8KB
> 
>
> Key: SPARK-16713
> URL: https://issues.apache.org/jira/browse/SPARK-16713
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
>Assignee: Apache Spark
>
> Ideally, codegen methods should be less than 8KB in bytecode size. Beyond 
> 8KB, the JIT won't compile a method, which can cause performance degradation. 
> We have seen this for queries with a wide schema (30+ fields), where 
> agg_doAggregateWithKeys() can be more than 8KB. This is also a major reason 
> for performance regression when we enable the fast aggregate hashmap (such as 
> when using VectorizedHashMapGenerator.scala).






[jira] [Commented] (SPARK-16736) remove redundant FileSystem status checks calls from Spark codebase

2016-07-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393770#comment-15393770
 ] 

Steve Loughran commented on SPARK-16736:


See also HIVE-14323

> remove redundant FileSystem status checks calls from Spark codebase
> ---
>
> Key: SPARK-16736
> URL: https://issues.apache.org/jira/browse/SPARK-16736
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> The Hadoop {{FileSystem.exists()}} and {{FileSystem.isDirectory()}} calls are 
> wrappers around {{FileSystem.getFileStatus()}}, which puts load on an HDFS 
> NameNode and is very, very slow against object stores.
> # If these calls are followed by any {{getFileStatus()}} calls, they can be 
> eliminated by careful merging and by pulling the catching of 
> {{FileNotFoundException}} out of the exists() call into the Spark code.
> # Any sequence of exists + delete can be optimised by removing the exists 
> check, relying on {{FileSystem.delete()}} being a no-op if the destination 
> path is not present. That's a tested requirement of all Hadoop-compatible 
> filesystems and object stores.
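The second optimisation is the classic "check then act" vs. "just act" pattern.
Sketched here in Python against the local filesystem rather than the real
Hadoop API, purely to illustrate why dropping the exists() check saves a
metadata round trip:

```python
import os
import tempfile

# Anti-pattern: exists() followed by delete() costs an extra metadata call
# (on HDFS, an extra NameNode RPC; on an object store, a slow HEAD/LIST).
def delete_with_check(path):
    if os.path.exists(path):      # extra round trip
        os.remove(path)

# Optimised: just delete, treating "already absent" as success, mirroring
# FileSystem.delete()'s contract of being a no-op for a missing path.
def delete_directly(path):
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False

fd, p = tempfile.mkstemp()
os.close(fd)
assert delete_directly(p) is True    # file existed and was removed
assert delete_directly(p) is False   # already gone: no-op, no error
```

Both functions leave the filesystem in the same state; the second one simply
does it with one call instead of two.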






[jira] [Updated] (SPARK-16736) remove redundant FileSystem status checks calls from Spark codebase

2016-07-26 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-16736:
---
Summary: remove redundant FileSystem status checks calls from Spark 
codebase  (was: remove redundant FileSystem.exists() calls from Spark codebase)

> remove redundant FileSystem status checks calls from Spark codebase
> ---
>
> Key: SPARK-16736
> URL: https://issues.apache.org/jira/browse/SPARK-16736
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> The Hadoop {{FileSystem.exists()}} and {{FileSystem.isDirectory()}} calls are 
> wrappers around {{FileSystem.getFileStatus()}}, which puts load on an HDFS 
> NameNode and is very, very slow against object stores.
> # If these calls are followed by any {{getFileStatus()}} calls, they can be 
> eliminated by careful merging and by pulling the catching of 
> {{FileNotFoundException}} out of the exists() call into the Spark code.
> # Any sequence of exists + delete can be optimised by removing the exists 
> check, relying on {{FileSystem.delete()}} being a no-op if the destination 
> path is not present. That's a tested requirement of all Hadoop-compatible 
> filesystems and object stores.






[jira] [Created] (SPARK-16737) ListingFileCatalog comments about RPC calls in object store isn't correct

2016-07-26 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-16737:
--

 Summary: ListingFileCatalog comments about RPC calls in object 
store isn't correct
 Key: SPARK-16737
 URL: https://issues.apache.org/jira/browse/SPARK-16737
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Steve Loughran
Priority: Trivial


The comment text which came in with SPARK-16121 says:

{code}
// - Although S3/S3A/S3N file system can be quite slow for remote file metadata
//   operations, calling `getFileBlockLocations` does no harm here since these
//   file system implementations don't actually issue RPC for this method.
{code}
That doesn't hold for OpenStack Swift, which can exhibit rack locality if the 
Swift nodes are spread across the same racks as the application. It is safest 
to remove that comment.






[jira] [Updated] (SPARK-16737) ListingFileCatalog comments about RPC calls in object store isn't correct

2016-07-26 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-16737:
---
Description: 
The comment text which came in with SPARK-16121 says:

{code}
// - Although S3/S3A/S3N file system can be quite slow for remote file metadata
//   operations, calling `getFileBlockLocations` does no harm here since these
//   file system implementations don't actually issue RPC for this method.
{code}
That doesn't hold for OpenStack Swift, which can exhibit rack locality if the 
Swift nodes are spread across the same racks as the application. It is safest 
to remove that comment.

  was:
The comment text which came in with SPARK-1612 says 

{code}
- Although S3/S3A/S3N file system can be quite slow for remote file metadata
  //   operations, calling `getFileBlockLocations` does no harm here 
since these file system
  //   implementations don't actually issue RPC for this method.
{code}
That doesn't hold for openstack swift , which can exhibit rack locality if the 
swift nodes are spread across the same racks as the application. It is safest 
to remove that comment:


> ListingFileCatalog comments about RPC calls in object store isn't correct
> -
>
> Key: SPARK-16737
> URL: https://issues.apache.org/jira/browse/SPARK-16737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Steve Loughran
>Priority: Trivial
>
> The comment text which came in with SPARK-16121 says:
> {code}
> // - Although S3/S3A/S3N file system can be quite slow for remote file metadata
> //   operations, calling `getFileBlockLocations` does no harm here since these
> //   file system implementations don't actually issue RPC for this method.
> {code}
> That doesn't hold for OpenStack Swift, which can exhibit rack locality if 
> the Swift nodes are spread across the same racks as the application. It is 
> safest to remove that comment.






[jira] [Updated] (SPARK-16717) Dataframe (jdbc) is missing a way to link an external function to get a connection

2016-07-26 Thread Marco Colombo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Colombo updated SPARK-16717:
--
Description: 
In JdbcRDD it was possible to use a function to get a JDBC connection. This 
allowed external handling of the connections, which is no longer possible with 
DataFrames. 
Please consider an addition to DataFrames that accepts an externally provided 
connection factory (such as a connection pool) in order to make data loading 
more efficient, avoiding connection close/recreation. Connections should be 
taken from one provided function and returned through a second function once no 
longer used by the RDD. This would make JDBC handling more efficient.

I.e. extending the DataFrame class with a method like 
jdbc(Function0 getConnection, Function0 
releaseConnection(java.sql.Connection))


  was:
In JdbcRRD it was possible to use a function to get a JDBC connection. This 
allow an external handling of the connections while now this is no more 
possible with dataframes. 
Please consider an addition to Dataframes for using an externally provided 
connectionFactory (such as a connection pool) in order to make data loading 
more efficient, avoiding connection close/recreation. 


> Dataframe (jdbc) is missing a way to link an external function to get a 
> connection
> ---
>
> Key: SPARK-16717
> URL: https://issues.apache.org/jira/browse/SPARK-16717
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.2
>Reporter: Marco Colombo
>
> In JdbcRDD it was possible to use a function to get a JDBC connection. This 
> allowed external handling of the connections, which is no longer possible 
> with DataFrames. 
> Please consider an addition to DataFrames that accepts an externally provided 
> connection factory (such as a connection pool) in order to make data loading 
> more efficient, avoiding connection close/recreation. Connections should be 
> taken from one provided function and returned through a second function once 
> no longer used by the RDD. This would make JDBC handling more efficient.
> I.e. extending the DataFrame class with a method like 
> jdbc(Function0 getConnection, Function0 
> releaseConnection(java.sql.Connection))
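The proposed borrow/return shape can be sketched in Python with the standard
library's sqlite3 in place of a JDBC driver. Everything here is illustrative:
the `Pool` class and the `get_connection`/`release_connection` names are
assumptions standing in for the caller-supplied functions the ticket describes.

```python
import sqlite3
from queue import Queue

# Sketch of the proposed API shape: data loading borrows connections from a
# caller-supplied factory and hands them back, instead of opening and closing
# a fresh connection per load.
class Pool:
    def __init__(self, n):
        self._q = Queue()
        for _ in range(n):
            self._q.put(sqlite3.connect(":memory:"))

    def get_connection(self):
        return self._q.get()           # borrow an existing connection

    def release_connection(self, conn):
        self._q.put(conn)              # returned, not closed: reusable later

def load_partition(get_connection, release_connection, query):
    conn = get_connection()            # taken from the provided function
    try:
        return conn.execute(query).fetchall()
    finally:
        release_connection(conn)       # given back instead of conn.close()

pool = Pool(2)
rows = load_partition(pool.get_connection, pool.release_connection,
                      "SELECT 1 + 1")
print(rows)  # [(2,)]
```

The point of the design is the last two arguments: because the loader never
calls close() itself, the caller fully controls connection lifetime and reuse.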






[jira] [Commented] (SPARK-16717) Dataframe (jdbc) is missing a way to link an external function to get a connection

2016-07-26 Thread Marco Colombo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393696#comment-15393696
 ] 

Marco Colombo commented on SPARK-16717:
---

Typo, I mean 1.6.2

> Dataframe (jdbc) is missing a way to link an external function to get a 
> connection
> ---
>
> Key: SPARK-16717
> URL: https://issues.apache.org/jira/browse/SPARK-16717
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.2
>Reporter: Marco Colombo
>
> In JdbcRDD it was possible to use a function to get a JDBC connection. This 
> allowed external handling of the connections, which is no longer possible 
> with DataFrames. 
> Please consider an addition to DataFrames that accepts an externally provided 
> connection factory (such as a connection pool) in order to make data loading 
> more efficient, avoiding connection close/recreation. 






[jira] [Updated] (SPARK-16717) Dataframe (jdbc) is missing a way to link an external function to get a connection

2016-07-26 Thread Marco Colombo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Colombo updated SPARK-16717:
--
Affects Version/s: (was: 1.3.0)
   1.6.2

> Dataframe (jdbc) is missing a way to link an external function to get a 
> connection
> ---
>
> Key: SPARK-16717
> URL: https://issues.apache.org/jira/browse/SPARK-16717
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.2
>Reporter: Marco Colombo
>
> In JdbcRDD it was possible to use a function to get a JDBC connection. This 
> allowed external handling of the connections, which is no longer possible 
> with DataFrames. 
> Please consider an addition to DataFrames that accepts an externally provided 
> connection factory (such as a connection pool) in order to make data loading 
> more efficient, avoiding connection close/recreation. 






[jira] [Comment Edited] (SPARK-16735) Fail to create a map containing decimal type with literals having different inferred precisions and scales

2016-07-26 Thread Liang Ke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393639#comment-15393639
 ] 

Liang Ke edited comment on SPARK-16735 at 7/26/16 12:00 PM:


Hi, could someone help me push my patch to GitHub? 
Thanks a lot :)


was (Author: biglobster):
hi, if some one can help me to push my patch to github? thx alot:)

> Fail to create a map containing decimal type with literals having different 
> inferred precisions and scales
> -
>
> Key: SPARK-16735
> URL: https://issues.apache.org/jira/browse/SPARK-16735
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liang Ke
> Attachments: SPARK-16735.patch
>
>
> In Spark 2.0, we will parse float literals as decimals. However, this 
> introduces a side effect, which is described below.
> spark-sql> select map(0.1,0.01, 0.2,0.033);
> Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS 
> DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due 
> to data type mismatch: The given values of function map should all be the 
> same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7
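The immediate workaround is to cast every literal to one common decimal type
up front so all map values share a type. Python's decimal module can illustrate
the unification step (a sketch only; Spark SQL's actual type coercion rules
live in the analyzer, not in user code):

```python
from decimal import Decimal

# The failing query mixes inferred types decimal(2,2) and decimal(3,3).
# Casting everything to the widest type up front is the workaround; here we
# mimic that by quantizing all values to the widest scale (3 digits).
values = [Decimal("0.01"), Decimal("0.033")]
common_scale = Decimal("0.001")              # scale 3 covers both literals
unified = [v.quantize(common_scale) for v in values]
print(unified)                               # [Decimal('0.010'), Decimal('0.033')]

# After quantizing, every value has the same exponent (scale), i.e. one type.
assert all(v.as_tuple().exponent == -3 for v in unified)
```

In SQL terms this corresponds to writing explicit casts such as
{{CAST(0.01 AS DECIMAL(3,3))}} for each literal so the map values all resolve
to the same decimal type.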





