[jira] [Commented] (SPARK-18609) [SQL] column mixup with CROSS JOIN

2016-12-06 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728042#comment-15728042
 ] 

Song Jun commented on SPARK-18609:
--

I'm working on this~
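
For anyone reproducing this from the Scala shell, here is a minimal sketch (an illustration, not from the reporter) that drives the same SQL as the repro quoted below; it assumes the tables p1 and p2 already exist:

{code}
// Hypothetical spark-shell equivalent of the quoted SQL repro: the same CTE is
// reused on both sides of a cross join, which is where the binding failure shows up.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
spark.sql("""
  WITH CTE AS (
    SELECT s2.col AS col
    FROM p1
    CROSS JOIN (SELECT e.col AS col FROM p2 e) s2
  )
  SELECT T1.col AS c1, T2.col AS c2
  FROM CTE T1
  CROSS JOIN CTE T2
""").show()
{code}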

> [SQL] column mixup with CROSS JOIN
> --
>
> Key: SPARK-18609
> URL: https://issues.apache.org/jira/browse/SPARK-18609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Furcy Pin
>
> Reproduced on spark-sql v2.0.2 and on branch master.
> {code}
> DROP TABLE IF EXISTS p1 ;
> DROP TABLE IF EXISTS p2 ;
> CREATE TABLE p1 (col TIMESTAMP) ;
> CREATE TABLE p2 (col TIMESTAMP) ;
> set spark.sql.crossJoin.enabled = true;
> -- EXPLAIN
> WITH CTE AS (
>   SELECT
> s2.col as col
>   FROM p1
>   CROSS JOIN (
> SELECT
>   e.col as col
> FROM p2 E
>   ) s2
> )
> SELECT
>   T1.col as c1,
>   T2.col as c2
> FROM CTE T1
> CROSS JOIN CTE T2
> ;
> {code}
> This returns the following stacktrace :
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: col#21
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> 

[jira] [Resolved] (SPARK-18763) What algorithm is used in spark decision tree (is ID3, C4.5 or CART)?

2016-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18763.
---
   Resolution: Invalid
Fix Version/s: (was: 1.6.0)

> What algorithm is used in spark decision tree (is ID3, C4.5 or CART)?
> -
>
> Key: SPARK-18763
> URL: https://issues.apache.org/jira/browse/SPARK-18763
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: lklong
>Priority: Minor
>  Labels: beginner
>
> Hi Spark team,
> I have a question about the decision tree in MLlib: which algorithm does
> Spark's decision tree use (ID3, C4.5, or CART)?
> Please help me!
> Thanks very much!






[jira] [Commented] (SPARK-18763) What algorithm is used in spark decision tree (is ID3, C4.5 or CART)?

2016-12-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728037#comment-15728037
 ] 

Sean Owen commented on SPARK-18763:
---

Please ask questions on the mailing list. 
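
For background, MLlib's tree learner is a greedy recursive binary-partitioning implementation that exposes CART-style (gini, variance) and entropy-based impurity measures rather than implementing one textbook algorithm verbatim. A minimal sketch, assuming the spark.ml DecisionTreeClassifier API:

{code}
import org.apache.spark.ml.classification.DecisionTreeClassifier

// The impurity is a parameter: "gini" corresponds to the CART-style criterion,
// "entropy" to the information-gain criterion used by ID3/C4.5.
val dt = new DecisionTreeClassifier()
  .setImpurity("gini")   // or "entropy"
  .setMaxDepth(5)
  .setMaxBins(32)
{code}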

> What algorithm is used in spark decision tree (is ID3, C4.5 or CART)?
> -
>
> Key: SPARK-18763
> URL: https://issues.apache.org/jira/browse/SPARK-18763
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: lklong
>Priority: Minor
>  Labels: beginner
> Fix For: 1.6.0
>
>
> Hi Spark team,
> I have a question about the decision tree in MLlib: which algorithm does
> Spark's decision tree use (ID3, C4.5, or CART)?
> Please help me!
> Thanks very much!






[jira] [Assigned] (SPARK-18764) Add a warning log when skipping a corrupted file

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18764:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Add a warning log when skipping a corrupted file
> 
>
> Key: SPARK-18764
> URL: https://issues.apache.org/jira/browse/SPARK-18764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-18764) Add a warning log when skipping a corrupted file

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18764:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Add a warning log when skipping a corrupted file
> 
>
> Key: SPARK-18764
> URL: https://issues.apache.org/jira/browse/SPARK-18764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>







[jira] [Commented] (SPARK-18764) Add a warning log when skipping a corrupted file

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728021#comment-15728021
 ] 

Apache Spark commented on SPARK-18764:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16192
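
As a rough illustration of the requested behavior (a generic sketch, not the actual change in the PR above):

{code}
import java.io.IOException
import org.slf4j.LoggerFactory

// Generic sketch: when a reader is configured to tolerate corrupted input,
// skip the bad file but leave a warning in the logs instead of dropping it silently.
object SkipCorruptFiles {
  private val log = LoggerFactory.getLogger(getClass)

  def readAll(paths: Seq[String])(readFile: String => Seq[String]): Seq[String] =
    paths.flatMap { path =>
      try readFile(path)
      catch {
        case e: IOException =>
          log.warn(s"Skipping corrupted file: $path", e)
          Seq.empty
      }
    }
}
{code}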

> Add a warning log when skipping a corrupted file
> 
>
> Key: SPARK-18764
> URL: https://issues.apache.org/jira/browse/SPARK-18764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>







[jira] [Updated] (SPARK-18764) Add a warning log when skipping a corrupted file

2016-12-06 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18764:
-
Affects Version/s: 2.1.0

> Add a warning log when skipping a corrupted file
> 
>
> Key: SPARK-18764
> URL: https://issues.apache.org/jira/browse/SPARK-18764
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>







[jira] [Created] (SPARK-18764) Add a warning log when skipping a corrupted file

2016-12-06 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18764:


 Summary: Add a warning log when skipping a corrupted file
 Key: SPARK-18764
 URL: https://issues.apache.org/jira/browse/SPARK-18764
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu









[jira] [Created] (SPARK-18763) What algorithm is used in spark decision tree (is ID3, C4.5 or CART)?

2016-12-06 Thread lklong (JIRA)
lklong created SPARK-18763:
--

 Summary: What algorithm is used in spark decision tree (is ID3, 
C4.5 or CART)?
 Key: SPARK-18763
 URL: https://issues.apache.org/jira/browse/SPARK-18763
 Project: Spark
  Issue Type: Question
  Components: MLlib
Affects Versions: 1.6.0
Reporter: lklong
Priority: Minor
 Fix For: 1.6.0


Hi Spark team,
I have a question about the decision tree in MLlib: which algorithm does Spark's
decision tree use (ID3, C4.5, or CART)?
Please help me!
Thanks very much!






[jira] [Closed] (SPARK-18759) when use spark streaming with sparksql, lots of temp directories are created.

2016-12-06 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-18759.
---
Resolution: Duplicate

Duplicate of SPARK-18703.

> when use spark streaming with sparksql, lots of temp directories are created.
> -
>
> Key: SPARK-18759
> URL: https://issues.apache.org/jira/browse/SPARK-18759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Albert Cheng
>
> When using Spark Streaming with Spark SQL to insert records into an existing
> Hive table, lots of temp directories are created. Those directories are
> deleted only when the JVM exits, but a Spark Streaming job's JVM runs 24/7,
> so the directories keep accumulating.






[jira] [Assigned] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18762:


Assignee: Apache Spark

> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Blocker
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.
> More importantly, this introduces several broken links in the UI. For 
> example, in the master UI, the worker link is https:8081 instead of http:8081 
> or https:8481.






[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727990#comment-15727990
 ] 

Apache Spark commented on SPARK-18762:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/16190

> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.
> More importantly, this introduces several broken links in the UI. For 
> example, in the master UI, the worker link is https:8081 instead of http:8081 
> or https:8481.






[jira] [Assigned] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18762:


Assignee: (was: Apache Spark)

> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.
> More importantly, this introduces several broken links in the UI. For 
> example, in the master UI, the worker link is https:8081 instead of http:8081 
> or https:8481.






[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727986#comment-15727986
 ] 

Kousuke Saruta commented on SPARK-18762:


Yeah of course.

> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.
> More importantly, this introduces several broken links in the UI. For 
> example, in the master UI, the worker link is https:8081 instead of http:8081 
> or https:8481.






[jira] [Commented] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727983#comment-15727983
 ] 

Apache Spark commented on SPARK-18761:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/16190

> Uncancellable / unkillable tasks may starve jobs of resources
> 
>
> Key: SPARK-18761
> URL: https://issues.apache.org/jira/browse/SPARK-18761
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark's current task cancellation / task killing mechanism is "best effort" 
> in the sense that some tasks may not be interruptible and may not respond to 
> their "killed" flags being set. If a significant fraction of a cluster's task 
> slots are occupied by tasks that have been marked as killed but remain 
> running then this can lead to a situation where new jobs and tasks are 
> starved of resources because zombie tasks are holding resources.
> I propose to address this problem by introducing a "task reaper" mechanism in 
> executors to monitor tasks after they are marked for killing in order to 
> periodically re-attempt the task kill, capture and log stacktraces / warnings 
> if tasks do not exit in a timely manner, and, optionally, kill the entire 
> executor JVM if cancelled tasks cannot be killed within some timeout.






[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727981#comment-15727981
 ] 

Xiangrui Meng commented on SPARK-18762:
---

Thanks! Please make sure the Spark history server still works when SSL is enabled.
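
For anyone verifying this, a minimal sketch of turning SSL on for the UI (keystore path and passwords are placeholders; the standard spark.ssl.* settings are assumed):

{code}
import org.apache.spark.SparkConf

// Placeholder values only. As described in this issue, 4040 stays the http port
// and redirects to a separate https port, which is where the printed URL goes wrong.
val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.keyStore", "/path/to/keystore.jks")
  .set("spark.ssl.keyStorePassword", "password")
  .set("spark.ssl.keyPassword", "password")
{code}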

> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.
> More importantly, this introduces several broken links in the UI. For 
> example, in the master UI, the worker link is https:8081 instead of http:8081 
> or https:8481.






[jira] [Commented] (SPARK-18756) Memory leak in Spark streaming

2016-12-06 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727975#comment-15727975
 ] 

Liang-Chi Hsieh commented on SPARK-18756:
-

As we have already upgraded to Netty 4.0.42.Final, this should not be a problem now.
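
A quick sketch for checking which Netty is actually on the classpath of a running driver or executor (it reads the jar manifest; this assumes the netty-all artifact sets Implementation-Version):

{code}
// Prints the manifest version, e.g. "4.0.42.Final"; anything older than
// 4.0.41.Final would still carry the leak discussed below.
println(classOf[io.netty.buffer.ByteBuf].getPackage.getImplementationVersion)
{code}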

> Memory leak in Spark streaming
> --
>
> Key: SPARK-18756
> URL: https://issues.apache.org/jira/browse/SPARK-18756
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Udit Mehrotra
>
> We have a Spark Streaming application that processes data from Kinesis.
> In our application we are observing a memory leak at the executors: Netty
> buffers are not released properly when the Spark BlockManager tries to
> replicate the input blocks received from the Kinesis stream. The leak occurs
> when we set the storage level to MEMORY_AND_DISK_2 for the Kinesis input blocks.
> However, if we change the storage level to MEMORY_AND_DISK, which avoids
> creating a replica, we do not observe the leak any more. We were able to
> detect the leak and obtain the stack trace by running the executors with an
> additional JVM option: -Dio.netty.leakDetectionLevel=advanced.
> Here is the stack trace of the leak:
> 16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not 
> called before it's garbage-collected. See 
> http://netty.io/wiki/reference-counted-objects.html for more information.
> Recent access records: 0
> Created at:
>   io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:103)
>   io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335)
>   io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247)
>   
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69)
>   
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182)
>   
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997)
>   
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:702)
>   
> org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80)
>   
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
>   
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
>   org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
>   
> org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282)
>   
> org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352)
>   
> org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)
>   
> org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)
>   
> org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)
> We also observe a continuous increase in off heap memory usage at the 
> executors. Any help would be appreciated.






[jira] [Commented] (SPARK-18756) Memory leak in Spark streaming

2016-12-06 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727972#comment-15727972
 ] 

Liang-Chi Hsieh commented on SPARK-18756:
-

I believe this bug is fixed by https://github.com/netty/netty/pull/5605, which
is included in Netty 4.0.41.Final.


> Memory leak in Spark streaming
> --
>
> Key: SPARK-18756
> URL: https://issues.apache.org/jira/browse/SPARK-18756
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Udit Mehrotra
>
> We have a Spark Streaming application that processes data from Kinesis.
> In our application we are observing a memory leak at the executors: Netty
> buffers are not released properly when the Spark BlockManager tries to
> replicate the input blocks received from the Kinesis stream. The leak occurs
> when we set the storage level to MEMORY_AND_DISK_2 for the Kinesis input blocks.
> However, if we change the storage level to MEMORY_AND_DISK, which avoids
> creating a replica, we do not observe the leak any more. We were able to
> detect the leak and obtain the stack trace by running the executors with an
> additional JVM option: -Dio.netty.leakDetectionLevel=advanced.
> Here is the stack trace of the leak:
> 16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not 
> called before it's garbage-collected. See 
> http://netty.io/wiki/reference-counted-objects.html for more information.
> Recent access records: 0
> Created at:
>   io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:103)
>   io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335)
>   io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247)
>   
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69)
>   
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182)
>   
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997)
>   
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:702)
>   
> org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80)
>   
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
>   
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
>   org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
>   
> org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282)
>   
> org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352)
>   
> org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)
>   
> org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)
>   
> org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)
> We also observe a continuous increase in off heap memory usage at the 
> executors. Any help would be appreciated.






[jira] [Issue Comment Deleted] (SPARK-18713) using SparkR build step wise regression model (glm)

2016-12-06 Thread Prasann modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasann modi updated SPARK-18713:
-
Comment: was deleted

(was: Can you add a stepwise regression function to an upcoming Spark version?)

> using SparkR build step wise regression model (glm)
> ---
>
> Key: SPARK-18713
> URL: https://issues.apache.org/jira/browse/SPARK-18713
> Project: Spark
>  Issue Type: Bug
>Reporter: Prasann modi
>
> In R, a stepwise regression model can be built with
> step(glm(formula, data, family), direction = "forward").
> How can a stepwise regression model be built using SparkR?
> I am using Spark 2.0.0 and R 3.3.1.






[jira] [Reopened] (SPARK-18713) using SparkR build step wise regression model (glm)

2016-12-06 Thread Prasann modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasann modi reopened SPARK-18713:
--

Can you add a stepwise regression function to an upcoming Spark version?
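
For reference, a minimal sketch of what spark.ml currently offers in this area: a single GLM fit with no automatic stepwise selection (forward selection would have to be scripted around repeated fits, e.g. driven by the model's AIC). It assumes a DataFrame `training` with "features" and "label" columns:

{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// One GLM fit; a hand-rolled forward-selection loop would refit with growing
// feature subsets and compare model.summary.aic between candidates.
val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .setMaxIter(25)
val model = glr.fit(training)
println(model.summary.aic)
{code}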

> using SparkR build step wise regression model (glm)
> ---
>
> Key: SPARK-18713
> URL: https://issues.apache.org/jira/browse/SPARK-18713
> Project: Spark
>  Issue Type: Bug
>Reporter: Prasann modi
>
> In R, a stepwise regression model can be built with
> step(glm(formula, data, family), direction = "forward").
> How can a stepwise regression model be built using SparkR?
> I am using Spark 2.0.0 and R 3.3.1.






[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727963#comment-15727963
 ] 

Kousuke Saruta commented on SPARK-18762:


[~mengxr] Ah... OK, I'll submit a PR to revert it.

> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.
> More importantly, this introduces several broken links in the UI. For 
> example, in the master UI, the worker link is https:8081 instead of http:8081 
> or https:8481.






[jira] [Commented] (SPARK-18759) when use spark streaming with sparksql, lots of temp directories are created.

2016-12-06 Thread Albert Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727958#comment-15727958
 ] 

Albert Cheng commented on SPARK-18759:
--

[~viirya] is right, this issue is a duplicate of SPARK-18703. Please close this
issue.

> when use spark streaming with sparksql, lots of temp directories are created.
> -
>
> Key: SPARK-18759
> URL: https://issues.apache.org/jira/browse/SPARK-18759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Albert Cheng
>
> When using Spark Streaming with Spark SQL to insert records into an existing
> Hive table, lots of temp directories are created. Those directories are
> deleted only when the JVM exits, but a Spark Streaming job's JVM runs 24/7,
> so the directories keep accumulating.






[jira] [Commented] (SPARK-18713) using SparkR build step wise regression model (glm)

2016-12-06 Thread Prasann modi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727956#comment-15727956
 ] 

Prasann modi commented on SPARK-18713:
--

Can you add a stepwise regression function to an upcoming Spark version?

> using SparkR build step wise regression model (glm)
> ---
>
> Key: SPARK-18713
> URL: https://issues.apache.org/jira/browse/SPARK-18713
> Project: Spark
>  Issue Type: Bug
>Reporter: Prasann modi
>
> In R, a stepwise regression model can be built with
> step(glm(formula, data, family), direction = "forward").
> How can a stepwise regression model be built using SparkR?
> I am using Spark 2.0.0 and R 3.3.1.






[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-18762:
--
Description: 
When SSL is enabled, the Spark shell shows:

{code}
Spark context Web UI available at https://192.168.99.1:4040
{code}

This is wrong because 4040 is http, not https. It redirects to the https port.

More importantly, this introduces several broken links in the UI. For example, 
in the master UI, the worker link is https:8081 instead of http:8081 or 
https:8481.

  was:
When SSL is enabled, the Spark shell shows:

{code}
Spark context Web UI available at https://192.168.99.1:4040
{code}

This is wrong because 4040 is http, not https. It redirects to the https port.

More importantly, this cause several broken links in the UI. For example, in 
the master UI, the worker link is https:8081 instead of http:8081 or https:8481.


> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.
> More importantly, this introduces several broken links in the UI. For 
> example, in the master UI, the worker link is https:8081 instead of http:8081 
> or https:8481.






[jira] [Comment Edited] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727929#comment-15727929
 ] 

Xiangrui Meng edited comment on SPARK-18762 at 12/7/16 6:56 AM:


cc [~hayashidac] [~sarutak] [~lian cheng]


was (Author: mengxr):
cc [~hayashidac] [~sarutak]

> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.
> More importantly, this cause several broken links in the UI. For example, in 
> the master UI, the worker link is https:8081 instead of http:8081 or 
> https:8481.






[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727929#comment-15727929
 ] 

Xiangrui Meng commented on SPARK-18762:
---

cc [~hayashidac] [~sarutak]

> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.
> More importantly, this cause several broken links in the UI. For example, in 
> the master UI, the worker link is https:8081 instead of http:8081 or 
> https:8481.






[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-18762:
--
Description: 
When SSL is enabled, the Spark shell shows:

{code}
Spark context Web UI available at https://192.168.99.1:4040
{code}

This is wrong because 4040 is http, not https. It redirects to the https port.

More importantly, this cause several broken links in the UI. For example, in 
the master UI, the worker link is https:8081 instead of http:8081 or https:8481.

  was:
When SSL is enabled, the Spark shell shows:

{code}
Spark context Web UI available at https://192.168.99.1:4040
{code}

This is wrong because 4040 is http, not https. It redirects to the https port.


> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.
> More importantly, this cause several broken links in the UI. For example, in 
> the master UI, the worker link is https:8081 instead of http:8081 or 
> https:8481.






[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-18762:
--
Priority: Blocker  (was: Critical)

> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.






[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-18762:
--
Priority: Critical  (was: Major)

> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.






[jira] [Created] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-06 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-18762:
-

 Summary: Web UI should be http:4040 instead of https:4040
 Key: SPARK-18762
 URL: https://issues.apache.org/jira/browse/SPARK-18762
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Web UI
Affects Versions: 2.1.0
Reporter: Xiangrui Meng


When SSL is enabled, the Spark shell shows:

{code}
Spark context Web UI available at https://192.168.99.1:4040
{code}

This is wrong because 4040 is http, not https. It redirects to the https port.






[jira] [Commented] (SPARK-18759) when use spark streaming with sparksql, lots of temp directories are created.

2016-12-06 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727899#comment-15727899
 ] 

Liang-Chi Hsieh commented on SPARK-18759:
-

I think this is a duplicate of SPARK-18703.

> when use spark streaming with sparksql, lots of temp directories are created.
> -
>
> Key: SPARK-18759
> URL: https://issues.apache.org/jira/browse/SPARK-18759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Albert Cheng
>
> When using Spark Streaming with Spark SQL to insert records into an existing
> Hive table, lots of temp directories are created. Those directories are
> deleted only when the JVM exits, but a Spark Streaming job's JVM runs 24/7,
> so the directories keep accumulating.






[jira] [Assigned] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18761:


Assignee: Apache Spark  (was: Josh Rosen)

> Uncancellable / unkillable tasks may starve jobs of resources
> 
>
> Key: SPARK-18761
> URL: https://issues.apache.org/jira/browse/SPARK-18761
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Spark's current task cancellation / task killing mechanism is "best effort" 
> in the sense that some tasks may not be interruptible and may not respond to 
> their "killed" flags being set. If a significant fraction of a cluster's task 
> slots are occupied by tasks that have been marked as killed but remain 
> running then this can lead to a situation where new jobs and tasks are 
> starved of resources because zombie tasks are holding resources.
> I propose to address this problem by introducing a "task reaper" mechanism in 
> executors to monitor tasks after they are marked for killing in order to 
> periodically re-attempt the task kill, capture and log stacktraces / warnings 
> if tasks do not exit in a timely manner, and, optionally, kill the entire 
> executor JVM if cancelled tasks cannot be killed within some timeout.






[jira] [Assigned] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18761:


Assignee: Josh Rosen  (was: Apache Spark)

> Uncancellable / unkillable tasks may starve jobs of resources
> 
>
> Key: SPARK-18761
> URL: https://issues.apache.org/jira/browse/SPARK-18761
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark's current task cancellation / task killing mechanism is "best effort" 
> in the sense that some tasks may not be interruptible and may not respond to 
> their "killed" flags being set. If a significant fraction of a cluster's task 
> slots are occupied by tasks that have been marked as killed but remain 
> running then this can lead to a situation where new jobs and tasks are 
> starved of resources because zombie tasks are holding resources.
> I propose to address this problem by introducing a "task reaper" mechanism in 
> executors to monitor tasks after they are marked for killing in order to 
> periodically re-attempt the task kill, capture and log stacktraces / warnings 
> if tasks do not exit in a timely manner, and, optionally, kill the entire 
> executor JVM if cancelled tasks cannot be killed within some timeout.






[jira] [Commented] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727896#comment-15727896
 ] 

Apache Spark commented on SPARK-18761:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16189

> Uncancellable / unkillable tasks may starve jobs of resources
> 
>
> Key: SPARK-18761
> URL: https://issues.apache.org/jira/browse/SPARK-18761
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark's current task cancellation / task killing mechanism is "best effort" 
> in the sense that some tasks may not be interruptible and may not respond to 
> their "killed" flags being set. If a significant fraction of a cluster's task 
> slots are occupied by tasks that have been marked as killed but remain 
> running then this can lead to a situation where new jobs and tasks are 
> starved of resources because zombie tasks are holding resources.
> I propose to address this problem by introducing a "task reaper" mechanism in 
> executors to monitor tasks after they are marked for killing in order to 
> periodically re-attempt the task kill, capture and log stacktraces / warnings 
> if tasks do not exit in a timely manner, and, optionally, kill the entire 
> executor JVM if cancelled tasks cannot be killed within some timeout.






[jira] [Created] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources

2016-12-06 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-18761:
--

 Summary: Uncancellable / unkillable tasks may starve jobs of resources
 Key: SPARK-18761
 URL: https://issues.apache.org/jira/browse/SPARK-18761
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen


Spark's current task cancellation / task killing mechanism is "best effort" in 
the sense that some tasks may not be interruptible and may not respond to their 
"killed" flags being set. If a significant fraction of a cluster's task slots 
are occupied by tasks that have been marked as killed but remain running then 
this can lead to a situation where new jobs and tasks are starved of resources 
because zombie tasks are holding resources.

I propose to address this problem by introducing a "task reaper" mechanism in 
executors to monitor tasks after they are marked for killing in order to 
periodically re-attempt the task kill, capture and log stacktraces / warnings 
if tasks do not exit in a timely manner, and, optionally, kill the entire 
executor JVM if cancelled tasks cannot be killed within some timeout.
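
A minimal, hypothetical sketch of such a reaper (not the actual implementation in the PR): a monitor that periodically re-attempts the kill, dumps the task thread's stack so stuck tasks are visible in the logs, and can escalate to killing the JVM after a timeout.

{code}
// Hypothetical sketch only: monitor one task thread that has been marked as killed.
class TaskReaper(taskThread: Thread, killTimeoutMs: Long, pollIntervalMs: Long = 1000L)
  extends Runnable {

  override def run(): Unit = {
    val deadline = System.currentTimeMillis() + killTimeoutMs
    while (taskThread.isAlive && System.currentTimeMillis() < deadline) {
      taskThread.interrupt()  // periodically re-attempt the kill
      // capture and log a stacktrace so it is visible where the task is stuck
      taskThread.getStackTrace.foreach(frame => System.err.println(s"  at $frame"))
      Thread.sleep(pollIntervalMs)
    }
    if (taskThread.isAlive) {
      System.err.println(s"Task thread ${taskThread.getName} did not exit within $killTimeoutMs ms")
      // optionally: System.exit(1) to kill the whole executor JVM, if so configured
    }
  }
}
{code}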






[jira] [Comment Edited] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.

2016-12-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727763#comment-15727763
 ] 

Dongjoon Hyun edited comment on SPARK-18709 at 12/7/16 5:32 AM:


Hi, [~srowen].
The type verification was introduced by
https://issues.apache.org/jira/browse/SPARK-14945 when `session.py` was created
in 2.0.0.


was (Author: dongjoon):
@srowen . 
The type verification was introduced by
https://issues.apache.org/jira/browse/SPARK-14945 when `session.py` was created
in 2.0.0.

> Automatic null conversion bug (instead of throwing error) when creating a 
> Spark DataFrame with incompatible types for fields.
> 
>
> Key: SPARK-18709
> URL: https://issues.apache.org/jira/browse/SPARK-18709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Amogh Param
>  Labels: bug
> Fix For: 2.0.0
>
>
> When converting an RDD with a `float` type field to a spark dataframe with an 
> `IntegerType` / `LongType` schema field, spark 1.6.2 and 1.6.3 silently 
> convert the field values to nulls instead of throwing an error like `LongType 
> can not accept object ___ in type `. However, this seems to be 
> fixed in Spark 2.0.2.
> The following example should make the problem clear:
> {code}
> from pyspark.sql.types import StructField, StructType, LongType, DoubleType
> nan = float('nan')  # nan was undefined in the original snippet; any float NaN works
> schema = StructType([
>     StructField("0", LongType(), True),
>     StructField("1", DoubleType(), True),
> ])
> data = [[1.0, 1.0], [nan, 2.0]]
> spark_df = sqlContext.createDataFrame(sc.parallelize(data), schema)
> spark_df.show()
> {code}
> Instead of throwing an error like:
> {code}
> LongType can not accept object 1.0 in type 
> {code}
> Spark converts all the values in the first column to nulls
> Running `spark_df.show()` gives:
> {code}
> +----+---+
> |   0|  1|
> +----+---+
> |null|1.0|
> |null|1.0|
> +----+---+
> {code}
> For the purposes of my computation, I'm doing a `mapPartitions` on a spark 
> data frame, and for each partition, converting it into a pandas data frame, 
> doing a few computations on this pandas dataframe and the return value will 
> be a list of lists, which is converted to an RDD while being returned from 
> 'mapPartitions' (for all partitions). This RDD is then converted into a spark 
> dataframe similar to the example above, using 
> `sqlContext.createDataFrame(rdd, schema)`. The rdd has a column that should 
> be converted to a `LongType` in the spark data frame, but since it has 
> missing values, it is a `float` type. When spark tries to create the data 
> frame, it converts all the values in that column to nulls instead of throwing 
> an error that there is a type mismatch.






[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.

2016-12-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727763#comment-15727763
 ] 

Dongjoon Hyun commented on SPARK-18709:
---

@srowen . 
The type verification was introduced by
https://issues.apache.org/jira/browse/SPARK-14945 when `session.py` was created
in 2.0.0.

> Automatic null conversion bug (instead of throwing error) when creating a 
> Spark DataFrame with incompatible types for fields.
> 
>
> Key: SPARK-18709
> URL: https://issues.apache.org/jira/browse/SPARK-18709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Amogh Param
>  Labels: bug
> Fix For: 2.0.0
>
>
> When converting an RDD with a `float` type field to a spark dataframe with an 
> `IntegerType` / `LongType` schema field, spark 1.6.2 and 1.6.3 silently 
> convert the field values to nulls instead of throwing an error like `LongType 
> can not accept object ___ in type `. However, this seems to be 
> fixed in Spark 2.0.2.
> The following example should make the problem clear:
> {code}
> from pyspark.sql.types import StructField, StructType, LongType, DoubleType
> nan = float('nan')  # nan was undefined in the original snippet; any float NaN works
> schema = StructType([
>     StructField("0", LongType(), True),
>     StructField("1", DoubleType(), True),
> ])
> data = [[1.0, 1.0], [nan, 2.0]]
> spark_df = sqlContext.createDataFrame(sc.parallelize(data), schema)
> spark_df.show()
> {code}
> Instead of throwing an error like:
> {code}
> LongType can not accept object 1.0 in type 
> {code}
> Spark converts all the values in the first column to nulls
> Running `spark_df.show()` gives:
> {code}
> +----+---+
> |   0|  1|
> +----+---+
> |null|1.0|
> |null|1.0|
> +----+---+
> {code}
> For the purposes of my computation, I'm doing a `mapPartitions` on a spark 
> data frame, and for each partition, converting it into a pandas data frame, 
> doing a few computations on this pandas dataframe and the return value will 
> be a list of lists, which is converted to an RDD while being returned from 
> 'mapPartitions' (for all partitions). This RDD is then converted into a spark 
> dataframe similar to the example above, using 
> `sqlContext.createDataFrame(rdd, schema)`. The rdd has a column that should 
> be converted to a `LongType` in the spark data frame, but since it has 
> missing values, it is a `float` type. When spark tries to create the data 
> frame, it converts all the values in that column to nulls instead of throwing 
> an error that there is a type mismatch.






[jira] [Assigned] (SPARK-18760) Provide consistent format output for all file formats

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18760:


Assignee: Reynold Xin  (was: Apache Spark)

> Provide consistent format output for all file formats
> -
>
> Key: SPARK-18760
> URL: https://issues.apache.org/jira/browse/SPARK-18760
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently rely on FileFormat implementations to override toString in order 
> to get a proper explain output. It'd be better to just depend on shortName 
> for those.
> Before:
> {noformat}
> scala> spark.read.text("test.text").explain()
> == Physical Plan ==
> *FileScan text [value#15] Batched: false, Format: 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: 
> InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}
> After:
> {noformat}
> scala> spark.read.text("test.text").explain()
> == Physical Plan ==
> *FileScan text [value#15] Batched: false, Format: text, Location: 
> InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18760) Provide consistent format output for all file formats

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727746#comment-15727746
 ] 

Apache Spark commented on SPARK-18760:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16187

> Provide consistent format output for all file formats
> -
>
> Key: SPARK-18760
> URL: https://issues.apache.org/jira/browse/SPARK-18760
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently rely on FileFormat implementations to override toString in order 
> to get a proper explain output. It'd be better to just depend on shortName 
> for those.
> Before:
> {noformat}
> scala> spark.read.text("test.text").explain()
> == Physical Plan ==
> *FileScan text [value#15] Batched: false, Format: 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: 
> InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}
> After:
> {noformat}
> scala> spark.read.text("test.text").explain()
> == Physical Plan ==
> *FileScan text [value#15] Batched: false, Format: text, Location: 
> InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18760) Provide consistent format output for all file formats

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18760:


Assignee: Apache Spark  (was: Reynold Xin)

> Provide consistent format output for all file formats
> -
>
> Key: SPARK-18760
> URL: https://issues.apache.org/jira/browse/SPARK-18760
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We currently rely on FileFormat implementations to override toString in order 
> to get a proper explain output. It'd be better to just depend on shortName 
> for those.
> Before:
> {noformat}
> scala> spark.read.text("test.text").explain()
> == Physical Plan ==
> *FileScan text [value#15] Batched: false, Format: 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: 
> InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}
> After:
> {noformat}
> scala> spark.read.text("test.text").explain()
> == Physical Plan ==
> *FileScan text [value#15] Batched: false, Format: text, Location: 
> InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18760) Provide consistent format output for all file formats

2016-12-06 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18760:
---

 Summary: Provide consistent format output for all file formats
 Key: SPARK-18760
 URL: https://issues.apache.org/jira/browse/SPARK-18760
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


We currently rely on FileFormat implementations to override toString in order 
to get a proper explain output. It'd be better to just depend on shortName for 
those.

Before:
{noformat}
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: 
org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: 
InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct
{noformat}

After:
{noformat}
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: text, Location: 
InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct
{noformat}
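
For context, a minimal, self-contained sketch of the difference (these are illustrative stand-in classes, not Spark's actual FileFormat trait): the default toString yields "ClassName@hashcode", which is what leaks into the plan above, while a stable shortName gives the clean "Format: text" label.

{code}
// Illustrative only: a stand-in for a data source implementation.
trait HasShortName {
  def shortName: String
}

class TextFileFormatLike extends HasShortName {
  override def shortName: String = "text"
  // deliberately no toString override, mirroring most implementations
}

object ExplainLabelDemo {
  def main(args: Array[String]): Unit = {
    val fmt = new TextFileFormatLike
    println(s"label from toString:  $fmt")             // e.g. TextFileFormatLike@1b2c3d4e
    println(s"label from shortName: ${fmt.shortName}") // text
  }
}
{code}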





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11482) Maven repo in IsolatedClientLoader should be configurable.

2016-12-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-11482.
---
Resolution: Later

> Maven repo in IsolatedClientLoader should be configurable. 
> ---
>
> Key: SPARK-11482
> URL: https://issues.apache.org/jira/browse/SPARK-11482
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
>Reporter: Doug Balog
>Priority: Minor
>
> The Maven repo used to fetch the Hive jars and their dependencies is 
> hard-coded.
> A user should be able to override it via configuration. 
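
For context, a sketch of the existing settings that send Spark down the Maven download path (using the Spark 2.x SparkSession API; the metastore version value is illustrative). The repository that IsolatedClientLoader resolves these jars from is the part this ticket asks to make overridable:

{code}
// "maven" makes IsolatedClientLoader fetch the Hive client jars and their
// dependencies from a Maven repository; that repository URL is currently
// hard-coded rather than configurable.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("isolated-client-loader-demo")
  .config("spark.sql.hive.metastore.version", "1.2.1")  // illustrative version
  .config("spark.sql.hive.metastore.jars", "maven")     // triggers the Maven resolution path
  .enableHiveSupport()
  .getOrCreate()
{code}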



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet

2016-12-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7263.
--
Resolution: Later

> Add new shuffle manager which stores shuffle blocks in Parquet
> --
>
> Key: SPARK-7263
> URL: https://issues.apache.org/jira/browse/SPARK-7263
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: Matt Massie
>
> I have a working prototype of this feature that can be viewed at
> https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1
> Setting the "spark.shuffle.manager" to "parquet" enables this shuffle manager.
> The dictionary support that Parquet provides appreciably reduces the amount of
> memory that objects use; however, once Parquet data is shuffled, all the
> dictionary information is lost and the column-oriented data is written to 
> shuffle
> blocks in a record-oriented fashion. This shuffle manager addresses this issue
> by reading and writing all shuffle blocks in the Parquet format.
> If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a 
> Parquet schema and used directly; otherwise, the Parquet schema is generated 
> via reflection.
> Currently, the only non-Avro keys supported are primitive types. The reflection
> code can be improved (or replaced) to support complex records.
> The ParquetShufflePair class allows the shuffle key and value to be stored in
> Parquet blocks as a single record with a single schema.
> This commit adds the following new Spark configuration options:
> "spark.shuffle.parquet.compression" - sets the Parquet compression codec
> "spark.shuffle.parquet.blocksize" - sets the Parquet block size
> "spark.shuffle.parquet.pagesize" - set the Parquet page size
> "spark.shuffle.parquet.enabledictionary" - turns dictionary encoding on/off
> Parquet does not (and has no plans to) support a streaming API. Metadata 
> sections
> are scattered through a Parquet file making a streaming API difficult. As 
> such,
> the ShuffleBlockFetcherIterator has been modified to fetch the entire contents
> of map outputs into temporary blocks before loading the data into the reducer.
> Interesting future asides:
> o There is no need to define a data serializer (although Spark requires it)
> o Parquet supports predicate pushdown and projection, which could be used
>   between shuffle stages to improve performance in the future
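
For reference, a sketch of how the proposed options listed above would be set (these keys come from the prototype described in this ticket, not from a released Spark version; the codec and sizes are illustrative values):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parquet-shuffle-demo")
  .set("spark.shuffle.manager", "parquet")                              // select the prototype shuffle manager
  .set("spark.shuffle.parquet.compression", "snappy")                   // Parquet compression codec
  .set("spark.shuffle.parquet.blocksize", (64 * 1024 * 1024).toString)  // Parquet block size in bytes
  .set("spark.shuffle.parquet.pagesize", (1024 * 1024).toString)        // Parquet page size in bytes
  .set("spark.shuffle.parquet.enabledictionary", "true")                // dictionary encoding on/off

val sc = new SparkContext(conf)
{code}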



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8398) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats

2016-12-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-8398.
--
Resolution: Later

> Consistently expose Hadoop Configuration/JobConf parameters for Hadoop 
> input/output formats
> ---
>
> Key: SPARK-8398
> URL: https://issues.apache.org/jira/browse/SPARK-8398
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: koert kuipers
>Priority: Trivial
>
> Currently a custom Hadoop Configuration or JobConf can be passed into quite a 
> few functions that use Hadoop input formats to read or Hadoop output formats 
> to write data. The goal of this JIRA is to make this consistent and expose 
> Configuration/JobConf for all these methods, which facilitates re-use and 
> discourages many additional parameters (that end up changing the 
> Configuration/JobConf internally). 
> See also:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-hadoop-input-output-format-advanced-control-td11168.html
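
As a point of comparison, a sketch of one read method that already accepts a custom Hadoop Configuration today (the path and the tuning key below are illustrative); the ticket asks for this kind of parameter to be exposed consistently across the Hadoop-format read and write methods:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hadoop-conf-demo").setMaster("local[*]"))

// Start from the context's Hadoop configuration and tweak one input setting.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize", (128 * 1024 * 1024).toString)

val lines = sc.newAPIHadoopFile(
  "/tmp/input",              // illustrative path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf)

println(lines.map(_._2.toString).take(5).mkString("\n"))
{code}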



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18678) Skewed reservoir sampling in SamplingUtils

2016-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18678:
--
Summary: Skewed reservoir sampling in SamplingUtils  (was: Skewed feature 
subsampling in Random forest)

> Skewed reservoir sampling in SamplingUtils
> --
>
> Key: SPARK-18678
> URL: https://issues.apache.org/jira/browse/SPARK-18678
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Bjoern Toldbod
>
> The feature subsampling in the RandomForest implementation
> (org.apache.spark.ml.tree.impl.RandomForest)
> is performed using SamplingUtils.reservoirSampleAndCount.
> The implementation of the sampling skews feature selection in favor of 
> features with a higher index. 
> The skewness is smaller for a large number of features, but completely 
> dominates the feature selection for a small number of features. The extreme 
> case is when the number of features is 2 and the number of features to select 
> is 1.
> In this case the feature sampling will always pick feature 1 and ignore 
> feature 0.
> Of course this produces low-quality models when subsampling over a small 
> number of features.
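
For reference, a sketch of unbiased reservoir sampling (Algorithm R), written independently of Spark's SamplingUtils: each element should end up in the reservoir with probability k / n, so in the extreme case above an unskewed implementation picks each of the two features roughly half the time.

{code}
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Standard Algorithm R: keep the first k items, then replace a random slot
// with probability k / (i + 1) for the i-th item (0-based).
def reservoirSample[T](input: Iterator[T], k: Int, rng: Random): Array[T] = {
  val reservoir = new ArrayBuffer[T](k)
  var i = 0L
  while (input.hasNext) {
    val item = input.next()
    if (i < k) {
      reservoir += item
    } else {
      val j = (rng.nextDouble() * (i + 1)).toLong  // uniform in [0, i]
      if (j < k) reservoir(j.toInt) = item
    }
    i += 1
  }
  reservoir.toArray
}

// Frequency check for the "2 features, select 1" case described above.
val rng = new Random(42)
val counts = (1 to 100000)
  .map(_ => reservoirSample(Iterator(0, 1), 1, rng).head)
  .groupBy(identity)
  .map { case (feature, picks) => feature -> picks.size }
println(counts)  // expect both features to be picked roughly 50,000 times each
{code}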



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16948) Use metastore schema instead of inferring schema for ORC in HiveMetastoreCatalog

2016-12-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16948.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.1.0

> Use metastore schema instead of inferring schema for ORC in 
> HiveMetastoreCatalog
> 
>
> Key: SPARK-16948
> URL: https://issues.apache.org/jira/browse/SPARK-16948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Querying empty partitioned ORC tables from spark-sql throws an exception when 
> "spark.sql.hive.convertMetastoreOrc=true".
> {noformat}
> java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:297)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:284)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:284)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$OrcConversions$$convertToOrcRelation(HiveMetastoreCatalo)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:423)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:414)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
> {noformat}
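
For reference, a minimal repro sketch based on the description above (assuming a Hive-enabled SparkSession; the table name is illustrative):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("empty-orc-repro")
  .enableHiveSupport()
  .getOrCreate()

// An ORC table that is partitioned but has no partitions added yet.
spark.sql("CREATE TABLE empty_orc (a INT) PARTITIONED BY (p INT) STORED AS ORC")
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")

// Per this report, querying the empty table then fails with None.get
// in HiveMetastoreCatalog when the ORC conversion is enabled.
spark.sql("SELECT * FROM empty_orc").show()
{code}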



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18759) when use spark streaming with sparksql, lots of temp directories are created.

2016-12-06 Thread Albert Cheng (JIRA)
Albert Cheng created SPARK-18759:


 Summary: when use spark streaming with sparksql, lots of temp 
directories are created.
 Key: SPARK-18759
 URL: https://issues.apache.org/jira/browse/SPARK-18759
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Albert Cheng


When using Spark Streaming with Spark SQL to insert records into an existing 
Hive table, lots of temp directories are created. Those directories are only 
deleted when the JVM exits, but a Spark Streaming job typically keeps its JVM 
running 24/7, so the directories keep accumulating.
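
A sketch of the usage pattern described above, i.e. a long-running streaming job that keeps inserting into an existing Hive table through Spark SQL (the socket source and table name are illustrative); each batch insert can leave staging/temp directories behind that are only cleaned up when the JVM exits:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-insert-demo")
val ssc = new StreamingContext(conf, Seconds(10))
val spark = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
import spark.implicits._

val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD { rdd =>
  // Every micro-batch writes into the same pre-existing Hive table.
  rdd.toDF("value").write.insertInto("existing_hive_table")
}

ssc.start()
ssc.awaitTermination()
{code}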



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18758) StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727580#comment-15727580
 ] 

Apache Spark commented on SPARK-18758:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16186

> StreamingQueryListener events from a StreamingQuery should be sent only to 
> the listeners in the same session as the query
> -
>
> Key: SPARK-18758
> URL: https://issues.apache.org/jira/browse/SPARK-18758
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Priority: Critical
>
> Listeners added with `sparkSession.streams.addListener(l)` are added to a 
> SparkSession, so only events from queries in the same session as a listener 
> should be posted to that listener.
> Currently, all the events get routed through Spark's main listener bus, and 
> therefore all StreamingQueryListener events get posted to 
> StreamingQueryListeners in all sessions. This is wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18758) StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18758:


Assignee: (was: Apache Spark)

> StreamingQueryListener events from a StreamingQuery should be sent only to 
> the listeners in the same session as the query
> -
>
> Key: SPARK-18758
> URL: https://issues.apache.org/jira/browse/SPARK-18758
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Priority: Critical
>
> Listeners added with `sparkSession.streams.addListener(l)` are added to a 
> SparkSession, so only events from queries in the same session as a listener 
> should be posted to that listener.
> Currently, all the events get routed through Spark's main listener bus, and 
> therefore all StreamingQueryListener events get posted to 
> StreamingQueryListeners in all sessions. This is wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18758) StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18758:


Assignee: Apache Spark

> StreamingQueryListener events from a StreamingQuery should be sent only to 
> the listeners in the same session as the query
> -
>
> Key: SPARK-18758
> URL: https://issues.apache.org/jira/browse/SPARK-18758
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Critical
>
> Listeners added with `sparkSession.streams.addListener(l)` are added to a 
> SparkSession, so only events from queries in the same session as a listener 
> should be posted to that listener.
> Currently, all the events get routed through Spark's main listener bus, and 
> therefore all StreamingQueryListener events get posted to 
> StreamingQueryListeners in all sessions. This is wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18758) StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query

2016-12-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18758:
--
Description: 
Listeners added with `sparkSession.streams.addListener(l)` are added to a 
SparkSession, so only events from queries in the same session as a listener 
should be posted to that listener.
Currently, all the events get routed through Spark's main listener bus, and 
therefore all StreamingQueryListener events get posted to 
StreamingQueryListeners in all sessions. This is wrong.

  was:Listeners added with `sparkSession.streams.addListener(l)` are added to a 
SparkSession. So events only from queries in the same session as a listener 
should be posted to the listener.


> StreamingQueryListener events from a StreamingQuery should be sent only to 
> the listeners in the same session as the query
> -
>
> Key: SPARK-18758
> URL: https://issues.apache.org/jira/browse/SPARK-18758
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Priority: Critical
>
> Listeners added with `sparkSession.streams.addListener(l)` are added to a 
> SparkSession, so only events from queries in the same session as a listener 
> should be posted to that listener.
> Currently, all the events get routed through Spark's main listener bus, and 
> therefore all StreamingQueryListener events get posted to 
> StreamingQueryListeners in all sessions. This is wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-06 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727542#comment-15727542
 ] 

Liang-Chi Hsieh commented on SPARK-18539:
-

[~lian cheng], in Parquet's code, it looks like a null column can still have 
its ColumnChunkMetaData. It won't cause a problem even before PARQUET-389, 
because Parquet will check whether all values in the chunk are null.

PARQUET-389 resolves the case where there is no ColumnChunkMetaData for a 
column, i.e., the column is missing from the Parquet file.

So what I am not sure about is: in a Parquet file, can a nullable column have 
no ColumnChunkMetaData, as you said?

I'd appreciate it if you could clarify. Thanks.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType))))
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true))))
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> 

[jira] [Created] (SPARK-18758) StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query

2016-12-06 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-18758:
-

 Summary: StreamingQueryListener events from a StreamingQuery 
should be sent only to the listeners in the same session as the query
 Key: SPARK-18758
 URL: https://issues.apache.org/jira/browse/SPARK-18758
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.0.2
Reporter: Tathagata Das
Priority: Critical


Listeners added with `sparkSession.streams.addListener(l)` are added to a 
SparkSession, so only events from queries in the same session as a listener 
should be posted to that listener.
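
A sketch of the usage pattern in question (assuming the Spark 2.1 StreamingQueryListener API): the listener below is registered on one session's query manager, so after this change it should only receive events for queries started from that session.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("listener-demo").master("local[*]").getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"[session A] started: $event")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"[session A] progress: $event")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"[session A] terminated: $event")
})

// Queries started from a different session should not be reported to the
// listener above once events stop going through the global listener bus.
val otherSession = spark.newSession()
{code}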



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18757) Models in Pyspark support column setters

2016-12-06 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-18757:
-
Description: 
Recently, I found three places in which column setters are missing: 
KMeansModel, BisectingKMeansModel and OneVsRestModel.
These three models directly inherit `Model` which dont have columns setters, so 
I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520].
Fow now, models in pyspark still don't support column setters at all.
I suggest that we keep the hierarchy of pyspark models in line with that in the 
scala side:
For classifiation and regression algs, I‘m making a trial in [SPARK-18379]. In 
it, I try to copy the hierarchy from the scala side.
For clustering algs, I think we may first create abstract classes 
{{ClusteringModel}} and {{ProbabilisticClusteringModel}} in the scala side, and 
make clustering algs inherit it. Then, in the python side, we copy the 
hierarchy so that we dont need to add setters manually for each alg.
For features algs, we can also use a abstract class {{FeatureModel}} in scala 
side, and do the same thing.

What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen]

  was:
Recently, I found three places in which column setters are missing: 
KMeansModel, BisectingKMeansModel and OneVsRestModel.
These three models directly inherit `Model` which dont have columns setters, so 
I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520].
Fow now, models in pyspark still don't support column setters at all.
I suggest that we keep the hierarchy of pyspark models in line with that in the 
scala side:
For classifiation and regression algs, I‘m making a trial in [SPARK-18379]. In 
it, I try to copy the hierarchy from the scala side.
For clustering algs, I think we may first create abstract classes 
{{ClusteringModel}} and {{ProbabilisticClusteringModel}} in the scala side, and 
make clustering algs inherit it. Then, in the python side, we copy the 
hierarchy so that we dont need to add setters for each alg.
For features algs, we can also use a abstract class {{FeatureModel}} in scala 
side, and do the same thing.

What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen]


> Models in Pyspark support column setters
> 
>
> Key: SPARK-18757
> URL: https://issues.apache.org/jira/browse/SPARK-18757
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Recently, I found three places in which column setters are missing: 
> KMeansModel, BisectingKMeansModel and OneVsRestModel.
> These three models directly inherit `Model`, which doesn't have column setters, 
> so I had to add the missing setters manually in [SPARK-18625] and 
> [SPARK-18520].
> For now, models in PySpark still don't support column setters at all.
> I suggest that we keep the hierarchy of PySpark models in line with that on 
> the Scala side:
> For classification and regression algs, I'm making a trial in [SPARK-18379]. 
> In it, I try to copy the hierarchy from the Scala side.
> For clustering algs, I think we may first create abstract classes 
> {{ClusteringModel}} and {{ProbabilisticClusteringModel}} on the Scala side, 
> and make clustering algs inherit them. Then, on the Python side, we copy the 
> hierarchy so that we don't need to add setters manually for each alg.
> For feature algs, we can also use an abstract class {{FeatureModel}} on the 
> Scala side, and do the same thing.
> What are your opinions? [~yanboliang][~josephkb][~sethah][~srowen]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18753) Inconsistent behavior after writing to parquet files

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727496#comment-15727496
 ] 

Apache Spark commented on SPARK-18753:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16184

> Inconsistent behavior after writing to parquet files
> 
>
> Key: SPARK-18753
> URL: https://issues.apache.org/jira/browse/SPARK-18753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>
> Found an inconsistent behavior when using parquet.
> {code}
> scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
> java.lang.Boolean, new java.lang.Boolean(false)).toDS
> ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]
> scala> ds.filter('value === "true").show
> +-+
> |value|
> +-+
> +-+
> {code}
> In the above example, `ds.filter('value === "true")` returns nothing, as 
> "true" will be converted to null and the filter expression will always be 
> null, so it drops all rows.
> However, if I store `ds` in a parquet file and read it back, `filter('value 
> === "true")` will return non-null values.
> {code}
> scala> ds.write.parquet("testfile")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> scala> val ds2 = spark.read.parquet("testfile")
> ds2: org.apache.spark.sql.DataFrame = [value: boolean]
> scala> ds2.filter('value === "true").show
> +-+
> |value|
> +-+
> | true|
> |false|
> +-+
> {code}
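
As an aside (not a fix for the inconsistency itself), a sketch of a comparison that avoids the string-to-boolean cast in question, using the same Dataset construction as above in a spark-shell session:

{code}
import spark.implicits._

val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: java.lang.Boolean,
  new java.lang.Boolean(false)).toDS

// Comparing against a Boolean literal keeps the predicate boolean end to end,
// so no "string cast to null" predicate is produced in the first place.
ds.filter('value === true).show()
{code}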



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18753) Inconsistent behavior after writing to parquet files

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18753:


Assignee: Apache Spark

> Inconsistent behavior after writing to parquet files
> 
>
> Key: SPARK-18753
> URL: https://issues.apache.org/jira/browse/SPARK-18753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Found an inconsistent behavior when using parquet.
> {code}
> scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
> java.lang.Boolean, new java.lang.Boolean(false)).toDS
> ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]
> scala> ds.filter('value === "true").show
> +-+
> |value|
> +-+
> +-+
> {code}
> In the above example, `ds.filter('value === "true")` returns nothing, as 
> "true" will be converted to null and the filter expression will always be 
> null, so it drops all rows.
> However, if I store `ds` in a parquet file and read it back, `filter('value 
> === "true")` will return non-null values.
> {code}
> scala> ds.write.parquet("testfile")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> scala> val ds2 = spark.read.parquet("testfile")
> ds2: org.apache.spark.sql.DataFrame = [value: boolean]
> scala> ds2.filter('value === "true").show
> +-+
> |value|
> +-+
> | true|
> |false|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18757) Models in Pyspark support column setters

2016-12-06 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-18757:
-
Description: 
Recently, I found three places in which column setters are missing: 
KMeansModel, BisectingKMeansModel and OneVsRestModel.
These three models directly inherit `Model` which dont have columns setters, so 
I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520].
Fow now, models in pyspark still don't support column setters at all.
I suggest that we keep the hierarchy of pyspark models in line with that in the 
scala side:
For classifiation and regression algs, I‘m making a trial in [SPARK-18379]. In 
it, I try to copy the hierarchy from the scala side.
For clustering algs, I think we may first create abstract classes 
{{ClusteringModel}} and {{ProbabilisticClusteringModel}} in the scala side, and 
make clustering algs inherit it. Then, in the python side, we copy the 
hierarchy so that we dont need to add setters for each alg.
For features algs, we can also use a abstract class {{FeatureModel}} in scala 
side, and do the same thing.

What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen]

  was:
Recently, I found three places in which column setters are missing: 
KMeansModel, BisectingKMeansModel and OneVsRestModel.
These three models directly inherit `Model` which dont have columns setters, so 
I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520].
Fow now, models in pyspark still don't support column setters at all.
I suggest that we keep the hierarchy of pyspark models in line with that in the 
scala side:
For classifiation and regression algs, I‘m making a trial in [SPARK-18379]. In 
it, I try to copy the hierarchy from the scala side.
For clustering algs, I think we may first create abstract classes 
{{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering 
algs inherit it. Then, in the python side, we copy the hierarchy so that we 
dont need to add setters for each alg.
For features algs, we can also use a abstract class {{FeatureModel}} in scala 
side, and do the same thing.

What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen]


> Models in Pyspark support column setters
> 
>
> Key: SPARK-18757
> URL: https://issues.apache.org/jira/browse/SPARK-18757
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Recently, I found three places in which column setters are missing: 
> KMeansModel, BisectingKMeansModel and OneVsRestModel.
> These three models directly inherit `Model`, which doesn't have column setters, 
> so I had to add the missing setters manually in [SPARK-18625] and 
> [SPARK-18520].
> For now, models in PySpark still don't support column setters at all.
> I suggest that we keep the hierarchy of PySpark models in line with that on 
> the Scala side:
> For classification and regression algs, I'm making a trial in [SPARK-18379]. 
> In it, I try to copy the hierarchy from the Scala side.
> For clustering algs, I think we may first create abstract classes 
> {{ClusteringModel}} and {{ProbabilisticClusteringModel}} on the Scala side, 
> and make clustering algs inherit them. Then, on the Python side, we copy the 
> hierarchy so that we don't need to add setters for each alg.
> For feature algs, we can also use an abstract class {{FeatureModel}} on the 
> Scala side, and do the same thing.
> What are your opinions? [~yanboliang][~josephkb][~sethah][~srowen]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18753) Inconsistent behavior after writing to parquet files

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18753:


Assignee: (was: Apache Spark)

> Inconsistent behavior after writing to parquet files
> 
>
> Key: SPARK-18753
> URL: https://issues.apache.org/jira/browse/SPARK-18753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>
> Found an inconsistent behavior when using parquet.
> {code}
> scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
> java.lang.Boolean, new java.lang.Boolean(false)).toDS
> ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]
> scala> ds.filter('value === "true").show
> +-+
> |value|
> +-+
> +-+
> {code}
> In the above example, `ds.filter('value === "true")` returns nothing, as 
> "true" will be converted to null and the filter expression will always be 
> null, so it drops all rows.
> However, if I store `ds` in a parquet file and read it back, `filter('value 
> === "true")` will return non-null values.
> {code}
> scala> ds.write.parquet("testfile")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> scala> val ds2 = spark.read.parquet("testfile")
> ds2: org.apache.spark.sql.DataFrame = [value: boolean]
> scala> ds2.filter('value === "true").show
> +-+
> |value|
> +-+
> | true|
> |false|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18757) Models in Pyspark support column setters

2016-12-06 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-18757:
-
Description: 
Recently, I found three places in which column setters are missing: 
KMeansModel, BisectingKMeansModel and OneVsRestModel.
These three models directly inherit `Model` which dont have columns setters, so 
I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520].
Fow now, models in pyspark still don't support column setters at all.
I suggest that we keep the hierarchy of pyspark models in line with that in the 
scala side:
For classifiation and regression algs, I‘m making a trial in [SPARK-18379]. In 
it, I try to copy the hierarchy from the scala side.
For clustering algs, I think we may first create abstract classes 
{{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering 
algs inherit it. Then, in the python side, we copy the hierarchy so that we 
dont need to add setters for each alg.
For features algs, we can also use a abstract class {{FeatureModel}} in scala 
side, and do the same thing.

What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen]

  was:
Recently, I found three places in which column setters are missing: 
KMeansModel, BisectingKMeansModel and OneVsRestModel.
These three models directly inherit `Model` which dont have columns setters, so 
I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520].
Fow now, models in pyspark still don't support column setters at all.
I suggest that we keep the hierarchy of pyspark models in line with that in the 
scala side:
For classifiation and regression algs, I‘m making a trial in [SPARK-18379]
For clustering algs, I think we may first create abstract classes 
{{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering 
algs inherit it. Then, in the python side, we copy the hierarchy so that we 
dont need to add setters for each alg.
For features algs, we can also use a abstract class {{FeatureModel}} in scala 
side, and do the same thing.

What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen]


> Models in Pyspark support column setters
> 
>
> Key: SPARK-18757
> URL: https://issues.apache.org/jira/browse/SPARK-18757
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Recently, I found three places in which column setters are missing: 
> KMeansModel, BisectingKMeansModel and OneVsRestModel.
> These three models directly inherit `Model`, which doesn't have column setters, 
> so I had to add the missing setters manually in [SPARK-18625] and 
> [SPARK-18520].
> For now, models in PySpark still don't support column setters at all.
> I suggest that we keep the hierarchy of PySpark models in line with that on 
> the Scala side:
> For classification and regression algs, I'm making a trial in [SPARK-18379]. 
> In it, I try to copy the hierarchy from the Scala side.
> For clustering algs, I think we may first create abstract classes 
> {{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering 
> algs inherit them. Then, on the Python side, we copy the hierarchy so that we 
> don't need to add setters for each alg.
> For feature algs, we can also use an abstract class {{FeatureModel}} on the 
> Scala side, and do the same thing.
> What are your opinions? [~yanboliang][~josephkb][~sethah][~srowen]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18757) Models in Pyspark support column setters

2016-12-06 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-18757:
-
Description: 
Recently, I found three places in which column setters are missing: 
KMeansModel, BisectingKMeansModel and OneVsRestModel.
These three models directly inherit `Model` which dont have columns setters, so 
I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520].
Fow now, models in pyspark still don't support column setters at all.
I suggest that we keep the hierarchy of pyspark models in line with that in the 
scala side:
For classifiation and regression algs, I‘m making a trial in [SPARK-18379]
For clustering algs, I think we may first create abstract classes 
{{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering 
algs inherit it. Then, in the python side, we copy the hierarchy so that we 
dont need to add setters for each alg.
For features algs, we can also use a abstract class {{FeatureModel}} in scala 
side, and do the same thing.

What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen]

  was:
Recently, I found three places in which column setters are missing: 
KMeansModel, BisectingKMeansModel and BisectingKMeansModel.
These three models directly inherit `Model` which dont have columns setters, so 
I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520].
Fow now, models in pyspark still don't support column setters at all.
I suggest that we keep the hierarchy of pyspark models in line with that in the 
scala side:
For classifiation and regression algs, I‘m making a trial in [SPARK-18379]
For clustering algs, I think we may first create abstract classes 
{{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering 
algs inherit it. Then, in the python side, we copy the hierarchy so that we 
dont need to add setters for each alg.
For features algs, we can also use a abstract class {{FeatureModel}} in scala 
side, and do the same thing.

What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen]


> Models in Pyspark support column setters
> 
>
> Key: SPARK-18757
> URL: https://issues.apache.org/jira/browse/SPARK-18757
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Recently, I found three places in which column setters are missing: 
> KMeansModel, BisectingKMeansModel and OneVsRestModel.
> These three models directly inherit `Model`, which doesn't have column setters, 
> so I had to add the missing setters manually in [SPARK-18625] and 
> [SPARK-18520].
> For now, models in PySpark still don't support column setters at all.
> I suggest that we keep the hierarchy of PySpark models in line with that on 
> the Scala side:
> For classification and regression algs, I'm making a trial in [SPARK-18379].
> For clustering algs, I think we may first create abstract classes 
> {{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering 
> algs inherit them. Then, on the Python side, we copy the hierarchy so that we 
> don't need to add setters for each alg.
> For feature algs, we can also use an abstract class {{FeatureModel}} on the 
> Scala side, and do the same thing.
> What are your opinions? [~yanboliang][~josephkb][~sethah][~srowen]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18757) Models in Pyspark support column setters

2016-12-06 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18757:


 Summary: Models in Pyspark support column setters
 Key: SPARK-18757
 URL: https://issues.apache.org/jira/browse/SPARK-18757
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, PySpark
Reporter: zhengruifeng


Recently, I found three places in which column setters are missing: 
KMeansModel, BisectingKMeansModel and BisectingKMeansModel.
These three models directly inherit `Model`, which doesn't have column setters, 
so I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520].
For now, models in PySpark still don't support column setters at all.
I suggest that we keep the hierarchy of PySpark models in line with that on the 
Scala side:
For classification and regression algs, I'm making a trial in [SPARK-18379].
For clustering algs, I think we may first create abstract classes 
{{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering 
algs inherit them. Then, on the Python side, we copy the hierarchy so that we 
don't need to add setters for each alg.
For feature algs, we can also use an abstract class {{FeatureModel}} on the 
Scala side, and do the same thing.

What are your opinions? [~yanboliang][~josephkb][~sethah][~srowen]
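
One possible shape of the proposal, as a self-contained sketch (illustrative classes, not Spark's real ml.Model hierarchy): the column setters live once in an abstract parent, concrete clustering models only add their own members, and the PySpark wrappers could then mirror a single hierarchy instead of re-adding setters per algorithm.

{code}
// Illustrative sketch only; real Spark models get these setters from the
// shared Params traits rather than plain fields.
abstract class ClusteringModelSketch {
  protected var featuresCol: String = "features"
  protected var predictionCol: String = "prediction"

  def setFeaturesCol(value: String): this.type = { featuresCol = value; this }
  def setPredictionCol(value: String): this.type = { predictionCol = value; this }
}

class KMeansModelSketch extends ClusteringModelSketch {
  def clusterCenters: Array[Array[Double]] = Array.empty  // placeholder
}

// Setters are inherited, not re-declared per model.
val model = new KMeansModelSketch().setFeaturesCol("scaled_features")
{code}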



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2016-12-06 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727475#comment-15727475
 ] 

Marcelo Vanzin commented on SPARK-18085:


I'm not trying to flame you. I'm trying to point out that the issues you 
raised, while valid on their own, are not related to the problem described in 
this bug, and trying to discuss them here is counter-productive. If you care 
about those issues, you should open separate bugs.

The SHS memory issues are caused neither by the event log format nor by its 
size. The SHS does not load the whole event log into memory, nor does it keep 
anything JSON-formatted in memory. So the fact that the event logs are in JSON 
is not relevant to how much memory the SHS uses.

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18756) Memory leak in Spark streaming

2016-12-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727466#comment-15727466
 ] 

Sean Owen commented on SPARK-18756:
---

CC [~zsxwing] is this related to the netty byte buffer stuff you've been 
dealing with for a while?

> Memory leak in Spark streaming
> --
>
> Key: SPARK-18756
> URL: https://issues.apache.org/jira/browse/SPARK-18756
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Udit Mehrotra
>
> We have a Spark Streaming application that processes data from Kinesis.
> In our application we are observing a memory leak at the executors, with 
> Netty buffers not being released properly when the Spark BlockManager tries 
> to replicate the input blocks received from the Kinesis stream. The leak 
> occurs when we set the storage level to MEMORY_AND_DISK_2 for the Kinesis 
> input blocks. However, if we change the storage level to MEMORY_AND_DISK, 
> which avoids creating a replica, we do not observe the leak any more. We were 
> able to detect the leak and obtain the stack trace by running the executors 
> with an additional JVM option: -Dio.netty.leakDetectionLevel=advanced.
> Here is the stack trace of the leak:
> 16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not 
> called before it's garbage-collected. See 
> http://netty.io/wiki/reference-counted-objects.html for more information.
> Recent access records: 0
> Created at:
>   io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:103)
>   io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335)
>   io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247)
>   
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69)
>   
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182)
>   
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997)
>   
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:702)
>   
> org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80)
>   
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
>   
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
>   org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
>   
> org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282)
>   
> org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352)
>   
> org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)
>   
> org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)
>   
> org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)
> We also observe a continuous increase in off heap memory usage at the 
> executors. Any help would be appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2016-12-06 Thread Dmitry Buzolin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727446#comment-15727446
 ] 

Dmitry Buzolin commented on SPARK-18085:


I did not post my comments to start an endless flame war about what is 
orthogonal and what is not.
It is up to you how to use them. I speak from my experience running Spark 
clusters of substantial size.
If you think offloading the problem from memory to disk storage is the way to 
go, do it. I'd be happy to see SHS performance improvements in the next Spark 
release.


> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18756) Memory leak in Spark streaming

2016-12-06 Thread Udit Mehrotra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated SPARK-18756:
--
Description: 
We have a Spark Streaming application that processes data from Kinesis.

In our application we are observing a memory leak at the executors, with Netty 
buffers not being released properly when the Spark BlockManager tries to 
replicate the input blocks received from the Kinesis stream. The leak occurs 
when we set the storage level to MEMORY_AND_DISK_2 for the Kinesis input 
blocks. However, if we change the storage level to MEMORY_AND_DISK, which 
avoids creating a replica, we do not observe the leak any more. We were able 
to detect the leak and obtain the stack trace by running the executors with an 
additional JVM option: -Dio.netty.leakDetectionLevel=advanced.

Here is the stack trace of the leak:

16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not 
called before it's garbage-collected. See 
http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 0
Created at:
io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:103)
io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335)
io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247)

org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69)

org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182)

org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997)

org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)

org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)

org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:702)

org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80)

org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)

org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)

org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282)

org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352)

org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)

org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)

org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)

We also observe a continuous increase in off heap memory usage at the 
executors. Any help would be appreciated.
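For reference, below is a minimal sketch of the setup described above, assuming the 
{{KinesisUtils.createStream}} overload from spark-streaming-kinesis-asl that takes no 
message handler; the application name, stream name, endpoint and intervals are 
placeholders, not our actual configuration:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val conf = new SparkConf()
  .setAppName("kinesis-leak-repro")
  // surfaces the "LEAK: ByteBuf.release() was not called" messages shown above
  // (could equally be passed via spark-submit --conf)
  .set("spark.executor.extraJavaOptions", "-Dio.netty.leakDetectionLevel=advanced")

val ssc = new StreamingContext(conf, Seconds(10))

// MEMORY_AND_DISK_2 asks the BlockManager to replicate each block, which is the
// code path where we observe the leak; MEMORY_AND_DISK (no replica) does not leak.
val stream = KinesisUtils.createStream(
  ssc, "my-app", "my-stream", "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.LATEST, Seconds(10),
  StorageLevel.MEMORY_AND_DISK_2)

// ... build the rest of the pipeline, then ssc.start()
{code}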

> Memory leak in Spark streaming
> --
>
> Key: SPARK-18756
> URL: https://issues.apache.org/jira/browse/SPARK-18756
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Udit Mehrotra
>
> We have a Spark streaming application that processes data from Kinesis.
> In our application we are observing a memory leak at the executors, with Netty 
> buffers not being released properly when the Spark BlockManager tries to 
> replicate the input blocks received from the Kinesis stream. The leak occurs 
> when we set the storage level to MEMORY_AND_DISK_2 for the Kinesis input blocks. 
> However, if we change the storage level to MEMORY_AND_DISK, which avoids 
> creating a replica, we no longer observe the leak. We were able to 
> detect the leak and obtain the stack trace by running the executors with an 
> additional JVM option: -Dio.netty.leakDetectionLevel=advanced.
> Here is the stack trace of the leak:
> 16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not 
> called before it's garbage-collected. See 
> http://netty.io/wiki/reference-counted-objects.html for more information.
> Recent access records: 0
> Created at:
>   io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:103)
>   io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335)
>   io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247)
>   
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69)
>   
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182)
>   
> 

[jira] [Created] (SPARK-18756) Memory leak in Spark streaming

2016-12-06 Thread Udit Mehrotra (JIRA)
Udit Mehrotra created SPARK-18756:
-

 Summary: Memory leak in Spark streaming
 Key: SPARK-18756
 URL: https://issues.apache.org/jira/browse/SPARK-18756
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, DStreams
Affects Versions: 2.0.2, 2.0.1, 2.0.0
Reporter: Udit Mehrotra






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18739) Models in pyspark.classification and regression support setXXXCol methods

2016-12-06 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-18739:
-
Summary: Models in pyspark.classification and regression support setXXXCol 
methods  (was: Models in pyspark.classification support setXXXCol methods)

> Models in pyspark.classification and regression support setXXXCol methods
> -
>
> Key: SPARK-18739
> URL: https://issues.apache.org/jira/browse/SPARK-18739
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Currently, models in pyspark don't support {{setXXXCol}} methods at all.
> I updated the models in {{classification.py}} according to the hierarchy on the 
> scala side:
> 1, add {{setFeaturesCol}} and {{setPredictionCol}} in class 
> {{JavaPredictionModel}}
> 2, add {{setRawPredictionCol}} in class {{JavaClassificationModel}}
> 3, create class {{JavaProbabilisticClassificationModel}} inheriting from 
> {{JavaClassificationModel}}, and add {{setProbabilityCol}} in it
> 4, make {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, 
> {{RandomForestClassificationModel}} and {{NaiveBayesModel}} inherit 
> {{JavaProbabilisticClassificationModel}}
> 5, make {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} 
> inherit {{JavaClassificationModel}}
> 6, make {{OneVsRestModel}} inherit {{JavaModel}}, and add {{setFeaturesCol}} and 
> {{setPredictionCol}} methods.
> With regard to models in clustering and features, I suggest that we first add 
> some abstract classes like {{ClusteringModel}}, 
> {{ProbabilisticClusteringModel}} and {{FeatureModel}} on the scala side; 
> otherwise we need to manually add setXXXCol methods one by one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18736) CreateMap allows non-unique keys

2016-12-06 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727398#comment-15727398
 ] 

Shuai Lin commented on SPARK-18736:
---

Ok, sounds good to me.

> CreateMap allows non-unique keys
> 
>
> Key: SPARK-18736
> URL: https://issues.apache.org/jira/browse/SPARK-18736
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eyal Farago
>  Labels: map, sql, types
>
> In Spark SQL, {{CreateMap}} does not enforce unique keys, i.e. it is possible to 
> create a map with two identical keys: 
> {noformat}
> CreateMap(Literal(1), Literal(11), Literal(1), Literal(12))
> {noformat}
> This does not behave like standard maps in common programming languages.
> A proper behavior should be chosen:
> # first 'wins'
> # last 'wins'
> # runtime error.
> {{GetMapValue}} currently implements option #1. Even if this is the desired 
> behavior, {{CreateMap}} should return a map with unique keys.
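
For illustration, a minimal spark-shell sketch of the behavior described above, going 
through the SQL {{map}} function (which is backed by {{CreateMap}}); the expected 
lookup result reflects the "first wins" behavior of {{GetMapValue}} noted above:

{code}
// sketch only: build a map with a duplicate key via the SQL map() function,
// which goes through CreateMap
spark.sql("SELECT map(1, 11, 1, 12) AS m").show(false)

// lookups go through GetMapValue; per the description above, the first entry
// for key 1 (value 11) is expected to win
spark.sql("SELECT map(1, 11, 1, 12)[1] AS v").show()
{code}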



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18755) Add Randomized Grid Search to Spark ML

2016-12-06 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-18755:
---
Description: 
Randomized Grid Search  implements a randomized search over parameters, where 
each setting is sampled from a distribution over possible parameter values. 
This has two main benefits over an exhaustive search:
1. A budget can be chosen independent of the number of parameters and possible 
values.
2. Adding parameters that do not influence the performance does not decrease 
efficiency.

Randomized Grid search usually gives similar result as exhaustive search, while 
the run time for randomized search is drastically lower.

For more background, please refer to:

sklearn: http://scikit-learn.org/stable/modules/grid_search.html
http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/.

There are two ways to implement this in Spark, as I see it:
1. Add a searchRatio to ParamGridBuilder and conduct sampling directly during 
build. Only one new public function is required.
2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator and 
RandomizedTrainValidationSplit, which can be complicated since we need to deal 
with the models.

I'd prefer option 1 as it's much simpler and more straightforward. We can support 
randomized grid search with a minimal change.


  was:
Randomized Grid Search  implements a randomized search over parameters, where 
each setting is sampled from a distribution over possible parameter values. 
This has two main benefits over an exhaustive search:
1. A budget can be chosen independent of the number of parameters and possible 
values.
2. Adding parameters that do not influence the performance does not decrease 
efficiency.

Randomized Grid search usually gives similar result as exhaustive search, while 
the run time for randomized search is drastically lower.

For more background, please refer to:

sklearn: http://scikit-learn.org/stable/modules/grid_search.html
http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/.

There are two ways to implement this in Spark, as I see it:
1. Add a searchRatio to ParamGridBuilder and conduct sampling directly during 
build. Only one new public function is required.
2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator and 
RandomizedTrainValidationSplit, which can be complicated since we need to deal 
with the models.

I'd prefer option 1 as it's much simpler and more straightforward.



> Add Randomized Grid Search to Spark ML
> --
>
> Key: SPARK-18755
> URL: https://issues.apache.org/jira/browse/SPARK-18755
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> Randomized Grid Search  implements a randomized search over parameters, where 
> each setting is sampled from a distribution over possible parameter values. 
> This has two main benefits over an exhaustive search:
> 1. A budget can be chosen independent of the number of parameters and 
> possible values.
> 2. Adding parameters that do not influence the performance does not decrease 
> efficiency.
> Randomized Grid search usually gives similar result as exhaustive search, 
> while the run time for randomized search is drastically lower.
> For more background, please refer to:
> sklearn: http://scikit-learn.org/stable/modules/grid_search.html
> http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
> http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
> https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/.
> There are two ways to implement this in Spark, as I see it:
> 1. Add a searchRatio to ParamGridBuilder and conduct sampling directly during 
> build. Only one new public function is required.
> 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator 
> and RandomizedTrainValidationSplit, which can be complicated since we need to 
> deal with the models.
> I'd prefer option 1 as it's much simpler and more straightforward. We can support 
> randomized grid search with a minimal change.
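
To make option 1 concrete, here is a rough sketch of sampling a subset of an 
exhaustively built grid before handing it to CrossValidator. The {{searchRatio}} name 
follows the proposal above; nothing here is an actual Spark API for randomized search:

{code}
import scala.util.Random
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// exhaustive grid: 3 x 3 = 9 candidate ParamMaps
val fullGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// option 1 in spirit: keep only a random fraction of the grid
val searchRatio = 0.4  // hypothetical parameter
val budget = math.max(1, math.ceil(fullGrid.length * searchRatio).toInt)
val sampledGrid = Random.shuffle(fullGrid.toSeq).take(budget).toArray

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(sampledGrid)
  .setNumFolds(3)
{code}

With something like this, the evaluation budget is decoupled from the grid size, which 
is the main benefit of randomized search cited above.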



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18755) Add Randomized Grid Search to Spark ML

2016-12-06 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-18755:
---
Description: 
Randomized Grid Search  implements a randomized search over parameters, where 
each setting is sampled from a distribution over possible parameter values. 
This has two main benefits over an exhaustive search:
1. A budget can be chosen independent of the number of parameters and possible 
values.
2. Adding parameters that do not influence the performance does not decrease 
efficiency.

Randomized Grid search usually gives similar result as exhaustive search, while 
the run time for randomized search is drastically lower.

For more background, please refer to:

sklearn: http://scikit-learn.org/stable/modules/grid_search.html
http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/.

There're two ways to implement this in Spark as I see:
1. Add searchRatio to ParamGridBuilder and conduct sampling directly during 
build. Only 1 new public function is required.
2. Add trait RadomizedSearch and create new class RandomizedCrossValidator and 
RandomizedTrainValiationSplit, which can be complicated since we need to deal 
with the models.

I'd prefer option 1 as it's much simpler and straightforward.


  was:
Randomized Grid Search  implements a randomized search over parameters, where 
each setting is sampled from a distribution over possible parameter values. 
This has two main benefits over an exhaustive search:
1. A budget can be chosen independent of the number of parameters and possible 
values.
2. Adding parameters that do not influence the performance does not decrease 
efficiency.

Randomized Grid search usually gives similar result as exhaustive search, while 
the run time for randomized search is drastically lower.

For more background, please refer to:

sklearn: http://scikit-learn.org/stable/modules/grid_search.html
http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/.

There're two ways to implement this in Spark as I see:
1. Add searchRatio to ParamGridBuilder and conduct sampling directly during 
build.
2. Add trait RadomizedSearch and create new class RandomizedCrossValidator and 
RandomizedTrainValiationSplit.

I'd prefer option 1 as it's much simpler and straightforward.



> Add Randomized Grid Search to Spark ML
> --
>
> Key: SPARK-18755
> URL: https://issues.apache.org/jira/browse/SPARK-18755
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> Randomized Grid Search  implements a randomized search over parameters, where 
> each setting is sampled from a distribution over possible parameter values. 
> This has two main benefits over an exhaustive search:
> 1. A budget can be chosen independent of the number of parameters and 
> possible values.
> 2. Adding parameters that do not influence the performance does not decrease 
> efficiency.
> Randomized Grid search usually gives similar result as exhaustive search, 
> while the run time for randomized search is drastically lower.
> For more background, please refer to:
> sklearn: http://scikit-learn.org/stable/modules/grid_search.html
> http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
> http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
> https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/.
> There're two ways to implement this in Spark as I see:
> 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during 
> build. Only 1 new public function is required.
> 2. Add trait RadomizedSearch and create new class RandomizedCrossValidator 
> and RandomizedTrainValiationSplit, which can be complicated since we need to 
> deal with the models.
> I'd prefer option 1 as it's much simpler and straightforward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18671) Add tests to ensure stability of all Structured Streaming log formats

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727365#comment-15727365
 ] 

Apache Spark commented on SPARK-18671:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16183

> Add tests to ensure stability of all Structured Streaming log formats
> --
>
> Key: SPARK-18671
> URL: https://issues.apache.org/jira/browse/SPARK-18671
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.1.0
>
>
> To be able to restart StreamingQueries across Spark versions, we have already 
> made the logs (offset log, file source log, file sink log) use JSON. We 
> should add tests with actual JSON files in Spark so that any 
> incompatible change in reading the logs is caught immediately. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18755) Add Randomized Grid Search to Spark ML

2016-12-06 Thread yuhao yang (JIRA)
yuhao yang created SPARK-18755:
--

 Summary: Add Randomized Grid Search to Spark ML
 Key: SPARK-18755
 URL: https://issues.apache.org/jira/browse/SPARK-18755
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: yuhao yang


Randomized Grid Search  implements a randomized search over parameters, where 
each setting is sampled from a distribution over possible parameter values. 
This has two main benefits over an exhaustive search:
1. A budget can be chosen independent of the number of parameters and possible 
values.
2. Adding parameters that do not influence the performance does not decrease 
efficiency.

Randomized Grid search usually gives similar result as exhaustive search, while 
the run time for randomized search is drastically lower.

For more background, please refer to:

sklearn: http://scikit-learn.org/stable/modules/grid_search.html
http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/.

There're two ways to implement this in Spark as I see:
1. Add searchRatio to ParamGridBuilder and conduct sampling directly during 
build.
2. Add trait RadomizedSearch and create new class RandomizedCrossValidator and 
RandomizedTrainValiationSplit.

I'd prefer option 1 as it's much simpler and straightforward.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18754) Rename recentProgresses to recentProgress

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18754:


Assignee: Michael Armbrust  (was: Apache Spark)

> Rename recentProgresses to recentProgress
> -
>
> Key: SPARK-18754
> URL: https://issues.apache.org/jira/browse/SPARK-18754
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>
> An informal poll of a bunch of users found this name to be more clear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18754) Rename recentProgresses to recentProgress

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18754:


Assignee: Apache Spark  (was: Michael Armbrust)

> Rename recentProgresses to recentProgress
> -
>
> Key: SPARK-18754
> URL: https://issues.apache.org/jira/browse/SPARK-18754
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>
> An informal poll of a bunch of users found this name to be more clear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18754) Rename recentProgresses to recentProgress

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727318#comment-15727318
 ] 

Apache Spark commented on SPARK-18754:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/16182

> Rename recentProgresses to recentProgress
> -
>
> Key: SPARK-18754
> URL: https://issues.apache.org/jira/browse/SPARK-18754
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>
> An informal poll of a bunch of users found this name to be more clear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18754) Rename recentProgresses to recentProgress

2016-12-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18754:
-
Target Version/s: 2.1.0

> Rename recentProgresses to recentProgress
> -
>
> Key: SPARK-18754
> URL: https://issues.apache.org/jira/browse/SPARK-18754
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>
> An informal poll of a bunch of users found this name to be more clear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18754) Rename recentProgresses to recentProgress

2016-12-06 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18754:


 Summary: Rename recentProgresses to recentProgress
 Key: SPARK-18754
 URL: https://issues.apache.org/jira/browse/SPARK-18754
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Reporter: Michael Armbrust
Assignee: Michael Armbrust


An informal poll of a bunch of users found this name to be more clear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18697) Upgrade sbt plugins

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18697:


Assignee: Apache Spark  (was: Weiqing Yang)

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Assignee: Apache Spark
>Priority: Trivial
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbt-assembly: 0.11.2 -> 0.14.3
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 
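
For context, the bumps above would land in {{project/plugins.sbt}} roughly as follows; 
the organization ids below are my best recollection, not lines taken from the actual 
patch:

{code}
// project/plugins.sbt (sketch of the proposed versions)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.0.1")
addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.1.12")

// asm is a library dependency of the build, not an sbt plugin
libraryDependencies += "org.ow2.asm" % "asm" % "5.1"
libraryDependencies += "org.ow2.asm" % "asm-commons" % "5.1"
{code}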



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18697) Upgrade sbt plugins

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18697:


Assignee: Weiqing Yang  (was: Apache Spark)

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Assignee: Weiqing Yang
>Priority: Trivial
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbt-assembly: 0.11.2 -> 0.14.3
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18697) Upgrade sbt plugins

2016-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18697:
--
Fix Version/s: (was: 2.2.0)

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Assignee: Weiqing Yang
>Priority: Trivial
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbt-assembly: 0.11.2 -> 0.14.3
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18734) Represent timestamp in StreamingQueryProgress as formatted string instead of millis

2016-12-06 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18734.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

> Represent timestamp in StreamingQueryProgress as formatted string instead of 
> millis
> ---
>
> Key: SPARK-18734
> URL: https://issues.apache.org/jira/browse/SPARK-18734
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.1.0
>
>
> Easier to read when debugging



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-18697) Upgrade sbt plugins

2016-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-18697:
---

I had to revert this because it didn't work with Scala 2.10

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Assignee: Weiqing Yang
>Priority: Trivial
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbt-assembly: 0.11.2 -> 0.14.3
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18752:


Assignee: Apache Spark

> "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
> --
>
> Key: SPARK-18752
> URL: https://issues.apache.org/jira/browse/SPARK-18752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> We ran into an issue with the HiveShim code that calls "loadTable" and 
> "loadPartition" while testing with some recent changes in upstream Hive.
> The semantics in Hive changed slightly, and if you provide the wrong value 
> for "isSrcLocal" you now can end up with an invalid table: the Hive code will 
> move the temp directory to the final destination instead of moving its 
> children.
> The problem in Spark is that HiveShim.scala tries to figure out the value of 
> "isSrcLocal" based on where the source and target directories are; that's not 
> correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA 
> LOCAL" would set it to "true"). So we need to propagate that information from 
> the user query down to HiveShim.
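
To illustrate the distinction with a hypothetical session (the table and paths are 
made up), the flag should be driven purely by whether the user wrote LOCAL:

{code}
// The statement text is the only reliable signal for isSrcLocal:
spark.sql("LOAD DATA LOCAL INPATH '/tmp/staging/part-00000' INTO TABLE t")          // isSrcLocal = true
spark.sql("LOAD DATA INPATH 'hdfs:///warehouse/staging/part-00000' INTO TABLE t")   // isSrcLocal = false
{code}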



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18752:


Assignee: (was: Apache Spark)

> "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
> --
>
> Key: SPARK-18752
> URL: https://issues.apache.org/jira/browse/SPARK-18752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> We ran into an issue with the HiveShim code that calls "loadTable" and 
> "loadPartition" while testing with some recent changes in upstream Hive.
> The semantics in Hive changed slightly, and if you provide the wrong value 
> for "isSrcLocal" you now can end up with an invalid table: the Hive code will 
> move the temp directory to the final destination instead of moving its 
> children.
> The problem in Spark is that HiveShim.scala tries to figure out the value of 
> "isSrcLocal" based on where the source and target directories are; that's not 
> correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA 
> LOCAL" would set it to "true"). So we need to propagate that information from 
> the user query down to HiveShim.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727203#comment-15727203
 ] 

Apache Spark commented on SPARK-18752:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16179

> "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
> --
>
> Key: SPARK-18752
> URL: https://issues.apache.org/jira/browse/SPARK-18752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> We ran into an issue with the HiveShim code that calls "loadTable" and 
> "loadPartition" while testing with some recent changes in upstream Hive.
> The semantics in Hive changed slightly, and if you provide the wrong value 
> for "isSrcLocal" you now can end up with an invalid table: the Hive code will 
> move the temp directory to the final destination instead of moving its 
> children.
> The problem in Spark is that HiveShim.scala tries to figure out the value of 
> "isSrcLocal" based on where the source and target directories are; that's not 
> correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA 
> LOCAL" would set it to "true"). So we need to propagate that information from 
> the user query down to HiveShim.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18753) Inconsistent behavior after writing to parquet files

2016-12-06 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727199#comment-15727199
 ] 

Shixiong Zhu commented on SPARK-18753:
--

cc [~liancheng]

> Inconsistent behavior after writing to parquet files
> 
>
> Key: SPARK-18753
> URL: https://issues.apache.org/jira/browse/SPARK-18753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>
> Found an inconsistent behavior when using parquet.
> {code}
> scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
> java.lang.Boolean, new java.lang.Boolean(false)).toDS
> ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]
> scala> ds.filter('value === "true").show
> +-+
> |value|
> +-+
> +-+
> {code}
> In the above example, `ds.filter('value === "true")` returns nothing as 
> "true" will be converted to null and the filter expression will be always 
> null, so it drops all rows.
> However, if I store `ds` to a parquet file and read it back, `filter('value 
> === "true")` will return non null values.
> {code}
> scala> ds.write.parquet("testfile")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> scala> val ds2 = spark.read.parquet("testfile")
> ds2: org.apache.spark.sql.DataFrame = [value: boolean]
> scala> ds2.filter('value === "true").show
> +-+
> |value|
> +-+
> | true|
> |false|
> +-+
> {code}
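
One way to see where the two paths diverge (a debugging sketch continuing the session 
above, not a fix) is to compare the analyzed and physical plans for the two datasets; 
the plans may show how the predicate on the string literal is handled differently in 
each case:

{code}
// sketch: compare how the predicate (value === "true") is planned for the
// in-memory Dataset vs. the one read back from parquet
ds.filter('value === "true").explain(true)
ds2.filter('value === "true").explain(true)
{code}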



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18753) Inconsistent behavior after writing to parquet files

2016-12-06 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18753:
-
Description: 
Found an inconsistent behavior when using parquet.

{code}
scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
java.lang.Boolean, new java.lang.Boolean(false)).toDS
ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]

scala> ds.filter('value === "true").show
+-+
|value|
+-+
+-+

{code}

In the above example, `ds.filter('value === "true")` returns nothing as "true" 
will be converted to null and the filter expression will be always null, so it 
drops all rows.

However, if I store `ds` to a parquet file and read it back, `filter('value === 
"true")` will return non null values.

{code}
scala> ds.write.parquet("testfile")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.

scala> val ds2 = spark.read.parquet("testfile")
ds2: org.apache.spark.sql.DataFrame = [value: boolean]

scala> ds2.filter('value === "true").show
+-+
|value|
+-+
| true|
|false|
+-+

{code}

  was:
Found an inconsistent behavior when using parquet.

{code}
scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
java.lang.Boolean, new java.lang.Boolean(false)).toDS
ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]

scala> ds.filter('value === "true").show
+-+
|value|
+-+
+-+

{code}

In the above example, `ds.filter('value === "true")` returns nothing as "true" 
will be converted to null and the filter expression will be always null.

However, if I store `ds` to a parquet file and read it back, `filter('value === 
"true")` will return non null values.

{code}
scala> ds.write.parquet("testfile")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.

scala> val ds2 = spark.read.parquet("testfile")
ds2: org.apache.spark.sql.DataFrame = [value: boolean]

scala> ds2.filter('value === "true").show
+-+
|value|
+-+
| true|
|false|
+-+

{code}


> Inconsistent behavior after writing to parquet files
> 
>
> Key: SPARK-18753
> URL: https://issues.apache.org/jira/browse/SPARK-18753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>
> Found an inconsistent behavior when using parquet.
> {code}
> scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
> java.lang.Boolean, new java.lang.Boolean(false)).toDS
> ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]
> scala> ds.filter('value === "true").show
> +-+
> |value|
> +-+
> +-+
> {code}
> In the above example, `ds.filter('value === "true")` returns nothing as 
> "true" will be converted to null and the filter expression will be always 
> null, so it drops all rows.
> However, if I store `ds` to a parquet file and read it back, `filter('value 
> === "true")` will return non null values.
> {code}
> scala> ds.write.parquet("testfile")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> scala> val ds2 = spark.read.parquet("testfile")
> ds2: org.apache.spark.sql.DataFrame = [value: boolean]
> scala> ds2.filter('value === "true").show
> +-+
> |value|
> +-+
> | true|
> |false|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18753) Inconsistent behavior after writing to parquet files

2016-12-06 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18753:
-
Description: 
Found an inconsistent behavior when using parquet.

{code}
scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
java.lang.Boolean, new java.lang.Boolean(false)).toDS
ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]

scala> ds.filter('value === "true").show
+-+
|value|
+-+
+-+

{code}

In the above example, `ds.filter('value === "true")` returns nothing as "true" 
will be converted to null and the filter expression will be always null.

However, if I store `ds` to a parquet file and read it back, `filter('value === 
"true")` will return non null values.

{code}
scala> ds.write.parquet("testfile")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.

scala> val ds2 = spark.read.parquet("testfile")
ds2: org.apache.spark.sql.DataFrame = [value: boolean]

scala> ds2.filter('value === "true").show
+-+
|value|
+-+
| true|
|false|
+-+

{code}

  was:
Found an inconsistent behavior when using parquet.

{code}
scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
java.lang.Boolean, new java.lang.Boolean(false)).toDS
ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]

scala> ds.filter('value === "true").show
+-+
|value|
+-+
+-+

{code}

In the above example, `ds.filter('value === "true")` returns nothing as "true" 
will be converted to null and the filter expression will always null.

However, if I store `ds` to a parquet file and read it back, `filter('value === 
"true")` will return non null values.

{code}
scala> ds.write.parquet("testfile")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.

scala> val ds2 = spark.read.parquet("testfile")
ds2: org.apache.spark.sql.DataFrame = [value: boolean]

scala> ds2.filter('value === "true").show
+-+
|value|
+-+
| true|
|false|
+-+

{code}


> Inconsistent behavior after writing to parquet files
> 
>
> Key: SPARK-18753
> URL: https://issues.apache.org/jira/browse/SPARK-18753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>
> Found an inconsistent behavior when using parquet.
> {code}
> scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
> java.lang.Boolean, new java.lang.Boolean(false)).toDS
> ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]
> scala> ds.filter('value === "true").show
> +-+
> |value|
> +-+
> +-+
> {code}
> In the above example, `ds.filter('value === "true")` returns nothing as 
> "true" will be converted to null and the filter expression will be always 
> null.
> However, if I store `ds` to a parquet file and read it back, `filter('value 
> === "true")` will return non null values.
> {code}
> scala> ds.write.parquet("testfile")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> scala> val ds2 = spark.read.parquet("testfile")
> ds2: org.apache.spark.sql.DataFrame = [value: boolean]
> scala> ds2.filter('value === "true").show
> +-+
> |value|
> +-+
> | true|
> |false|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18753) Inconsistent behavior after writing to parquet files

2016-12-06 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18753:
-
Description: 
Found an inconsistent behavior when using parquet.

{code}
scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
java.lang.Boolean, new java.lang.Boolean(false)).toDS
ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]

scala> ds.filter('value === "true").show
+-+
|value|
+-+
+-+

{code}

In the above example, `ds.filter('value === "true")` returns nothing as "true" 
will be converted to null and the filter expression will always null.

However, if I store `ds` to a parquet file and read it back, `filter('value === 
"true")` will return non null values.

{code}
scala> ds.write.parquet("testfile")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.

scala> val ds2 = spark.read.parquet("testfile")
ds2: org.apache.spark.sql.DataFrame = [value: boolean]

scala> ds2.filter('value === "true").show
+-+
|value|
+-+
| true|
|false|
+-+

{code}

  was:
Found an inconsistent behavior when using parquet.

{code}
scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
java.lang.Boolean, new java.lang.Boolean(false)).toDS
ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]

scala> ds.filter('value === "true").show
+-+
|value|
+-+
+-+

{code}

In the avoid example, `ds.filter('value === "true")` returns nothing as "true" 
will be converted to null and the filter expression will always null.

However, if I store `ds` to a parquet file and read it back, `filter('value === 
"true")` will return non null values.

{code}
scala> ds.write.parquet("testfile")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.

scala> val ds2 = spark.read.parquet("testfile")
ds2: org.apache.spark.sql.DataFrame = [value: boolean]

scala> ds2.filter('value === "true").show
+-+
|value|
+-+
| true|
|false|
+-+

{code}


> Inconsistent behavior after writing to parquet files
> 
>
> Key: SPARK-18753
> URL: https://issues.apache.org/jira/browse/SPARK-18753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>
> Found an inconsistent behavior when using parquet.
> {code}
> scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
> java.lang.Boolean, new java.lang.Boolean(false)).toDS
> ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]
> scala> ds.filter('value === "true").show
> +-+
> |value|
> +-+
> +-+
> {code}
> In the above example, `ds.filter('value === "true")` returns nothing as 
> "true" will be converted to null and the filter expression will always null.
> However, if I store `ds` to a parquet file and read it back, `filter('value 
> === "true")` will return non null values.
> {code}
> scala> ds.write.parquet("testfile")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> scala> val ds2 = spark.read.parquet("testfile")
> ds2: org.apache.spark.sql.DataFrame = [value: boolean]
> scala> ds2.filter('value === "true").show
> +-+
> |value|
> +-+
> | true|
> |false|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18753) Inconsistent behavior after writing to parquet files

2016-12-06 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18753:


 Summary: Inconsistent behavior after writing to parquet files
 Key: SPARK-18753
 URL: https://issues.apache.org/jira/browse/SPARK-18753
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2, 2.1.0
Reporter: Shixiong Zhu


Found an inconsistent behavior when using parquet.

{code}
scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
java.lang.Boolean, new java.lang.Boolean(false)).toDS
ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]

scala> ds.filter('value === "true").show
+-+
|value|
+-+
+-+

{code}

In the avoid example, `ds.filter('value === "true")` returns nothing as "true" 
will be converted to null and the filter expression will always null.

However, if I store `ds` to a parquet file and read it back, `filter('value === 
"true")` will return non null values.

{code}
scala> ds.write.parquet("testfile")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.

scala> val ds2 = spark.read.parquet("testfile")
ds2: org.apache.spark.sql.DataFrame = [value: boolean]

scala> ds2.filter('value === "true").show
+-+
|value|
+-+
| true|
|false|
+-+

{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18662) Move cluster managers into their own sub-directory

2016-12-06 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18662.

   Resolution: Fixed
 Assignee: Anirudh Ramanathan
Fix Version/s: 2.2.0

> Move cluster managers into their own sub-directory
> --
>
> Key: SPARK-18662
> URL: https://issues.apache.org/jira/browse/SPARK-18662
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Minor
> Fix For: 2.2.0
>
>
> As we move to support Kubernetes in addition to Yarn and Mesos 
> (https://issues.apache.org/jira/browse/SPARK-18278), we should move all the 
> cluster managers into a "resource-managers/" sub-directory. This is simply a 
> reorganization.
> Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17838) Strict type checking for arguments with a better messages across APIs.

2016-12-06 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reopened SPARK-17838:
--
  Assignee: (was: Hyukjin Kwon)

Re-open as per discussion in PR.

> Strict type checking for arguments with a better messages across APIs.
> --
>
> Key: SPARK-17838
> URL: https://issues.apache.org/jira/browse/SPARK-17838
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> It seems there should be more strict type checking for arguments in SparkR 
> APIs. This was discussed in several PRs. 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> Roughly it seems there are three cases as below:
> The first case below was described in 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> - Check for {{zero-length variable name}}
> Some of other cases below were handled in 
> https://github.com/apache/spark/pull/15231#discussion_r80417904
> - Catch the exception from JVM and format it as pretty
> - Check strictly types before calling JVM in SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Resolved] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used

2016-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18171.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15684
[https://github.com/apache/spark/pull/15684]

> Show correct framework address in mesos master web ui when the advertised 
> address is used
> -
>
> Key: SPARK-18171
> URL: https://issues.apache.org/jira/browse/SPARK-18171
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Shuai Lin
>Assignee: Shuai Lin
>Priority: Minor
> Fix For: 2.2.0
>
>
> In [[SPARK-4563]] we added support for the driver to advertise a 
> different hostname/ip ({{spark.driver.host}}) to the executors than the 
> hostname/ip the driver actually binds to ({{spark.driver.bindAddress}}). But 
> the frameworks page of the Mesos web UI still shows the driver's bind 
> hostname/ip (though the web UI link is correct). We should fix it to make 
> them consistent.
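
For reference, a minimal sketch of the two settings involved; the addresses below are 
made up:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("advertise-address-example")
  // what the driver actually binds to locally
  .config("spark.driver.bindAddress", "0.0.0.0")
  // what is advertised to executors (and what the Mesos UI should display)
  .config("spark.driver.host", "203.0.113.10")
  .getOrCreate()
{code}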



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used

2016-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18171:
--
Assignee: Shuai Lin

> Show correct framework address in mesos master web ui when the advertised 
> address is used
> -
>
> Key: SPARK-18171
> URL: https://issues.apache.org/jira/browse/SPARK-18171
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Shuai Lin
>Assignee: Shuai Lin
>Priority: Minor
> Fix For: 2.2.0
>
>
> In [[SPARK-4563]] we added support for the driver to advertise a 
> different hostname/ip ({{spark.driver.host}}) to the executors than the 
> hostname/ip the driver actually binds to ({{spark.driver.bindAddress}}). But 
> the frameworks page of the Mesos web UI still shows the driver's bind 
> hostname/ip (though the web UI link is correct). We should fix it to make 
> them consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user

2016-12-06 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-18752:
--

 Summary: "isSrcLocal" parameter to Hive loadTable / loadPartition 
should come from user
 Key: SPARK-18752
 URL: https://issues.apache.org/jira/browse/SPARK-18752
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Marcelo Vanzin
Priority: Minor


We ran into an issue with the HiveShim code that calls "loadTable" and 
"loadPartition" while testing with some recent changes in upstream Hive.

The semantics in Hive changed slightly, and if you provide the wrong value for 
"isSrcLocal" you now can end up with an invalid table: the Hive code will move 
the temp directory to the final destination instead of moving its children.

The problem in Spark is that HiveShim.scala tries to figure out the value of 
"isSrcLocal" based on where the source and target directories are; that's not 
correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA 
LOCAL" would set it to "true"). So we need to propagate that information from 
the user query down to HiveShim.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18741) Reuse/Explicitly clean-up SparkContext in Streaming tests

2016-12-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-18741.
-
Resolution: Not A Problem

> Reuse/Explicitly clean-up SparkContext in Streaming tests
> -
>
> Key: SPARK-18741
> URL: https://issues.apache.org/jira/browse/SPARK-18741
> Project: Spark
>  Issue Type: Bug
>Reporter: Herman van Hovell
>
> Tests in Spark Streaming currently create a SparkContext for each test, and 
> sometimes do not clean up afterwards. This is resource-intensive and can 
> lead to unneeded test failures (flakiness) when 
> {{spark.driver.allowMultipleContexts}} is disabled.
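
One way to address this, sketched below under the assumption that the suites use ScalaTest (the trait and names here are illustrative, not Spark's actual test helpers): create a single SparkContext per suite and stop it explicitly when the suite finishes.

{code}
// Illustrative sketch: one SparkContext per suite, stopped in afterAll so no
// second live context is ever needed.
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

trait SharedSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient protected var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName(suiteName))
  }

  override def afterAll(): Unit = {
    try {
      if (sc != null) sc.stop()   // explicit clean-up avoids leaking contexts
    } finally {
      super.afterAll()
    }
  }
}
{code}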






[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-06 Thread Alex Levenson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726986#comment-15726986
 ] 

Alex Levenson commented on SPARK-18728:
---

I think my comment above lists some concrete benefits. Algebird is a very light 
dependency, and if you see anything wrong with its (small) set of transitive 
dependencies I think we'd be open to figuring out how to fix those sorts of 
issues.

> Consider using Algebird's Aggregator instead of 
> org.apache.spark.sql.expressions.Aggregator
> ---
>
> Key: SPARK-18728
> URL: https://issues.apache.org/jira/browse/SPARK-18728
> Project: Spark
>  Issue Type: Improvement
>Reporter: Alex Levenson
>Priority: Minor
>
> Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in 
> spark's Aggregator here:
> "Based loosely on Aggregator from Algebird: 
> https://github.com/twitter/algebird;
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46
> Which got a few of us wondering, given that this API is still experimental, 
> would you consider using algebird's Aggregator API directly instead?
> The algebird API is not coupled with any implementation details, and 
> shouldn't have any extra dependencies.
> Are there any blockers to doing that?
> Thanks!
> Alex
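
For context, a minimal sketch (assuming algebird-core on the classpath; not part of the ticket) of the Aggregator shape being proposed: just prepare / semigroup / present, with no engine-specific coupling.

{code}
// Hypothetical example: an "average" aggregator written against Algebird's
// Aggregator contract.
import com.twitter.algebird.{Aggregator, Semigroup}

val avg: Aggregator[Double, (Double, Long), Double] =
  new Aggregator[Double, (Double, Long), Double] {
    // map each input to a (sum, count) pair
    def prepare(x: Double): (Double, Long) = (x, 1L)
    // combine intermediate (sum, count) pairs associatively
    val semigroup: Semigroup[(Double, Long)] =
      Semigroup.from[(Double, Long)] { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    // turn the final reduction into the result
    def present(r: (Double, Long)): Double = r._1 / r._2
  }
{code}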






[jira] [Assigned] (SPARK-18751) Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18751:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext
> 
>
> Key: SPARK-18751
> URL: https://issues.apache.org/jira/browse/SPARK-18751
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> When SparkContext.stop is called in Utils.tryOrStopSparkContext (the 
> following three places), it will cause a deadlock, because the stop method 
> needs to wait for the very thread that is running stop to exit.
> - ContextCleaner.keepCleaning
> - LiveListenerBus.listenerThread.run
> - TaskSchedulerImpl.start
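
Illustrative only (not Spark's implementation): a minimal sketch of the self-join pattern behind this deadlock, and the usual guard against it.

{code}
// stop() joins a worker thread, so calling stop() from inside that same thread
// would block forever; checking Thread.currentThread avoids the self-join.
class Service {
  @volatile private var running = true

  private val worker: Thread = new Thread(new Runnable {
    override def run(): Unit = {
      while (running) {
        Thread.sleep(100) // stand-in for periodic work that may end up calling stop()
      }
    }
  })

  def start(): Unit = worker.start()

  def stop(): Unit = {
    running = false
    // An unconditional worker.join() deadlocks when stop() is invoked from
    // inside `worker` itself, which is the situation described in this ticket.
    if (Thread.currentThread() ne worker) {
      worker.join()
    }
  }
}
{code}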






[jira] [Commented] (SPARK-18751) Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext

2016-12-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726963#comment-15726963
 ] 

Apache Spark commented on SPARK-18751:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16178

> Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext
> 
>
> Key: SPARK-18751
> URL: https://issues.apache.org/jira/browse/SPARK-18751
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> When SparkContext.stop is called in Utils.tryOrStopSparkContext (the 
> following three places), it will cause a deadlock, because the stop method 
> needs to wait for the very thread that is running stop to exit.
> - ContextCleaner.keepCleaning
> - LiveListenerBus.listenerThread.run
> - TaskSchedulerImpl.start






[jira] [Assigned] (SPARK-18751) Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext

2016-12-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18751:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext
> 
>
> Key: SPARK-18751
> URL: https://issues.apache.org/jira/browse/SPARK-18751
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> When SparkContext.stop is called in Utils.tryOrStopSparkContext (the 
> following three places), it will cause a deadlock, because the stop method 
> needs to wait for the very thread that is running stop to exit.
> - ContextCleaner.keepCleaning
> - LiveListenerBus.listenerThread.run
> - TaskSchedulerImpl.start





