[jira] [Commented] (SPARK-18609) [SQL] column mixup with CROSS JOIN
[ https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728042#comment-15728042 ] Song Jun commented on SPARK-18609: -- I'm working on this~ > [SQL] column mixup with CROSS JOIN > -- > > Key: SPARK-18609 > URL: https://issues.apache.org/jira/browse/SPARK-18609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Furcy Pin > > Reproduced on spark-sql v2.0.2 and on branch master. > {code} > DROP TABLE IF EXISTS p1 ; > DROP TABLE IF EXISTS p2 ; > CREATE TABLE p1 (col TIMESTAMP) ; > CREATE TABLE p2 (col TIMESTAMP) ; > set spark.sql.crossJoin.enabled = true; > -- EXPLAIN > WITH CTE AS ( > SELECT > s2.col as col > FROM p1 > CROSS JOIN ( > SELECT > e.col as col > FROM p2 E > ) s2 > ) > SELECT > T1.col as c1, > T2.col as c2 > FROM CTE T1 > CROSS JOIN CTE T2 > ; > {code} > This returns the following stacktrace : > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: col#21 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at >
[jira] [Resolved] (SPARK-18763) What algorithm is used in spark decision tree (is ID3, C4.5 or CART)?
[ https://issues.apache.org/jira/browse/SPARK-18763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18763. --- Resolution: Invalid Fix Version/s: (was: 1.6.0) > What algorithm is used in spark decision tree (is ID3, C4.5 or CART)? > - > > Key: SPARK-18763 > URL: https://issues.apache.org/jira/browse/SPARK-18763 > Project: Spark > Issue Type: Question > Components: MLlib >Affects Versions: 1.6.0 >Reporter: lklong >Priority: Minor > Labels: beginner > > Hi Spark team, I have a question about the decision tree in MLlib: which > algorithm does the Spark decision tree use (ID3, C4.5, or CART)? > Please help. Thanks very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18763) What algorithm is used in spark decision tree (is ID3, C4.5 or CART)?
[ https://issues.apache.org/jira/browse/SPARK-18763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728037#comment-15728037 ] Sean Owen commented on SPARK-18763: --- Please ask questions on the mailing list. > What algorithm is used in spark decision tree (is ID3, C4.5 or CART)? > - > > Key: SPARK-18763 > URL: https://issues.apache.org/jira/browse/SPARK-18763 > Project: Spark > Issue Type: Question > Components: MLlib >Affects Versions: 1.6.0 >Reporter: lklong >Priority: Minor > Labels: beginner > Fix For: 1.6.0 > > > Hi Spark team, I have a question about the decision tree in MLlib: which > algorithm does the Spark decision tree use (ID3, C4.5, or CART)? > Please help. Thanks very much!
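For context on the question above: Spark's MLlib documentation describes its decision tree as a greedy algorithm that performs recursive binary partitioning, which is closest in spirit to CART (it does not use ID3/C4.5-style multiway splits). A minimal, self-contained sketch of CART-style split selection with Gini impurity, illustrative only and not Spark's actual implementation:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Best binary split (threshold, impurity gain) on a single numeric feature,
    in the greedy CART style: try each candidate threshold, keep the one with
    the largest weighted impurity decrease."""
    parent = gini(ys)
    n = len(ys)
    best = (None, 0.0)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # degenerate split, skip
        weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best
```

On perfectly separable data such as `xs = [1, 2, 3, 4]`, `ys = [0, 0, 1, 1]`, the split at threshold 2 recovers the full parent impurity of 0.5 as gain.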
[jira] [Assigned] (SPARK-18764) Add a warning log when skipping a corrupted file
[ https://issues.apache.org/jira/browse/SPARK-18764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18764: Assignee: Apache Spark (was: Shixiong Zhu) > Add a warning log when skipping a corrupted file > > > Key: SPARK-18764 > URL: https://issues.apache.org/jira/browse/SPARK-18764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18764) Add a warning log when skipping a corrupted file
[ https://issues.apache.org/jira/browse/SPARK-18764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18764: Assignee: Shixiong Zhu (was: Apache Spark) > Add a warning log when skipping a corrupted file > > > Key: SPARK-18764 > URL: https://issues.apache.org/jira/browse/SPARK-18764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18764) Add a warning log when skipping a corrupted file
[ https://issues.apache.org/jira/browse/SPARK-18764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728021#comment-15728021 ] Apache Spark commented on SPARK-18764: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16192 > Add a warning log when skipping a corrupted file > > > Key: SPARK-18764 > URL: https://issues.apache.org/jira/browse/SPARK-18764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18764) Add a warning log when skipping a corrupted file
[ https://issues.apache.org/jira/browse/SPARK-18764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18764: - Affects Version/s: 2.1.0 > Add a warning log when skipping a corrupted file > > > Key: SPARK-18764 > URL: https://issues.apache.org/jira/browse/SPARK-18764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18764) Add a warning log when skipping a corrupted file
Shixiong Zhu created SPARK-18764: Summary: Add a warning log when skipping a corrupted file Key: SPARK-18764 URL: https://issues.apache.org/jira/browse/SPARK-18764 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
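The improvement requested in SPARK-18764 can be pictured with a small sketch: iterate over input files, and when one fails to parse, emit a warning log instead of skipping it silently. This is an illustrative Python helper with invented names, not Spark's actual file-scan code:

```python
import logging

logger = logging.getLogger("corrupt-file-reader")

def read_files(paths, parse, ignore_corrupt_files=True):
    """Yield one parsed record per file, skipping unreadable files.

    When a file fails to parse and ignore_corrupt_files is set, log a
    warning (the behavior this ticket asks for) rather than dropping the
    file with no trace in the logs.
    """
    for path in paths:
        try:
            yield parse(path)
        except Exception as exc:
            if not ignore_corrupt_files:
                raise
            logger.warning("Skipped corrupted file: %s (%s)", path, exc)
```

A caller sees only the successfully parsed records, with a warning line in the log for each skipped file.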
[jira] [Created] (SPARK-18763) What algorithm is used in spark decision tree (is ID3, C4.5 or CART)?
lklong created SPARK-18763: -- Summary: What algorithm is used in spark decision tree (is ID3, C4.5 or CART)? Key: SPARK-18763 URL: https://issues.apache.org/jira/browse/SPARK-18763 Project: Spark Issue Type: Question Components: MLlib Affects Versions: 1.6.0 Reporter: lklong Priority: Minor Fix For: 1.6.0 Hi Spark team, I have a question about the decision tree in MLlib: which algorithm does the Spark decision tree use (ID3, C4.5, or CART)? Please help. Thanks very much!
[jira] [Closed] (SPARK-18759) when use spark streaming with sparksql, lots of temp directories are created.
[ https://issues.apache.org/jira/browse/SPARK-18759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-18759. --- Resolution: Duplicate Duplicate of SPARK-18703 > when use spark streaming with sparksql, lots of temp directories are created. > - > > Key: SPARK-18759 > URL: https://issues.apache.org/jira/browse/SPARK-18759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Albert Cheng > > When using Spark Streaming with Spark SQL to insert records into an existing Hive > table, lots of temp directories are created. Those directories are > deleted only when the JVM exits, but a Spark SQL plus Spark Streaming job > typically runs 24/7, so they accumulate.
[jira] [Assigned] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18762: Assignee: Apache Spark > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this introduces several broken links in the UI. For > example, in the master UI, the worker link is https:8081 instead of http:8081 > or https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727990#comment-15727990 ] Apache Spark commented on SPARK-18762: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/16190 > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this introduces several broken links in the UI. For > example, in the master UI, the worker link is https:8081 instead of http:8081 > or https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18762: Assignee: (was: Apache Spark) > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this introduces several broken links in the UI. For > example, in the master UI, the worker link is https:8081 instead of http:8081 > or https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727986#comment-15727986 ] Kousuke Saruta commented on SPARK-18762: Yeah of course. > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this introduces several broken links in the UI. For > example, in the master UI, the worker link is https:8081 instead of http:8081 > or https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources
[ https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727983#comment-15727983 ] Apache Spark commented on SPARK-18761: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/16190 > Uncancellable / unkillable tasks may starve jobs of resources > > > Key: SPARK-18761 > URL: https://issues.apache.org/jira/browse/SPARK-18761 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark's current task cancellation / task killing mechanism is "best effort" > in the sense that some tasks may not be interruptible and may not respond to > their "killed" flags being set. If a significant fraction of a cluster's task > slots are occupied by tasks that have been marked as killed but remain > running then this can lead to a situation where new jobs and tasks are > starved of resources because zombie tasks are holding resources. > I propose to address this problem by introducing a "task reaper" mechanism in > executors to monitor tasks after they are marked for killing in order to > periodically re-attempt the task kill, capture and log stacktraces / warnings > if tasks do not exit in a timely manner, and, optionally, kill the entire > executor JVM if cancelled tasks cannot be killed within some timeout.
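The "task reaper" proposal above (periodic re-kill attempts, stack-trace warnings, optional escalation after a timeout) can be sketched as a watchdog loop. The class below is a toy model with invented names, not Spark's implementation; Python cannot force-kill a thread, so escalation is only recorded, where the real proposal would optionally kill the executor JVM:

```python
import sys
import threading  # used by the usage example below
import time
import traceback

class TaskReaper:
    """Toy watchdog for a task that has been marked killed: poll it,
    capture a stack trace while it is still running, and escalate once a
    timeout expires."""

    def __init__(self, poll_interval=0.05, timeout=0.5):
        self.poll_interval = poll_interval
        self.timeout = timeout
        self.escalated = False   # would mean "kill the executor JVM"
        self.warnings = 0        # count of "task still running" warnings

    def reap(self, task_thread):
        deadline = time.monotonic() + self.timeout
        while task_thread.is_alive():
            if time.monotonic() >= deadline:
                self.escalated = True
                return
            # Capture the zombie task's stack trace, as the proposal suggests.
            frame = sys._current_frames().get(task_thread.ident)
            if frame is not None:
                traceback.format_stack(frame)  # a real reaper would log this
            self.warnings += 1
            time.sleep(self.poll_interval)
```

A cooperative task that exits before the timeout produces only warnings; a task that outlives the timeout triggers escalation.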
[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727981#comment-15727981 ] Xiangrui Meng commented on SPARK-18762: --- Thanks! Please make sure spark history server still works when ssl is enabled. > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this introduces several broken links in the UI. For > example, in the master UI, the worker link is https:8081 instead of http:8081 > or https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
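The expected behavior in SPARK-18762 comes down to a few lines: advertise http on the base port, and https only on the dedicated https port; advertising https on the http port is the bug. The +400 offset below is inferred from the ticket's 8081 vs 8481 example and is an assumption, as is the helper itself (not Spark's code):

```python
def ui_url(host, http_port, ssl_enabled, https_port_offset=400):
    """Return the Web UI URL a shell banner or UI link should advertise.

    Assumption for this sketch: the https listener runs on the http port
    plus a fixed offset (8081 -> 8481 in the ticket suggests +400). The
    reported bug is equivalent to returning the https scheme with the
    plain http_port.
    """
    if ssl_enabled:
        return f"https://{host}:{http_port + https_port_offset}"
    return f"http://{host}:{http_port}"
```

With this rule, the shell banner stays `http://...:4040` when SSL is off, and a worker link under SSL points at the https port rather than `https:8081`.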
[jira] [Commented] (SPARK-18756) Memory leak in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-18756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727975#comment-15727975 ] Liang-Chi Hsieh commented on SPARK-18756: - As we already upgrade to 4.0.42.Final, this should not be a problem now. > Memory leak in Spark streaming > -- > > Key: SPARK-18756 > URL: https://issues.apache.org/jira/browse/SPARK-18756 > Project: Spark > Issue Type: Bug > Components: Block Manager, DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Udit Mehrotra > > We have a Spark streaming application, that processes data from Kinesis. > In our application we are observing a memory leak at the Executors with Netty > buffers not being released properly, when the Spark BlockManager tries to > replicate the input blocks received from Kinesis stream. The leak occurs, > when we set Storage Level as MEMORY_AND_DISK_2 for the Kinesis input blocks. > However, if we change the Storage level to use MEMORY_AND_DISK, which avoids > creating a replica, we do not observe the leak any more. We were able to > detect the leak, and obtain the stack trace by running the executors with an > additional JVM option: -Dio.netty.leakDetectionLevel=advanced. > Here is the stack trace of the leak: > 16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not > called before it's garbage-collected. See > http://netty.io/wiki/reference-counted-objects.html for more information. 
> Recent access records: 0 > Created at: > io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:103) > io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335) > io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247) > > org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69) > > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182) > > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997) > > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > > org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:702) > > org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80) > > org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158) > > org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129) > org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133) > > org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282) > > org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352) > > org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297) > > org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269) > > org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110) > We also observe a continuous increase in off heap memory usage at the > executors. Any help would be appreciated. 
[jira] [Commented] (SPARK-18756) Memory leak in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-18756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727972#comment-15727972 ] Liang-Chi Hsieh commented on SPARK-18756: - I believe this bug is fixed by https://github.com/netty/netty/pull/5605 which is included in 4.0.41.Final. > Memory leak in Spark streaming > -- > > Key: SPARK-18756 > URL: https://issues.apache.org/jira/browse/SPARK-18756 > Project: Spark > Issue Type: Bug > Components: Block Manager, DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Udit Mehrotra > > We have a Spark streaming application, that processes data from Kinesis. > In our application we are observing a memory leak at the Executors with Netty > buffers not being released properly, when the Spark BlockManager tries to > replicate the input blocks received from Kinesis stream. The leak occurs, > when we set Storage Level as MEMORY_AND_DISK_2 for the Kinesis input blocks. > However, if we change the Storage level to use MEMORY_AND_DISK, which avoids > creating a replica, we do not observe the leak any more. We were able to > detect the leak, and obtain the stack trace by running the executors with an > additional JVM option: -Dio.netty.leakDetectionLevel=advanced. > Here is the stack trace of the leak: > 16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not > called before it's garbage-collected. See > http://netty.io/wiki/reference-counted-objects.html for more information. 
> Recent access records: 0 > Created at: > io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:103) > io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335) > io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247) > > org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69) > > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182) > > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997) > > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > > org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:702) > > org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80) > > org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158) > > org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129) > org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133) > > org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282) > > org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352) > > org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297) > > org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269) > > org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110) > We also observe a continuous increase in off heap memory usage at the > executors. Any help would be appreciated. 
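The leak report above comes down to Netty's reference-counting contract: every retained ByteBuf must eventually have release() called, and the replication path apparently never releases its copy, so the buffer's count never reaches zero and the backing memory is never returned. A toy Python model of that contract (invented names, not Netty's API) together with the leak-free retain/try/finally pattern:

```python
class RefCountedBuf:
    """Toy model of a reference-counted buffer: freed only when the count
    drops to zero, like Netty's ByteBuf."""

    def __init__(self):
        self.ref_cnt = 1
        self.freed = False

    def retain(self):
        self.ref_cnt += 1
        return self

    def release(self):
        if self.ref_cnt == 0:
            raise RuntimeError("refCnt already 0")
        self.ref_cnt -= 1
        if self.ref_cnt == 0:
            self.freed = True  # backing memory returned to the pool
        return self.freed

def replicate(buf, send):
    """Leak-free replication pattern: retain for the duration of the
    transfer, and release in a finally block even if the send fails.
    Dropping the release() here is the shape of the leak in this ticket."""
    buf.retain()
    try:
        send(buf)
    finally:
        buf.release()
```

After a successful replicate the count is back where it started, so the final owner's release actually frees the buffer.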
[jira] [Issue Comment Deleted] (SPARK-18713) using SparkR build step wise regression model (glm)
[ https://issues.apache.org/jira/browse/SPARK-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasann modi updated SPARK-18713: - Comment: was deleted (was: Can u add step wise regression function into upcoming Spark version.) > using SparkR build step wise regression model (glm) > --- > > Key: SPARK-18713 > URL: https://issues.apache.org/jira/browse/SPARK-18713 > Project: Spark > Issue Type: Bug >Reporter: Prasann modi > > In R to build Step wise regression model > step(glm(formula,data,family),direction = "forward")) > function is there. How to build stepwise regression model using SparkR.. > I am using SPARK 2.0.0 and R 3.3.1.. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-18713) using SparkR build step wise regression model (glm)
[ https://issues.apache.org/jira/browse/SPARK-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasann modi reopened SPARK-18713: -- Can you add a stepwise regression function to an upcoming Spark version? > using SparkR build step wise regression model (glm) > --- > > Key: SPARK-18713 > URL: https://issues.apache.org/jira/browse/SPARK-18713 > Project: Spark > Issue Type: Bug >Reporter: Prasann modi > > In R, a stepwise regression model can be built with > step(glm(formula, data, family), direction = "forward"). > How can a stepwise regression model be built using SparkR? > I am using Spark 2.0.0 and R 3.3.1.
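SparkR's glm does not expose R's step(..., direction = "forward"), but the underlying idea is simple to sketch: greedily add the candidate feature that most reduces the residual sum of squares, and stop when no remaining feature helps. The plain-Python sketch below (no Spark; all function names are invented for the example) illustrates that loop with ordinary least squares solved by Gaussian elimination:

```python
def ols_fit(X, y):
    """Least-squares coefficients via the normal equations, solved with
    Gaussian elimination and partial pivoting. X is a list of rows."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

def rss(X, y, beta):
    """Residual sum of squares of a fitted model."""
    return sum((yi - sum(bi * xi for bi, xi in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def forward_stepwise(features, y, min_improvement=1e-6):
    """Forward selection: repeatedly add the feature (given as columns)
    whose inclusion most lowers the RSS; stop when nothing helps."""
    chosen, current = [], float("inf")
    remaining = list(range(len(features)))
    while remaining:
        scores = []
        for j in remaining:
            cols = chosen + [j]
            X = [[features[c][i] for c in cols] for i in range(len(y))]
            scores.append((rss(X, y, ols_fit(X, y)), j))
        best_rss, best_j = min(scores)
        if current - best_rss < min_improvement:
            break
        chosen.append(best_j)
        remaining.remove(best_j)
        current = best_rss
    return chosen
```

A real stepwise-glm implementation would use an information criterion such as AIC rather than a raw RSS threshold; the greedy structure of the loop is the same.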
[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727963#comment-15727963 ] Kousuke Saruta commented on SPARK-18762: [~mengxr] Ah... O.K, I'll submit a PR to revert it. > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this introduces several broken links in the UI. For > example, in the master UI, the worker link is https:8081 instead of http:8081 > or https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18759) when use spark streaming with sparksql, lots of temp directories are created.
[ https://issues.apache.org/jira/browse/SPARK-18759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727958#comment-15727958 ] Albert Cheng commented on SPARK-18759: -- [~viirya] is right, this issue is a duplicate of SPARK-18703. Please close this issue. > when use spark streaming with sparksql, lots of temp directories are created. > - > > Key: SPARK-18759 > URL: https://issues.apache.org/jira/browse/SPARK-18759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Albert Cheng > > When using Spark Streaming with Spark SQL to insert records into an existing Hive > table, lots of temp directories are created. Those directories are > deleted only when the JVM exits, but a Spark SQL plus Spark Streaming job > typically runs 24/7, so they accumulate.
[jira] [Commented] (SPARK-18713) using SparkR build step wise regression model (glm)
[ https://issues.apache.org/jira/browse/SPARK-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727956#comment-15727956 ] Prasann modi commented on SPARK-18713: -- Can you add a stepwise regression function to an upcoming Spark version? > using SparkR build step wise regression model (glm) > --- > > Key: SPARK-18713 > URL: https://issues.apache.org/jira/browse/SPARK-18713 > Project: Spark > Issue Type: Bug >Reporter: Prasann modi > > In R, a stepwise regression model can be built with > step(glm(formula, data, family), direction = "forward"). > How can a stepwise regression model be built using SparkR? > I am using Spark 2.0.0 and R 3.3.1.
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Description: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. More importantly, this introduces several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481. was: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. More importantly, this cause several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481. > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this introduces several broken links in the UI. For > example, in the master UI, the worker link is https:8081 instead of http:8081 > or https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727929#comment-15727929 ] Xiangrui Meng edited comment on SPARK-18762 at 12/7/16 6:56 AM: cc [~hayashidac] [~sarutak] [~lian cheng] was (Author: mengxr): cc [~hayashidac] [~sarutak] > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this cause several broken links in the UI. For example, in > the master UI, the worker link is https:8081 instead of http:8081 or > https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727929#comment-15727929 ] Xiangrui Meng commented on SPARK-18762: --- cc [~hayashidac] [~sarutak] > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this cause several broken links in the UI. For example, in > the master UI, the worker link is https:8081 instead of http:8081 or > https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Description: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. More importantly, this cause several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481. was: When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port. > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port. > More importantly, this cause several broken links in the UI. For example, in > the master UI, the worker link is https:8081 instead of http:8081 or > https:8481. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Priority: Blocker (was: Critical) > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Blocker > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port.
[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040
[ https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18762: -- Priority: Critical (was: Major) > Web UI should be http:4040 instead of https:4040 > > > Key: SPARK-18762 > URL: https://issues.apache.org/jira/browse/SPARK-18762 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Web UI >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Critical > > When SSL is enabled, the Spark shell shows: > {code} > Spark context Web UI available at https://192.168.99.1:4040 > {code} > This is wrong because 4040 is http, not https. It redirects to the https port.
[jira] [Created] (SPARK-18762) Web UI should be http:4040 instead of https:4040
Xiangrui Meng created SPARK-18762: - Summary: Web UI should be http:4040 instead of https:4040 Key: SPARK-18762 URL: https://issues.apache.org/jira/browse/SPARK-18762 Project: Spark Issue Type: Bug Components: Spark Shell, Web UI Affects Versions: 2.1.0 Reporter: Xiangrui Meng When SSL is enabled, the Spark shell shows: {code} Spark context Web UI available at https://192.168.99.1:4040 {code} This is wrong because 4040 is http, not https. It redirects to the https port.
[jira] [Commented] (SPARK-18759) when use spark streaming with sparksql, lots of temp directories are created.
[ https://issues.apache.org/jira/browse/SPARK-18759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727899#comment-15727899 ] Liang-Chi Hsieh commented on SPARK-18759: - I think this is a duplicate of SPARK-18703. > when use spark streaming with sparksql, lots of temp directories are created. > - > > Key: SPARK-18759 > URL: https://issues.apache.org/jira/browse/SPARK-18759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Albert Cheng > > When using Spark Streaming with Spark SQL to insert records into an existing > Hive table, lots of temp directories are created. Those directories are > deleted only when the JVM exits, but a JVM running Spark SQL with Spark > Streaming typically works 7*24 hours.
[jira] [Assigned] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources
[ https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18761: Assignee: Apache Spark (was: Josh Rosen) > Uncancellable / unkillable tasks may starve jobs of resoures > > > Key: SPARK-18761 > URL: https://issues.apache.org/jira/browse/SPARK-18761 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Apache Spark > > Spark's current task cancellation / task killing mechanism is "best effort" > in the sense that some tasks may not be interruptible and may not respond to > their "killed" flags being set. If a significant fraction of a cluster's task > slots are occupied by tasks that have been marked as killed but remain > running then this can lead to a situation where new jobs and tasks are > starved of resources because zombie tasks are holding resources. > I propose to address this problem by introducing a "task reaper" mechanism in > executors to monitor tasks after they are marked for killing in order to > periodically re-attempt the task kill, capture and log stacktraces / warnings > if tasks do not exit in a timely manner, and, optionally, kill the entire > executor JVM if cancelled tasks cannot be killed within some timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources
[ https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18761: Assignee: Josh Rosen (was: Apache Spark) > Uncancellable / unkillable tasks may starve jobs of resoures > > > Key: SPARK-18761 > URL: https://issues.apache.org/jira/browse/SPARK-18761 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark's current task cancellation / task killing mechanism is "best effort" > in the sense that some tasks may not be interruptible and may not respond to > their "killed" flags being set. If a significant fraction of a cluster's task > slots are occupied by tasks that have been marked as killed but remain > running then this can lead to a situation where new jobs and tasks are > starved of resources because zombie tasks are holding resources. > I propose to address this problem by introducing a "task reaper" mechanism in > executors to monitor tasks after they are marked for killing in order to > periodically re-attempt the task kill, capture and log stacktraces / warnings > if tasks do not exit in a timely manner, and, optionally, kill the entire > executor JVM if cancelled tasks cannot be killed within some timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources
[ https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727896#comment-15727896 ] Apache Spark commented on SPARK-18761: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/16189 > Uncancellable / unkillable tasks may starve jobs of resoures > > > Key: SPARK-18761 > URL: https://issues.apache.org/jira/browse/SPARK-18761 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark's current task cancellation / task killing mechanism is "best effort" > in the sense that some tasks may not be interruptible and may not respond to > their "killed" flags being set. If a significant fraction of a cluster's task > slots are occupied by tasks that have been marked as killed but remain > running then this can lead to a situation where new jobs and tasks are > starved of resources because zombie tasks are holding resources. > I propose to address this problem by introducing a "task reaper" mechanism in > executors to monitor tasks after they are marked for killing in order to > periodically re-attempt the task kill, capture and log stacktraces / warnings > if tasks do not exit in a timely manner, and, optionally, kill the entire > executor JVM if cancelled tasks cannot be killed within some timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources
Josh Rosen created SPARK-18761: -- Summary: Uncancellable / unkillable tasks may starve jobs of resources Key: SPARK-18761 URL: https://issues.apache.org/jira/browse/SPARK-18761 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Spark's current task cancellation / task killing mechanism is "best effort" in the sense that some tasks may not be interruptible and may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running then this can lead to a situation where new jobs and tasks are starved of resources because zombie tasks are holding resources. I propose to address this problem by introducing a "task reaper" mechanism in executors to monitor tasks after they are marked for killing in order to periodically re-attempt the task kill, capture and log stacktraces / warnings if tasks do not exit in a timely manner, and, optionally, kill the entire executor JVM if cancelled tasks cannot be killed within some timeout.
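The "task reaper" proposal above can be sketched as a monitor thread per killed task. The sketch below is illustrative Python, not Spark's Scala implementation; the class and callback names are invented:

```python
# Illustrative sketch of a task reaper: after a task is marked killed, a
# monitor thread repeatedly re-sets the kill flag, records a warning while
# the task lingers, and escalates (a callback standing in for killing the
# executor JVM) if the task does not exit within a timeout.
import threading
import time

class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.killed = threading.Event()    # the "killed" flag the task may ignore
        self.finished = threading.Event()  # set when the task actually exits

class TaskReaper(threading.Thread):
    def __init__(self, task, poll_interval, timeout, on_timeout, warnings):
        super().__init__(daemon=True)
        self.task, self.poll_interval, self.timeout = task, poll_interval, timeout
        self.on_timeout, self.warnings = on_timeout, warnings

    def run(self):
        deadline = time.monotonic() + self.timeout
        while not self.task.finished.is_set():
            self.task.killed.set()              # periodically re-attempt the kill
            if time.monotonic() >= deadline:
                self.on_timeout(self.task)      # e.g. kill the whole executor JVM
                return
            self.warnings.append(f"task {self.task.task_id} ignoring kill flag")
            time.sleep(self.poll_interval)
```

A cooperative task that checks its kill flag exits quickly and the reaper stops; an uninterruptible task eventually triggers the timeout escalation.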
[jira] [Comment Edited] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
[ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727763#comment-15727763 ] Dongjoon Hyun edited comment on SPARK-18709 at 12/7/16 5:32 AM: Hi, [~srowen] The type verification was introduced by https://issues.apache.org/jira/browse/SPARK-14945 when `session.py` is created in 2.0.0. was (Author: dongjoon): @srowen . The type verification was introduced by https://issues.apache.org/jira/browse/SPARK-14945 when `session.py` is created in 2.0.0. > Automatic null conversion bug (instead of throwing error) when creating a > Spark Datarame with incompatible types for fields. > > > Key: SPARK-18709 > URL: https://issues.apache.org/jira/browse/SPARK-18709 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 1.6.3 >Reporter: Amogh Param > Labels: bug > Fix For: 2.0.0 > > > When converting an RDD with a `float` type field to a spark dataframe with an > `IntegerType` / `LongType` schema field, spark 1.6.2 and 1.6.3 silently > convert the field values to nulls instead of throwing an error like `LongType > can not accept object ___ in type `. However, this seems to be > fixed in Spark 2.0.2. 
> The following example should make the problem clear: > {code} > from pyspark.sql.types import StructField, StructType, LongType, DoubleType > schema = StructType([ > StructField("0", LongType(), True), > StructField("1", DoubleType(), True), > ]) > data = [[1.0, 1.0], [nan, 2.0]] > spark_df = sqlContext.createDataFrame(sc.parallelize(data), schema) > spark_df.show() > {code} > Instead of throwing an error like: > {code} > LongType can not accept object 1.0 in type > {code} > Spark converts all the values in the first column to nulls > Running `spark_df.show()` gives: > {code} > ++---+ > | 0| 1| > ++---+ > |null|1.0| > |null|1.0| > ++---+ > {code} > For the purposes of my computation, I'm doing a `mapPartitions` on a spark > data frame, and for each partition, converting it into a pandas data frame, > doing a few computations on this pandas dataframe and the return value will > be a list of lists, which is converted to an RDD while being returned from > 'mapPartitions' (for all partitions). This RDD is then converted into a spark > dataframe similar to the example above, using > `sqlContext.createDataFrame(rdd, schema)`. The rdd has a column that should > be converted to a `LongType` in the spark data frame, but since it has > missing values, it is a `float` type. When spark tries to create the data > frame, it converts all the values in that column to nulls instead of throwing > an error that there is a type mismatch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
[ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727763#comment-15727763 ] Dongjoon Hyun commented on SPARK-18709: --- @srowen . The type verification was introduced by https://issues.apache.org/jira/browse/SPARK-14945 when `session.py` is created in 2.0.0. > Automatic null conversion bug (instead of throwing error) when creating a > Spark Datarame with incompatible types for fields. > > > Key: SPARK-18709 > URL: https://issues.apache.org/jira/browse/SPARK-18709 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 1.6.3 >Reporter: Amogh Param > Labels: bug > Fix For: 2.0.0 > > > When converting an RDD with a `float` type field to a spark dataframe with an > `IntegerType` / `LongType` schema field, spark 1.6.2 and 1.6.3 silently > convert the field values to nulls instead of throwing an error like `LongType > can not accept object ___ in type `. However, this seems to be > fixed in Spark 2.0.2. 
> The following example should make the problem clear: > {code} > from pyspark.sql.types import StructField, StructType, LongType, DoubleType > schema = StructType([ > StructField("0", LongType(), True), > StructField("1", DoubleType(), True), > ]) > data = [[1.0, 1.0], [nan, 2.0]] > spark_df = sqlContext.createDataFrame(sc.parallelize(data), schema) > spark_df.show() > {code} > Instead of throwing an error like: > {code} > LongType can not accept object 1.0 in type > {code} > Spark converts all the values in the first column to nulls > Running `spark_df.show()` gives: > {code} > ++---+ > | 0| 1| > ++---+ > |null|1.0| > |null|1.0| > ++---+ > {code} > For the purposes of my computation, I'm doing a `mapPartitions` on a spark > data frame, and for each partition, converting it into a pandas data frame, > doing a few computations on this pandas dataframe and the return value will > be a list of lists, which is converted to an RDD while being returned from > 'mapPartitions' (for all partitions). This RDD is then converted into a spark > dataframe similar to the example above, using > `sqlContext.createDataFrame(rdd, schema)`. The rdd has a column that should > be converted to a `LongType` in the spark data frame, but since it has > missing values, it is a `float` type. When spark tries to create the data > frame, it converts all the values in that column to nulls instead of throwing > an error that there is a type mismatch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
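For contrast with the silent nulls described above, here is a minimal sketch of the kind of per-field verification SPARK-14945 introduced; the names and structure are illustrative, not Spark's internals. A `LongType` field rejects a Python float with an error instead of quietly producing null:

```python
# Hypothetical sketch of strict schema verification: each value is checked
# against the Python types its schema field may accept, and a mismatch
# raises instead of being coerced to null (the 1.6.x behavior reported).

ACCEPTED = {              # field type -> Python types it may accept
    "LongType": (int,),
    "DoubleType": (int, float),
}

def verify_row(row, schema):
    """schema: list of (name, field_type, nullable) triples."""
    for value, (name, ftype, nullable) in zip(row, schema):
        if value is None:
            if not nullable:
                raise ValueError(f"field {name}: null in non-nullable field")
            continue
        if not isinstance(value, ACCEPTED[ftype]):
            raise TypeError(
                f"{ftype} can not accept object {value!r} in type {type(value)}")
```

With the schema from the report, `verify_row([1.0, 2.0], schema)` raises the "LongType can not accept object 1.0" error instead of silently nulling the column.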
[jira] [Assigned] (SPARK-18760) Provide consistent format output for all file formats
[ https://issues.apache.org/jira/browse/SPARK-18760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18760: Assignee: Reynold Xin (was: Apache Spark) > Provide consistent format output for all file formats > - > > Key: SPARK-18760 > URL: https://issues.apache.org/jira/browse/SPARK-18760 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > We currently rely on FileFormat implementations to override toString in order > to get a proper explain output. It'd be better to just depend on shortName > for those. > Before: > {noformat} > scala> spark.read.text("test.text").explain() > == Physical Plan == > *FileScan text [value#15] Batched: false, Format: > org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: > InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {noformat} > After: > {noformat} > scala> spark.read.text("test.text").explain() > == Physical Plan == > *FileScan text [value#15] Batched: false, Format: text, Location: > InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18760) Provide consistent format output for all file formats
[ https://issues.apache.org/jira/browse/SPARK-18760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727746#comment-15727746 ] Apache Spark commented on SPARK-18760: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/16187 > Provide consistent format output for all file formats > - > > Key: SPARK-18760 > URL: https://issues.apache.org/jira/browse/SPARK-18760 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > We currently rely on FileFormat implementations to override toString in order > to get a proper explain output. It'd be better to just depend on shortName > for those. > Before: > {noformat} > scala> spark.read.text("test.text").explain() > == Physical Plan == > *FileScan text [value#15] Batched: false, Format: > org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: > InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {noformat} > After: > {noformat} > scala> spark.read.text("test.text").explain() > == Physical Plan == > *FileScan text [value#15] Batched: false, Format: text, Location: > InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18760) Provide consistent format output for all file formats
[ https://issues.apache.org/jira/browse/SPARK-18760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18760: Assignee: Apache Spark (was: Reynold Xin) > Provide consistent format output for all file formats > - > > Key: SPARK-18760 > URL: https://issues.apache.org/jira/browse/SPARK-18760 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > We currently rely on FileFormat implementations to override toString in order > to get a proper explain output. It'd be better to just depend on shortName > for those. > Before: > {noformat} > scala> spark.read.text("test.text").explain() > == Physical Plan == > *FileScan text [value#15] Batched: false, Format: > org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: > InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {noformat} > After: > {noformat} > scala> spark.read.text("test.text").explain() > == Physical Plan == > *FileScan text [value#15] Batched: false, Format: text, Location: > InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18760) Provide consistent format output for all file formats
Reynold Xin created SPARK-18760: --- Summary: Provide consistent format output for all file formats Key: SPARK-18760 URL: https://issues.apache.org/jira/browse/SPARK-18760 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We currently rely on FileFormat implementations to override toString in order to get a proper explain output. It'd be better to just depend on shortName for those. Before: {noformat} scala> spark.read.text("test.text").explain() == Physical Plan == *FileScan text [value#15] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct {noformat} After: {noformat} scala> spark.read.text("test.text").explain() == Physical Plan == *FileScan text [value#15] Batched: false, Format: text, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
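The before/after explain output above boils down to preferring a declared short name over the default object string. A toy Python sketch of that idea (illustrative names, not Spark's API):

```python
# Illustrative: a plan printer that uses a format's declared short name
# when available, instead of the default repr (which yields the noisy
# "ClassName@hash"-style string shown in the "Before" output).

class TextFileFormat:
    short_name = "text"   # analogous in spirit to FileFormat.shortName

def describe_scan(fmt):
    # Fall back to str(fmt) only when no short name is declared.
    return f"*FileScan {getattr(fmt, 'short_name', str(fmt))}"
```

An object with no `short_name` still renders via `str`, so formats that have not declared one keep the old behavior.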
[jira] [Closed] (SPARK-11482) Maven repo in IsolatedClientLoader should be configurable.
[ https://issues.apache.org/jira/browse/SPARK-11482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-11482. --- Resolution: Later > Maven repo in IsolatedClientLoader should be configurable. > --- > > Key: SPARK-11482 > URL: https://issues.apache.org/jira/browse/SPARK-11482 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1 >Reporter: Doug Balog >Priority: Minor > > The maven repo used to fetch the hive jars and dependencies is hard coded. > A user should be able to override it via configuration.
[jira] [Closed] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet
[ https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-7263. -- Resolution: Later > Add new shuffle manager which stores shuffle blocks in Parquet > -- > > Key: SPARK-7263 > URL: https://issues.apache.org/jira/browse/SPARK-7263 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Reporter: Matt Massie > > I have a working prototype of this feature that can be viewed at > https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1 > Setting the "spark.shuffle.manager" to "parquet" enables this shuffle manager. > The dictionary support that Parquet provides appreciably reduces the amount of > memory that objects use; however, once Parquet data is shuffled, all the > dictionary information is lost and the column-oriented data is written to > shuffle > blocks in a record-oriented fashion. This shuffle manager addresses this issue > by reading and writing all shuffle blocks in the Parquet format. > If shuffle objects are Avro records, then the Avro $SCHEMA is converted to > Parquet > schema and used directly, otherwise, the Parquet schema is generated via > reflection. > Currently, the only non-Avro keys supported is primitive types. The reflection > code can be improved (or replaced) to support complex records. > The ParquetShufflePair class allows the shuffle key and value to be stored in > Parquet blocks as a single record with a single schema. > This commit adds the following new Spark configuration options: > "spark.shuffle.parquet.compression" - sets the Parquet compression codec > "spark.shuffle.parquet.blocksize" - sets the Parquet block size > "spark.shuffle.parquet.pagesize" - set the Parquet page size > "spark.shuffle.parquet.enabledictionary" - turns dictionary encoding on/off > Parquet does not (and has no plans to) support a streaming API. Metadata > sections > are scattered through a Parquet file making a streaming API difficult. 
As > such, > the ShuffleBlockFetcherIterator has been modified to fetch the entire contents > of map outputs into temporary blocks before loading the data into the reducer. > Interesting future asides: > o There is no need to define a data serializer (although Spark requires it) > o Parquet support predicate pushdown and projection which could be used at > between shuffle stages to improve performance in the future -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8398) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats
[ https://issues.apache.org/jira/browse/SPARK-8398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-8398. -- Resolution: Later > Consistently expose Hadoop Configuration/JobConf parameters for Hadoop > input/output formats > --- > > Key: SPARK-8398 > URL: https://issues.apache.org/jira/browse/SPARK-8398 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: koert kuipers >Priority: Trivial > > Currently a custom Hadoop Configuration or JobConf can be passed into quite a > few functions that use Hadoop input formats to read or Hadoop output formats > to write data. The goal of this JIRA is to make this consistent and expose > Configuration/JobConf for all these methods, which facilitates re-use and > discourages many additional parameters (that end up changing the > Configuration/JobConf internally). > See also: > http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-hadoop-input-output-format-advanced-control-td11168.html
[jira] [Updated] (SPARK-18678) Skewed reservoir sampling in SamplingUtils
[ https://issues.apache.org/jira/browse/SPARK-18678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18678: -- Summary: Skewed reservoir sampling in SamplingUtils (was: Skewed feature subsampling in Random forest) > Skewed reservoir sampling in SamplingUtils > -- > > Key: SPARK-18678 > URL: https://issues.apache.org/jira/browse/SPARK-18678 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.2 >Reporter: Bjoern Toldbod > > The feature subsampling performed in the RandomForest implementation in > org.apache.spark.ml.tree.impl.RandomForest > uses SamplingUtils.reservoirSampleAndCount. > The implementation of the sampling skews feature selection in favor of > features with a higher index. > The skewness is smaller for a large number of features, but completely > dominates the feature selection for a small number of features. The extreme > case is when the number of features is 2 and the number of features to select > is 1. > In this case the feature sampling will always pick feature 1 and ignore > feature 0. > Of course this produces low-quality models for few features when using > subsampling.
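For reference, a correctly weighted reservoir sampler keeps element i (0-based, once the reservoir of size k is full) with probability k/(i+1), which gives every element the same overall k/n chance; in the 2-features/select-1 case each feature then wins about half the time, instead of feature 1 always winning as described above. A minimal Python version of Algorithm R:

```python
# Minimal correct reservoir sampling (Algorithm R), illustrating the
# behavior the skewed implementation violates: each of the n items ends
# up in the size-k reservoir with probability exactly k/n.
import random

def reservoir_sample(items, k, rng):
    reservoir = []
    for i, x in enumerate(items):
        if i < k:
            reservoir.append(x)            # fill the reservoir first
        else:
            j = rng.randrange(i + 1)       # uniform in [0, i]
            if j < k:
                reservoir[j] = x           # replace with probability k/(i+1)
    return reservoir
```

Running the n=2, k=1 case many times should select feature 0 roughly half the time; a sampler that never returns feature 0 exhibits the bug in this report.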
[jira] [Resolved] (SPARK-16948) Use metastore schema instead of inferring schema for ORC in HiveMetastoreCatalog
[ https://issues.apache.org/jira/browse/SPARK-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16948. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.1.0 > Use metastore schema instead of inferring schema for ORC in > HiveMetastoreCatalog > > > Key: SPARK-16948 > URL: https://issues.apache.org/jira/browse/SPARK-16948 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Rajesh Balamohan >Assignee: Eric Liang >Priority: Minor > Fix For: 2.1.0 > > > Querying empty partitioned ORC tables from spark-sql throws exception with > "spark.sql.hive.convertMetastoreOrc=true". > {noformat} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:297) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:284) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:284) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$OrcConversions$$convertToOrcRelation(HiveMetastoreCatalo) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:423) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:414) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18759) when use spark streaming with sparksql, lots of temp directories are created.
Albert Cheng created SPARK-18759: Summary: when use spark streaming with sparksql, lots of temp directories are created. Key: SPARK-18759 URL: https://issues.apache.org/jira/browse/SPARK-18759 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Albert Cheng When using Spark Streaming with Spark SQL to insert records into an existing Hive table, lots of temp directories are created. Those directories are deleted only when the JVM exits, but a Spark Streaming job keeps its JVM running 7*24 hours, so they are never cleaned up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18758) StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query
[ https://issues.apache.org/jira/browse/SPARK-18758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727580#comment-15727580 ] Apache Spark commented on SPARK-18758: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/16186 > StreamingQueryListener events from a StreamingQuery should be sent only to > the listeners in the same session as the query > - > > Key: SPARK-18758 > URL: https://issues.apache.org/jira/browse/SPARK-18758 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2 >Reporter: Tathagata Das >Priority: Critical > > Listeners added with `sparkSession.streams.addListener(l)` are added to a > SparkSession. So only events from queries in the same session as a listener > should be posted to that listener. > Currently, all the events get routed through Spark's main listener bus, > and therefore all StreamingQueryListener events get posted to > StreamingQueryListeners in all sessions. This is wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18758) StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query
[ https://issues.apache.org/jira/browse/SPARK-18758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18758: Assignee: (was: Apache Spark) > StreamingQueryListener events from a StreamingQuery should be sent only to > the listeners in the same session as the query > - > > Key: SPARK-18758 > URL: https://issues.apache.org/jira/browse/SPARK-18758 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2 >Reporter: Tathagata Das >Priority: Critical > > Listeners added with `sparkSession.streams.addListener(l)` are added to a > SparkSession. So events only from queries in the same session as a listener > should be posted to the listener. > Currently, all the events gets routed through the Spark's main listener bus, > and therefore all StreamingQueryListener events gets posted to > StreamingQueryListeners in all sessions. This is wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18758) StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query
[ https://issues.apache.org/jira/browse/SPARK-18758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18758: Assignee: Apache Spark > StreamingQueryListener events from a StreamingQuery should be sent only to > the listeners in the same session as the query > - > > Key: SPARK-18758 > URL: https://issues.apache.org/jira/browse/SPARK-18758 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2 >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Critical > > Listeners added with `sparkSession.streams.addListener(l)` are added to a > SparkSession. So events only from queries in the same session as a listener > should be posted to the listener. > Currently, all the events gets routed through the Spark's main listener bus, > and therefore all StreamingQueryListener events gets posted to > StreamingQueryListeners in all sessions. This is wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18758) StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query
[ https://issues.apache.org/jira/browse/SPARK-18758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-18758: -- Description: Listeners added with `sparkSession.streams.addListener(l)` are added to a SparkSession. So events only from queries in the same session as a listener should be posted to the listener. Currently, all the events gets routed through the Spark's main listener bus, and therefore all StreamingQueryListener events gets posted to StreamingQueryListeners in all sessions. This is wrong. was:Listeners added with `sparkSession.streams.addListener(l)` are added to a SparkSession. So events only from queries in the same session as a listener should be posted to the listener. > StreamingQueryListener events from a StreamingQuery should be sent only to > the listeners in the same session as the query > - > > Key: SPARK-18758 > URL: https://issues.apache.org/jira/browse/SPARK-18758 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2 >Reporter: Tathagata Das >Priority: Critical > > Listeners added with `sparkSession.streams.addListener(l)` are added to a > SparkSession. So events only from queries in the same session as a listener > should be posted to the listener. > Currently, all the events gets routed through the Spark's main listener bus, > and therefore all StreamingQueryListener events gets posted to > StreamingQueryListeners in all sessions. This is wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
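The session-scoping this ticket asks for can be sketched language-agnostically. Below is a toy Python model of the intended routing; the names (`SessionAwareListenerBus`, `add_listener`, `post`) are invented for illustration and are not Spark's actual `StreamingQueryListenerBus` API:

```python
class SessionAwareListenerBus:
    """Toy model of session-scoped event routing.

    Mirrors the fix described in the ticket: a listener registered
    through one session receives events only from queries started in
    that same session, instead of every event on one shared global bus.
    """

    def __init__(self):
        self._listeners = {}  # session_id -> list of listener callables

    def add_listener(self, session_id, listener):
        self._listeners.setdefault(session_id, []).append(listener)

    def post(self, session_id, event):
        # Deliver only to listeners registered in the originating session.
        for listener in self._listeners.get(session_id, []):
            listener(event)
```

The buggy behavior corresponds to `post` iterating over every registered listener regardless of `session_id`.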
[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file
[ https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727542#comment-15727542 ] Liang-Chi Hsieh commented on SPARK-18539: - [~lian cheng], in Parquet's code, it looks like a null column can still have its ColumnChunkMetaData. It won't cause a problem even before PARQUET-389, because Parquet checks whether all values in the chunk are null. PARQUET-389 resolves the case where there is no ColumnChunkMetaData for a column at all, i.e., the column is missing from the Parquet file. So what I am not sure about is: in a Parquet file, can a nullable column have no ColumnChunkMetaData, as you said? I would appreciate it if you could clarify. Thanks. > Cannot filter by nonexisting column in parquet file > --- > > Key: SPARK-18539 > URL: https://issues.apache.org/jira/browse/SPARK-18539 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.0.2 >Reporter: Vitaly Gerasimov >Priority: Critical > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.DataTypes._ > import org.apache.spark.sql.types.{StructField, StructType} > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}""")) > sc.read > .schema(StructType(Seq(StructField("a", IntegerType)))) > .json(jsonRDD) > .write > .parquet("/tmp/test") > sc.read > .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", > IntegerType, nullable = true)))) > .load("/tmp/test") > .createOrReplaceTempView("table") > sc.sql("select b from table where b is not null").show() > {code} > returns: > {code} > 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalArgumentException: Column [b] was not found in schema! 
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59) > at > org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) > at > org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at >
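The stack trace above shows Parquet's SchemaCompatibilityValidator rejecting a pushed-down filter on column `b`, which exists in the requested Spark schema but not in the file's footer. The shape of the fix can be sketched as a pushdown guard: only push a filter whose referenced column exists in that particular file, and let Spark evaluate the rest after reading. This is a hypothetical helper for illustration, not the actual code in ParquetFileFormat:

```python
def safe_pushdown_filters(filters, file_schema_columns):
    """Split filters into (pushed, residual) by file-schema membership.

    Each filter is modeled as (referenced_column, predicate_string).
    A filter on a column missing from the file's footer must stay on
    the Spark side; pushing it into the Parquet reader triggers
    'Column [...] was not found in schema!' as in the trace above.
    """
    cols = set(file_schema_columns)
    pushed, residual = [], []
    for referenced_col, predicate in filters:
        target = pushed if referenced_col in cols else residual
        target.append((referenced_col, predicate))
    return pushed, residual
```

In the reported repro, `b IS NOT NULL` would land in `residual` and be evaluated against the all-null column Spark synthesizes for the missing field.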
[jira] [Created] (SPARK-18758) StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query
Tathagata Das created SPARK-18758: - Summary: StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query Key: SPARK-18758 URL: https://issues.apache.org/jira/browse/SPARK-18758 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.0.2 Reporter: Tathagata Das Priority: Critical Listeners added with `sparkSession.streams.addListener(l)` are added to a SparkSession. So events only from queries in the same session as a listener should be posted to the listener. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18757) Models in Pyspark support column setters
[ https://issues.apache.org/jira/browse/SPARK-18757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-18757: - Description: Recently, I found three places in which column setters are missing: KMeansModel, BisectingKMeansModel and OneVsRestModel. These three models directly inherit `Model`, which doesn't have column setters, so I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520]. For now, models in pyspark still don't support column setters at all. I suggest that we keep the hierarchy of pyspark models in line with that on the Scala side: For classification and regression algs, I'm making a trial in [SPARK-18379]. In it, I try to copy the hierarchy from the Scala side. For clustering algs, I think we may first create abstract classes {{ClusteringModel}} and {{ProbabilisticClusteringModel}} on the Scala side, and make clustering algs inherit them. Then, on the Python side, we copy the hierarchy so that we don't need to add setters manually for each alg. For feature algs, we can also use an abstract class {{FeatureModel}} on the Scala side, and do the same thing. What are your opinions? [~yanboliang][~josephkb][~sethah][~srowen] was: Recently, I found three places in which column setters are missing: KMeansModel, BisectingKMeansModel and OneVsRestModel. These three models directly inherit `Model`, which doesn't have column setters, so I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520]. For now, models in pyspark still don't support column setters at all. I suggest that we keep the hierarchy of pyspark models in line with that on the Scala side: For classification and regression algs, I'm making a trial in [SPARK-18379]. In it, I try to copy the hierarchy from the Scala side. For clustering algs, I think we may first create abstract classes {{ClusteringModel}} and {{ProbabilisticClusteringModel}} on the Scala side, and make clustering algs inherit them. 
Then, on the Python side, we copy the hierarchy so that we don't need to add setters for each alg. For feature algs, we can also use an abstract class {{FeatureModel}} on the Scala side, and do the same thing. What are your opinions? [~yanboliang][~josephkb][~sethah][~srowen] > Models in Pyspark support column setters > > > Key: SPARK-18757 > URL: https://issues.apache.org/jira/browse/SPARK-18757 > Project: Spark > Issue Type: Brainstorming > Components: ML, PySpark >Reporter: zhengruifeng > > Recently, I found three places in which column setters are missing: > KMeansModel, BisectingKMeansModel and OneVsRestModel. > These three models directly inherit `Model`, which doesn't have column > setters, so I had to add the missing setters manually in [SPARK-18625] and > [SPARK-18520]. > For now, models in pyspark still don't support column setters at all. > I suggest that we keep the hierarchy of pyspark models in line with that on > the Scala side: > For classification and regression algs, I'm making a trial in [SPARK-18379]. > In it, I try to copy the hierarchy from the Scala side. > For clustering algs, I think we may first create abstract classes > {{ClusteringModel}} and {{ProbabilisticClusteringModel}} on the Scala side, > and make clustering algs inherit them. Then, on the Python side, we copy the > hierarchy so that we don't need to add setters manually for each alg. > For feature algs, we can also use an abstract class {{FeatureModel}} on the > Scala side, and do the same thing. > What are your opinions? [~yanboliang][~josephkb][~sethah][~srowen] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18753) Inconsistent behavior after writing to parquet files
[ https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727496#comment-15727496 ] Apache Spark commented on SPARK-18753: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16184 > Inconsistent behavior after writing to parquet files > > > Key: SPARK-18753 > URL: https://issues.apache.org/jira/browse/SPARK-18753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu > > Found an inconsistent behavior when using parquet. > {code} > scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: > java.lang.Boolean, new java.lang.Boolean(false)).toDS > ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] > scala> ds.filter('value === "true").show > +-+ > |value| > +-+ > +-+ > {code} > In the above example, `ds.filter('value === "true")` returns nothing as > "true" will be converted to null and the filter expression will be always > null, so it drops all rows. > However, if I store `ds` to a parquet file and read it back, `filter('value > === "true")` will return non null values. > {code} > scala> ds.write.parquet("testfile") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > scala> val ds2 = spark.read.parquet("testfile") > ds2: org.apache.spark.sql.DataFrame = [value: boolean] > scala> ds2.filter('value === "true").show > +-+ > |value| > +-+ > | true| > |false| > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
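The in-memory behavior in the report follows from SQL's three-valued logic: per the description, the string literal "true" degrades to a NULL boolean, and `NULL = x` is NULL for every row, which a WHERE clause treats as "drop". The parquet path was inconsistent because it effectively compared the non-null values directly. A small Python model of the semantics (`sql_eq` and `where` are illustrative names, not Spark APIs):

```python
def sql_eq(left, right):
    """SQL three-valued equality: a NULL on either side yields NULL."""
    if left is None or right is None:
        return None
    return left == right

def where(rows, literal):
    """A WHERE clause keeps a row only when the predicate is strictly
    TRUE; both FALSE and NULL results drop the row."""
    return [v for v in rows if sql_eq(v, literal) is True]
```

With the dataset `[True, None, False]` from the report, a literal that degraded to NULL filters out every row, which is the in-memory result shown above.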
[jira] [Assigned] (SPARK-18753) Inconsistent behavior after writing to parquet files
[ https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18753: Assignee: Apache Spark > Inconsistent behavior after writing to parquet files > > > Key: SPARK-18753 > URL: https://issues.apache.org/jira/browse/SPARK-18753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Found an inconsistent behavior when using parquet. > {code} > scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: > java.lang.Boolean, new java.lang.Boolean(false)).toDS > ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] > scala> ds.filter('value === "true").show > +-+ > |value| > +-+ > +-+ > {code} > In the above example, `ds.filter('value === "true")` returns nothing as > "true" will be converted to null and the filter expression will be always > null, so it drops all rows. > However, if I store `ds` to a parquet file and read it back, `filter('value > === "true")` will return non null values. > {code} > scala> ds.write.parquet("testfile") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > scala> val ds2 = spark.read.parquet("testfile") > ds2: org.apache.spark.sql.DataFrame = [value: boolean] > scala> ds2.filter('value === "true").show > +-+ > |value| > +-+ > | true| > |false| > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18757) Models in Pyspark support column setters
[ https://issues.apache.org/jira/browse/SPARK-18757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-18757: - Description: Recently, I found three places in which column setters are missing: KMeansModel, BisectingKMeansModel and OneVsRestModel. These three models directly inherit `Model` which dont have columns setters, so I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520]. Fow now, models in pyspark still don't support column setters at all. I suggest that we keep the hierarchy of pyspark models in line with that in the scala side: For classifiation and regression algs, I‘m making a trial in [SPARK-18379]. In it, I try to copy the hierarchy from the scala side. For clustering algs, I think we may first create abstract classes {{ClusteringModel}} and {{ProbabilisticClusteringModel}} in the scala side, and make clustering algs inherit it. Then, in the python side, we copy the hierarchy so that we dont need to add setters for each alg. For features algs, we can also use a abstract class {{FeatureModel}} in scala side, and do the same thing. What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen] was: Recently, I found three places in which column setters are missing: KMeansModel, BisectingKMeansModel and OneVsRestModel. These three models directly inherit `Model` which dont have columns setters, so I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520]. Fow now, models in pyspark still don't support column setters at all. I suggest that we keep the hierarchy of pyspark models in line with that in the scala side: For classifiation and regression algs, I‘m making a trial in [SPARK-18379]. In it, I try to copy the hierarchy from the scala side. For clustering algs, I think we may first create abstract classes {{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering algs inherit it. 
Then, in the python side, we copy the hierarchy so that we dont need to add setters for each alg. For features algs, we can also use a abstract class {{FeatureModel}} in scala side, and do the same thing. What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen] > Models in Pyspark support column setters > > > Key: SPARK-18757 > URL: https://issues.apache.org/jira/browse/SPARK-18757 > Project: Spark > Issue Type: Brainstorming > Components: ML, PySpark >Reporter: zhengruifeng > > Recently, I found three places in which column setters are missing: > KMeansModel, BisectingKMeansModel and OneVsRestModel. > These three models directly inherit `Model` which dont have columns setters, > so I had to add the missing setters manually in [SPARK-18625] and > [SPARK-18520]. > Fow now, models in pyspark still don't support column setters at all. > I suggest that we keep the hierarchy of pyspark models in line with that in > the scala side: > For classifiation and regression algs, I‘m making a trial in [SPARK-18379]. > In it, I try to copy the hierarchy from the scala side. > For clustering algs, I think we may first create abstract classes > {{ClusteringModel}} and {{ProbabilisticClusteringModel}} in the scala side, > and make clustering algs inherit it. Then, in the python side, we copy the > hierarchy so that we dont need to add setters for each alg. > For features algs, we can also use a abstract class {{FeatureModel}} in scala > side, and do the same thing. > What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18753) Inconsistent behavior after writing to parquet files
[ https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18753: Assignee: (was: Apache Spark) > Inconsistent behavior after writing to parquet files > > > Key: SPARK-18753 > URL: https://issues.apache.org/jira/browse/SPARK-18753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu > > Found an inconsistent behavior when using parquet. > {code} > scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: > java.lang.Boolean, new java.lang.Boolean(false)).toDS > ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] > scala> ds.filter('value === "true").show > +-+ > |value| > +-+ > +-+ > {code} > In the above example, `ds.filter('value === "true")` returns nothing as > "true" will be converted to null and the filter expression will be always > null, so it drops all rows. > However, if I store `ds` to a parquet file and read it back, `filter('value > === "true")` will return non null values. > {code} > scala> ds.write.parquet("testfile") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > scala> val ds2 = spark.read.parquet("testfile") > ds2: org.apache.spark.sql.DataFrame = [value: boolean] > scala> ds2.filter('value === "true").show > +-+ > |value| > +-+ > | true| > |false| > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18757) Models in Pyspark support column setters
[ https://issues.apache.org/jira/browse/SPARK-18757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-18757: - Description: Recently, I found three places in which column setters are missing: KMeansModel, BisectingKMeansModel and OneVsRestModel. These three models directly inherit `Model` which dont have columns setters, so I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520]. Fow now, models in pyspark still don't support column setters at all. I suggest that we keep the hierarchy of pyspark models in line with that in the scala side: For classifiation and regression algs, I‘m making a trial in [SPARK-18379]. In it, I try to copy the hierarchy from the scala side. For clustering algs, I think we may first create abstract classes {{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering algs inherit it. Then, in the python side, we copy the hierarchy so that we dont need to add setters for each alg. For features algs, we can also use a abstract class {{FeatureModel}} in scala side, and do the same thing. What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen] was: Recently, I found three places in which column setters are missing: KMeansModel, BisectingKMeansModel and OneVsRestModel. These three models directly inherit `Model` which dont have columns setters, so I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520]. Fow now, models in pyspark still don't support column setters at all. I suggest that we keep the hierarchy of pyspark models in line with that in the scala side: For classifiation and regression algs, I‘m making a trial in [SPARK-18379] For clustering algs, I think we may first create abstract classes {{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering algs inherit it. Then, in the python side, we copy the hierarchy so that we dont need to add setters for each alg. 
For features algs, we can also use a abstract class {{FeatureModel}} in scala side, and do the same thing. What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen] > Models in Pyspark support column setters > > > Key: SPARK-18757 > URL: https://issues.apache.org/jira/browse/SPARK-18757 > Project: Spark > Issue Type: Brainstorming > Components: ML, PySpark >Reporter: zhengruifeng > > Recently, I found three places in which column setters are missing: > KMeansModel, BisectingKMeansModel and OneVsRestModel. > These three models directly inherit `Model` which dont have columns setters, > so I had to add the missing setters manually in [SPARK-18625] and > [SPARK-18520]. > Fow now, models in pyspark still don't support column setters at all. > I suggest that we keep the hierarchy of pyspark models in line with that in > the scala side: > For classifiation and regression algs, I‘m making a trial in [SPARK-18379]. > In it, I try to copy the hierarchy from the scala side. > For clustering algs, I think we may first create abstract classes > {{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering > algs inherit it. Then, in the python side, we copy the hierarchy so that we > dont need to add setters for each alg. > For features algs, we can also use a abstract class {{FeatureModel}} in scala > side, and do the same thing. > What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18757) Models in Pyspark support column setters
[ https://issues.apache.org/jira/browse/SPARK-18757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-18757: - Description: Recently, I found three places in which column setters are missing: KMeansModel, BisectingKMeansModel and OneVsRestModel. These three models directly inherit `Model` which dont have columns setters, so I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520]. Fow now, models in pyspark still don't support column setters at all. I suggest that we keep the hierarchy of pyspark models in line with that in the scala side: For classifiation and regression algs, I‘m making a trial in [SPARK-18379] For clustering algs, I think we may first create abstract classes {{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering algs inherit it. Then, in the python side, we copy the hierarchy so that we dont need to add setters for each alg. For features algs, we can also use a abstract class {{FeatureModel}} in scala side, and do the same thing. What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen] was: Recently, I found three places in which column setters are missing: KMeansModel, BisectingKMeansModel and BisectingKMeansModel. These three models directly inherit `Model` which dont have columns setters, so I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520]. Fow now, models in pyspark still don't support column setters at all. I suggest that we keep the hierarchy of pyspark models in line with that in the scala side: For classifiation and regression algs, I‘m making a trial in [SPARK-18379] For clustering algs, I think we may first create abstract classes {{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering algs inherit it. Then, in the python side, we copy the hierarchy so that we dont need to add setters for each alg. 
For features algs, we can also use a abstract class {{FeatureModel}} in scala side, and do the same thing. What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen] > Models in Pyspark support column setters > > > Key: SPARK-18757 > URL: https://issues.apache.org/jira/browse/SPARK-18757 > Project: Spark > Issue Type: Brainstorming > Components: ML, PySpark >Reporter: zhengruifeng > > Recently, I found three places in which column setters are missing: > KMeansModel, BisectingKMeansModel and OneVsRestModel. > These three models directly inherit `Model` which dont have columns setters, > so I had to add the missing setters manually in [SPARK-18625] and > [SPARK-18520]. > Fow now, models in pyspark still don't support column setters at all. > I suggest that we keep the hierarchy of pyspark models in line with that in > the scala side: > For classifiation and regression algs, I‘m making a trial in [SPARK-18379] > For clustering algs, I think we may first create abstract classes > {{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering > algs inherit it. Then, in the python side, we copy the hierarchy so that we > dont need to add setters for each alg. > For features algs, we can also use a abstract class {{FeatureModel}} in scala > side, and do the same thing. > What's your opinions? [~yanboliang][~josephkb][~sethah][~srowen] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18757) Models in Pyspark support column setters
zhengruifeng created SPARK-18757: - Summary: Models in Pyspark support column setters Key: SPARK-18757 URL: https://issues.apache.org/jira/browse/SPARK-18757 Project: Spark Issue Type: Brainstorming Components: ML, PySpark Reporter: zhengruifeng Recently, I found three places in which column setters are missing: KMeansModel, BisectingKMeansModel and BisectingKMeansModel. These three models directly inherit `Model`, which doesn't have column setters, so I had to add the missing setters manually in [SPARK-18625] and [SPARK-18520]. For now, models in pyspark still don't support column setters at all. I suggest that we keep the hierarchy of pyspark models in line with that on the scala side: For classification and regression algs, I'm making a trial in [SPARK-18379]. For clustering algs, I think we may first create abstract classes {{ClusteringModel}} and {{ProbabilisticClusteringModel}}, and make clustering algs inherit them. Then, on the python side, we copy the hierarchy so that we don't need to add setters for each alg. For feature algs, we can also use an abstract class {{FeatureModel}} on the scala side, and do the same thing. What are your opinions? [~yanboliang][~josephkb][~sethah][~srowen] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727475#comment-15727475 ] Marcelo Vanzin commented on SPARK-18085: I'm not trying to flame you. I'm trying to point out that the issues you raised, while valid on their own, are not related to the problem described in this bug, and trying to discuss those here is counter-productive. If you care about those, you should open separate bugs. The SHS memory issues are not caused by the event log format nor by its size. The SHS does not load the whole event log into memory, nor does it keep anything JSON-formatted in memory. So the fact that the event logs are in JSON is not relevant to how much memory the SHS is using. > Better History Server scalability for many / large applications > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to solving them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18756) Memory leak in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-18756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727466#comment-15727466 ] Sean Owen commented on SPARK-18756: --- CC [~zsxwing] is this related to the netty byte buffer stuff you've been dealing with for a while? > Memory leak in Spark streaming > -- > > Key: SPARK-18756 > URL: https://issues.apache.org/jira/browse/SPARK-18756 > Project: Spark > Issue Type: Bug > Components: Block Manager, DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Udit Mehrotra > > We have a Spark streaming application that processes data from Kinesis. > In our application we are observing a memory leak at the Executors, with Netty > buffers not being released properly when the Spark BlockManager tries to > replicate the input blocks received from the Kinesis stream. The leak occurs > when we set the storage level to MEMORY_AND_DISK_2 for the Kinesis input blocks. > However, if we change the storage level to MEMORY_AND_DISK, which avoids > creating a replica, we do not observe the leak any more. We were able to > detect the leak and obtain the stack trace by running the executors with an > additional JVM option: -Dio.netty.leakDetectionLevel=advanced. > Here is the stack trace of the leak: > 16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not > called before it's garbage-collected. See > http://netty.io/wiki/reference-counted-objects.html for more information. 
> Recent access records: 0 > Created at: > io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:103) > io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335) > io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247) > > org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69) > > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182) > > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997) > > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > > org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:702) > > org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80) > > org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158) > > org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129) > org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133) > > org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282) > > org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352) > > org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297) > > org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269) > > org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110) > We also observe a continuous increase in off-heap memory usage at the > executors. Any help would be appreciated. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
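The leak reported above is a reference-counting leak: Netty ByteBufs must have every retain() paired with a release(), and the replication path evidently takes a reference it never gives back. A toy sketch in plain Python (not Netty code; the class and the "replication" step are illustrative stand-ins) shows the failure mode:

```python
class RefCountedBuf:
    """Toy reference-counted buffer mimicking Netty's ByteBuf contract:
    every retain() must be paired with a release(), or the buffer leaks."""

    def __init__(self, data):
        self.data = data
        self.refcnt = 1  # creator holds the first reference

    def retain(self):
        self.refcnt += 1
        return self

    def release(self):
        assert self.refcnt > 0, "buffer over-released"
        self.refcnt -= 1
        return self.refcnt == 0  # True when the buffer is actually freed


buf = RefCountedBuf(b"input block")
replica = buf.retain()   # hypothetical replication path takes an extra ref
buf.release()            # local storage path releases its own ref

# The replication path never calls release() on its reference, so the
# refcount never reaches zero -- exactly what the leak detector flags.
leaked = buf.refcnt > 0
```

This is why disabling replication (MEMORY_AND_DISK instead of MEMORY_AND_DISK_2) makes the symptom disappear: the extra, never-released reference is only taken on the replication path.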
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727446#comment-15727446 ] Dmitry Buzolin commented on SPARK-18085: I posted my comments not to start an endless flame war over what is orthogonal and what is not. It is up to you how to use them. I speak from my experience running Spark clusters of substantial size. If you think offloading the problem from memory to disk storage is the way to go, do it. I'd be happy to see SHS performance improvements in the next Spark release. > Better History Server scalability for many / large applications > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to solving them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18756) Memory leak in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-18756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udit Mehrotra updated SPARK-18756: -- Description: We have a Spark streaming application that processes data from Kinesis. In our application we are observing a memory leak at the Executors, with Netty buffers not being released properly when the Spark BlockManager tries to replicate the input blocks received from the Kinesis stream. The leak occurs when we set the storage level to MEMORY_AND_DISK_2 for the Kinesis input blocks. However, if we change the storage level to MEMORY_AND_DISK, which avoids creating a replica, we do not observe the leak any more. We were able to detect the leak and obtain the stack trace by running the executors with an additional JVM option: -Dio.netty.leakDetectionLevel=advanced. Here is the stack trace of the leak: 16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information. 
Recent access records: 0 Created at: io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:103) io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335) io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247) org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69) org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182) org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:997) org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:702) org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80) org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158) org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129) org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133) org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282) org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352) org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297) org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269) org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110) We also observe a continuous increase in off-heap memory usage at the executors. Any help would be appreciated. 
> Memory leak in Spark streaming > -- > > Key: SPARK-18756 > URL: https://issues.apache.org/jira/browse/SPARK-18756 > Project: Spark > Issue Type: Bug > Components: Block Manager, DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Udit Mehrotra > > We have a Spark streaming application that processes data from Kinesis. > In our application we are observing a memory leak at the Executors, with Netty > buffers not being released properly when the Spark BlockManager tries to > replicate the input blocks received from the Kinesis stream. The leak occurs > when we set the storage level to MEMORY_AND_DISK_2 for the Kinesis input blocks. > However, if we change the storage level to MEMORY_AND_DISK, which avoids > creating a replica, we do not observe the leak any more. We were able to > detect the leak and obtain the stack trace by running the executors with an > additional JVM option: -Dio.netty.leakDetectionLevel=advanced. > Here is the stack trace of the leak: > 16/12/06 22:30:12 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not > called before it's garbage-collected. See > http://netty.io/wiki/reference-counted-objects.html for more information. > Recent access records: 0 > Created at: > io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:103) > io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335) > io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247) > > org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69) > > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1182) > >
[jira] [Created] (SPARK-18756) Memory leak in Spark streaming
Udit Mehrotra created SPARK-18756: - Summary: Memory leak in Spark streaming Key: SPARK-18756 URL: https://issues.apache.org/jira/browse/SPARK-18756 Project: Spark Issue Type: Bug Components: Block Manager, DStreams Affects Versions: 2.0.2, 2.0.1, 2.0.0 Reporter: Udit Mehrotra -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18739) Models in pyspark.classification and regression support setXXXCol methods
[ https://issues.apache.org/jira/browse/SPARK-18739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-18739: - Summary: Models in pyspark.classification and regression support setXXXCol methods (was: Models in pyspark.classification support setXXXCol methods) > Models in pyspark.classification and regression support setXXXCol methods > - > > Key: SPARK-18739 > URL: https://issues.apache.org/jira/browse/SPARK-18739 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: zhengruifeng > > Currently, models in pyspark don't support {{setXXXCol}} methods at all. > I updated models in {{classification.py}} according to the hierarchy on the scala > side: > 1, add {{setFeaturesCol}} and {{setPredictionCol}} in class > {{JavaPredictionModel}} > 2, add {{setRawPredictionCol}} in class {{JavaClassificationModel}} > 3, create class {{JavaProbabilisticClassificationModel}} inheriting > {{JavaClassificationModel}}, and add {{setProbabilityCol}} in it > 4, {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, > {{RandomForestClassificationModel}} and {{NaiveBayesModel}} inherit > {{JavaProbabilisticClassificationModel}} > 5, {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} > inherit {{JavaClassificationModel}} > 6, {{OneVsRestModel}} inherits {{JavaModel}}, and adds {{setFeaturesCol}} and > {{setPredictionCol}} methods. > With regard to models in clustering and features, I suggest that we first add > some abstract classes like {{ClusteringModel}}, > {{ProbabilisticClusteringModel}}, {{FeatureModel}} on the scala side, > otherwise we need to manually add setXXXCol methods one by one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
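The six numbered steps above can be sketched in plain Python (the class names follow the JIRA's proposal; the bodies are simplified stand-ins, not the real pyspark implementation):

```python
class JavaModel:
    """Illustrative stand-in for pyspark's JavaModel base class."""
    def __init__(self):
        self._params = {}

    def _set(self, **kw):
        self._params.update(kw)
        return self  # fluent chaining, as in pyspark


class JavaPredictionModel(JavaModel):
    # Step 1: common input/output column setters live here.
    def setFeaturesCol(self, v):
        return self._set(featuresCol=v)

    def setPredictionCol(self, v):
        return self._set(predictionCol=v)


class JavaClassificationModel(JavaPredictionModel):
    # Step 2: classification adds the raw prediction column.
    def setRawPredictionCol(self, v):
        return self._set(rawPredictionCol=v)


class JavaProbabilisticClassificationModel(JavaClassificationModel):
    # Step 3: probabilistic classifiers add the probability column.
    def setProbabilityCol(self, v):
        return self._set(probabilityCol=v)


# Step 4: e.g. LogisticRegressionModel gets all four setters for free.
class LogisticRegressionModel(JavaProbabilisticClassificationModel):
    pass


lr = (LogisticRegressionModel()
      .setFeaturesCol("features")
      .setRawPredictionCol("raw")
      .setProbabilityCol("prob"))
```

Each concrete model only declares its position in the hierarchy; the setters themselves are written once, which is the point of steps 1-3.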
[jira] [Commented] (SPARK-18736) CreateMap allows non-unique keys
[ https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727398#comment-15727398 ] Shuai Lin commented on SPARK-18736: --- Ok, sounds good to me. > CreateMap allows non-unique keys > > > Key: SPARK-18736 > URL: https://issues.apache.org/jira/browse/SPARK-18736 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eyal Farago > Labels: map, sql, types > > In Spark SQL, {{CreateMap}} does not enforce unique keys, i.e. it's possible to > create a map with two identical keys: > {noformat} > CreateMap(Literal(1), Literal(11), Literal(1), Literal(12)) > {noformat} > This does not behave like standard maps in common programming languages. A > proper behavior should be chosen: > # first 'wins' > # last 'wins' > # runtime error. > {{GetMapValue}} currently implements option #1. Even if this is the desired > behavior, {{CreateMap}} should return a map with unique keys. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
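The difference between options #1 and #2 can be made concrete with a small Python sketch (plain Python, not Catalyst code; the function names are made up for illustration). Both take alternating key/value arguments, as {{CreateMap}} does:

```python
def create_map_first_wins(*kv):
    """Build a map where the FIRST occurrence of a duplicate key wins
    (option #1 -- the lookup semantics GetMapValue currently implements)."""
    m = {}
    for k, v in zip(kv[::2], kv[1::2]):
        if k not in m:
            m[k] = v
    return m


def create_map_last_wins(*kv):
    """Option #2: the LAST occurrence wins. This is plain dict() behaviour
    in Python and what SQL UPDATE-like semantics would suggest."""
    return dict(zip(kv[::2], kv[1::2]))


# Same duplicate-key input as the {noformat} example above:
first = create_map_first_wins(1, 11, 1, 12)  # key 1 maps to 11
last = create_map_last_wins(1, 11, 1, 12)    # key 1 maps to 12
```

Whichever option is chosen, the point of the issue stands: the constructed map should already be deduplicated, so lookups and the stored map agree.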
[jira] [Updated] (SPARK-18755) Add Randomized Grid Search to Spark ML
[ https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-18755: --- Description: Randomized Grid Search implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search: 1. A budget can be chosen independent of the number of parameters and possible values. 2. Adding parameters that do not influence the performance does not decrease efficiency. Randomized grid search usually gives similar results to an exhaustive search, while the run time for randomized search is drastically lower. For more background, please refer to: sklearn: http://scikit-learn.org/stable/modules/grid_search.html http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/ http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/. There are two ways to implement this in Spark, as I see it: 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during build. Only 1 new public function is required. 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator and RandomizedTrainValidationSplit, which can be complicated since we need to deal with the models. I'd prefer option 1 as it's much simpler and straightforward. We can support randomized grid search with a minimal change. was: Randomized Grid Search implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search: 1. A budget can be chosen independent of the number of parameters and possible values. 2. Adding parameters that do not influence the performance does not decrease efficiency. 
Randomized grid search usually gives similar results to an exhaustive search, while the run time for randomized search is drastically lower. For more background, please refer to: sklearn: http://scikit-learn.org/stable/modules/grid_search.html http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/ http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/. There are two ways to implement this in Spark, as I see it: 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during build. Only 1 new public function is required. 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator and RandomizedTrainValidationSplit, which can be complicated since we need to deal with the models. I'd prefer option 1 as it's much simpler and straightforward. > Add Randomized Grid Search to Spark ML > -- > > Key: SPARK-18755 > URL: https://issues.apache.org/jira/browse/SPARK-18755 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang > > Randomized Grid Search implements a randomized search over parameters, where > each setting is sampled from a distribution over possible parameter values. > This has two main benefits over an exhaustive search: > 1. A budget can be chosen independent of the number of parameters and > possible values. > 2. Adding parameters that do not influence the performance does not decrease > efficiency. > Randomized grid search usually gives similar results to an exhaustive search, > while the run time for randomized search is drastically lower. 
> For more background, please refer to: > sklearn: http://scikit-learn.org/stable/modules/grid_search.html > http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/ > http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf > https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/. > There are two ways to implement this in Spark, as I see it: > 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during > build. Only 1 new public function is required. > 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator > and RandomizedTrainValidationSplit, which can be complicated since we need to > deal with the models. > I'd prefer option 1 as it's much simpler and straightforward. We can support > randomized grid search with a minimal change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
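Option 1 above can be sketched in a few lines of plain Python (the function name and the searchRatio semantics are assumptions drawn from the proposal, not actual Spark ML API): build the full cartesian product the way ParamGridBuilder does, then keep only a random fraction of the settings.

```python
import itertools
import random


def build_param_grid(grid, search_ratio=1.0, seed=0):
    """Hypothetical sketch of option 1: enumerate the full grid, then
    sample a fraction of it when search_ratio < 1.0.

    grid: dict mapping param name -> list of candidate values.
    """
    names = sorted(grid)
    combos = [dict(zip(names, vals))
              for vals in itertools.product(*(grid[n] for n in names))]
    if search_ratio >= 1.0:
        return combos  # exhaustive search, current behaviour
    # The budget is now independent of the grid size: k settings, not all.
    k = max(1, int(len(combos) * search_ratio))
    return random.Random(seed).sample(combos, k)


grid = {"regParam": [0.01, 0.1, 1.0], "maxIter": [10, 50, 100]}
full = build_param_grid(grid)            # all 9 settings
sampled = build_param_grid(grid, 0.33)   # a fixed-budget random subset
```

This keeps the CrossValidator/TrainValidationSplit classes untouched, which is why the proposal calls option 1 the minimal change.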
[jira] [Updated] (SPARK-18755) Add Randomized Grid Search to Spark ML
[ https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-18755: --- Description: Randomized Grid Search implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search: 1. A budget can be chosen independent of the number of parameters and possible values. 2. Adding parameters that do not influence the performance does not decrease efficiency. Randomized grid search usually gives similar results to an exhaustive search, while the run time for randomized search is drastically lower. For more background, please refer to: sklearn: http://scikit-learn.org/stable/modules/grid_search.html http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/ http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/. There are two ways to implement this in Spark, as I see it: 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during build. Only 1 new public function is required. 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator and RandomizedTrainValidationSplit, which can be complicated since we need to deal with the models. I'd prefer option 1 as it's much simpler and straightforward. was: Randomized Grid Search implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search: 1. A budget can be chosen independent of the number of parameters and possible values. 2. Adding parameters that do not influence the performance does not decrease efficiency. 
Randomized grid search usually gives similar results to an exhaustive search, while the run time for randomized search is drastically lower. For more background, please refer to: sklearn: http://scikit-learn.org/stable/modules/grid_search.html http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/ http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/. There are two ways to implement this in Spark, as I see it: 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during build. 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator and RandomizedTrainValidationSplit. I'd prefer option 1 as it's much simpler and straightforward. > Add Randomized Grid Search to Spark ML > -- > > Key: SPARK-18755 > URL: https://issues.apache.org/jira/browse/SPARK-18755 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang > > Randomized Grid Search implements a randomized search over parameters, where > each setting is sampled from a distribution over possible parameter values. > This has two main benefits over an exhaustive search: > 1. A budget can be chosen independent of the number of parameters and > possible values. > 2. Adding parameters that do not influence the performance does not decrease > efficiency. > Randomized grid search usually gives similar results to an exhaustive search, > while the run time for randomized search is drastically lower. > For more background, please refer to: > sklearn: http://scikit-learn.org/stable/modules/grid_search.html > http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/ > http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf > https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/. 
> There are two ways to implement this in Spark, as I see it: > 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during > build. Only 1 new public function is required. > 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator > and RandomizedTrainValidationSplit, which can be complicated since we need to > deal with the models. > I'd prefer option 1 as it's much simpler and straightforward. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18671) Add tests to ensure stability of all Structured Streaming log formats
[ https://issues.apache.org/jira/browse/SPARK-18671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727365#comment-15727365 ] Apache Spark commented on SPARK-18671: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/16183 > Add tests to ensure stability of all Structured Streaming log formats > -- > > Key: SPARK-18671 > URL: https://issues.apache.org/jira/browse/SPARK-18671 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.1.0 > > > To be able to restart StreamingQueries across Spark versions, we have already > made the logs (offset log, file source log, file sink log) use json. We > should add tests with actual json files in Spark such that any > incompatible change in reading the logs is immediately caught. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
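The golden-file idea described above — check in a frozen log file and assert the current reader still parses it — can be sketched as follows. This is plain Python with a made-up record layout for illustration; the real Spark offset/sink log schemas differ and live in the checked-in resource files the JIRA proposes:

```python
import json

# A frozen copy of an old-format log entry, as it would be checked into the
# test resources. The field names here are illustrative, NOT Spark's schema.
GOLDEN_ENTRY = '{"version":1,"batchId":7,"offsets":["{\\"topic\\":\\"t\\"}"]}'


def read_offset_log(text):
    """Stand-in for the current log reader: parse a serialized entry and
    return the fields a restarted query would need."""
    rec = json.loads(text)
    return rec["version"], rec["batchId"], rec["offsets"]


# The compatibility test: if a code change renames or drops a field, this
# parse (or the assertions on its result) fails immediately.
version, batch_id, offsets = read_offset_log(GOLDEN_ENTRY)
```

The key design point is that the golden file is never regenerated by the code under test; it is a fixed artifact, so any incompatible change to the reader is caught rather than silently round-tripped.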
[jira] [Created] (SPARK-18755) Add Randomized Grid Search to Spark ML
yuhao yang created SPARK-18755: -- Summary: Add Randomized Grid Search to Spark ML Key: SPARK-18755 URL: https://issues.apache.org/jira/browse/SPARK-18755 Project: Spark Issue Type: Improvement Components: ML Reporter: yuhao yang Randomized Grid Search implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search: 1. A budget can be chosen independent of the number of parameters and possible values. 2. Adding parameters that do not influence the performance does not decrease efficiency. Randomized grid search usually gives similar results to an exhaustive search, while the run time for randomized search is drastically lower. For more background, please refer to: sklearn: http://scikit-learn.org/stable/modules/grid_search.html http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/ http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/. There are two ways to implement this in Spark, as I see it: 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during build. 2. Add a trait RandomizedSearch and create new classes RandomizedCrossValidator and RandomizedTrainValidationSplit. I'd prefer option 1 as it's much simpler and straightforward. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18754) Rename recentProgresses to recentProgress
[ https://issues.apache.org/jira/browse/SPARK-18754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18754: Assignee: Michael Armbrust (was: Apache Spark) > Rename recentProgresses to recentProgress > - > > Key: SPARK-18754 > URL: https://issues.apache.org/jira/browse/SPARK-18754 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Michael Armbrust > > An informal poll of a bunch of users found this name to be more clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18754) Rename recentProgresses to recentProgress
[ https://issues.apache.org/jira/browse/SPARK-18754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18754: Assignee: Apache Spark (was: Michael Armbrust) > Rename recentProgresses to recentProgress > - > > Key: SPARK-18754 > URL: https://issues.apache.org/jira/browse/SPARK-18754 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Apache Spark > > An informal poll of a bunch of users found this name to be more clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18754) Rename recentProgresses to recentProgress
[ https://issues.apache.org/jira/browse/SPARK-18754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727318#comment-15727318 ] Apache Spark commented on SPARK-18754: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/16182 > Rename recentProgresses to recentProgress > - > > Key: SPARK-18754 > URL: https://issues.apache.org/jira/browse/SPARK-18754 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Michael Armbrust > > An informal poll of a bunch of users found this name to be more clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18754) Rename recentProgresses to recentProgress
[ https://issues.apache.org/jira/browse/SPARK-18754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18754: - Target Version/s: 2.1.0 > Rename recentProgresses to recentProgress > - > > Key: SPARK-18754 > URL: https://issues.apache.org/jira/browse/SPARK-18754 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Michael Armbrust > > An informal poll of a bunch of users found this name to be more clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18754) Rename recentProgresses to recentProgress
Michael Armbrust created SPARK-18754:
----------------------------------------

             Summary: Rename recentProgresses to recentProgress
                 Key: SPARK-18754
                 URL: https://issues.apache.org/jira/browse/SPARK-18754
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
            Reporter: Michael Armbrust
            Assignee: Michael Armbrust

An informal poll of a bunch of users found this name to be more clear.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18697) Upgrade sbt plugins
[ https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18697: Assignee: Apache Spark (was: Weiqing Yang) > Upgrade sbt plugins > --- > > Key: SPARK-18697 > URL: https://issues.apache.org/jira/browse/SPARK-18697 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Assignee: Apache Spark >Priority: Trivial > > For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt > plugins will be upgraded: > {code} > sbt-assembly: 0.11.2 -> 0.14.3 > sbteclipse-plugin: 4.0.0 -> 5.0.1 > sbt-mima-plugin: 0.1.11 -> 0.1.12 > org.ow2.asm/asm: 5.0.3 -> 5.1 > org.ow2.asm/asm-commons: 5.0.3 -> 5.1 > {code} > All other plugins are up-to-date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18697) Upgrade sbt plugins
[ https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18697: Assignee: Weiqing Yang (was: Apache Spark) > Upgrade sbt plugins > --- > > Key: SPARK-18697 > URL: https://issues.apache.org/jira/browse/SPARK-18697 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Assignee: Weiqing Yang >Priority: Trivial > > For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt > plugins will be upgraded: > {code} > sbt-assembly: 0.11.2 -> 0.14.3 > sbteclipse-plugin: 4.0.0 -> 5.0.1 > sbt-mima-plugin: 0.1.11 -> 0.1.12 > org.ow2.asm/asm: 5.0.3 -> 5.1 > org.ow2.asm/asm-commons: 5.0.3 -> 5.1 > {code} > All other plugins are up-to-date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18697) Upgrade sbt plugins
[ https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18697: -- Fix Version/s: (was: 2.2.0) > Upgrade sbt plugins > --- > > Key: SPARK-18697 > URL: https://issues.apache.org/jira/browse/SPARK-18697 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Assignee: Weiqing Yang >Priority: Trivial > > For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt > plugins will be upgraded: > {code} > sbt-assembly: 0.11.2 -> 0.14.3 > sbteclipse-plugin: 4.0.0 -> 5.0.1 > sbt-mima-plugin: 0.1.11 -> 0.1.12 > org.ow2.asm/asm: 5.0.3 -> 5.1 > org.ow2.asm/asm-commons: 5.0.3 -> 5.1 > {code} > All other plugins are up-to-date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18734) Represent timestamp in StreamingQueryProgress as formatted string instead of millis
[ https://issues.apache.org/jira/browse/SPARK-18734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18734. -- Resolution: Fixed Fix Version/s: 2.1.0 > Represent timestamp in StreamingQueryProgress as formatted string instead of > millis > --- > > Key: SPARK-18734 > URL: https://issues.apache.org/jira/browse/SPARK-18734 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.1.0 > > > Easier to read when debugging -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-18697) Upgrade sbt plugins
[ https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-18697: --- I had to revert this because it didn't work with Scala 2.10 > Upgrade sbt plugins > --- > > Key: SPARK-18697 > URL: https://issues.apache.org/jira/browse/SPARK-18697 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Assignee: Weiqing Yang >Priority: Trivial > > For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt > plugins will be upgraded: > {code} > sbt-assembly: 0.11.2 -> 0.14.3 > sbteclipse-plugin: 4.0.0 -> 5.0.1 > sbt-mima-plugin: 0.1.11 -> 0.1.12 > org.ow2.asm/asm: 5.0.3 -> 5.1 > org.ow2.asm/asm-commons: 5.0.3 -> 5.1 > {code} > All other plugins are up-to-date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
[ https://issues.apache.org/jira/browse/SPARK-18752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18752: Assignee: Apache Spark > "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user > -- > > Key: SPARK-18752 > URL: https://issues.apache.org/jira/browse/SPARK-18752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > We ran into an issue with the HiveShim code that calls "loadTable" and > "loadPartition" while testing with some recent changes in upstream Hive. > The semantics in Hive changed slightly, and if you provide the wrong value > for "isSrcLocal" you now can end up with an invalid table: the Hive code will > move the temp directory to the final destination instead of moving its > children. > The problem in Spark is that HiveShim.scala tries to figure out the value of > "isSrcLocal" based on where the source and target directories are; that's not > correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA > LOCAL" would set it to "true"). So we need to propagate that information from > the user query down to HiveShim. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
[ https://issues.apache.org/jira/browse/SPARK-18752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18752: Assignee: (was: Apache Spark) > "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user > -- > > Key: SPARK-18752 > URL: https://issues.apache.org/jira/browse/SPARK-18752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Priority: Minor > > We ran into an issue with the HiveShim code that calls "loadTable" and > "loadPartition" while testing with some recent changes in upstream Hive. > The semantics in Hive changed slightly, and if you provide the wrong value > for "isSrcLocal" you now can end up with an invalid table: the Hive code will > move the temp directory to the final destination instead of moving its > children. > The problem in Spark is that HiveShim.scala tries to figure out the value of > "isSrcLocal" based on where the source and target directories are; that's not > correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA > LOCAL" would set it to "true"). So we need to propagate that information from > the user query down to HiveShim. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
[ https://issues.apache.org/jira/browse/SPARK-18752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727203#comment-15727203 ] Apache Spark commented on SPARK-18752: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/16179 > "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user > -- > > Key: SPARK-18752 > URL: https://issues.apache.org/jira/browse/SPARK-18752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Priority: Minor > > We ran into an issue with the HiveShim code that calls "loadTable" and > "loadPartition" while testing with some recent changes in upstream Hive. > The semantics in Hive changed slightly, and if you provide the wrong value > for "isSrcLocal" you now can end up with an invalid table: the Hive code will > move the temp directory to the final destination instead of moving its > children. > The problem in Spark is that HiveShim.scala tries to figure out the value of > "isSrcLocal" based on where the source and target directories are; that's not > correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA > LOCAL" would set it to "true"). So we need to propagate that information from > the user query down to HiveShim. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18753) Inconsistent behavior after writing to parquet files
[ https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727199#comment-15727199 ] Shixiong Zhu commented on SPARK-18753: -- cc [~liancheng] > Inconsistent behavior after writing to parquet files > > > Key: SPARK-18753 > URL: https://issues.apache.org/jira/browse/SPARK-18753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu > > Found an inconsistent behavior when using parquet. > {code} > scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: > java.lang.Boolean, new java.lang.Boolean(false)).toDS > ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] > scala> ds.filter('value === "true").show > +-+ > |value| > +-+ > +-+ > {code} > In the above example, `ds.filter('value === "true")` returns nothing as > "true" will be converted to null and the filter expression will be always > null, so it drops all rows. > However, if I store `ds` to a parquet file and read it back, `filter('value > === "true")` will return non null values. > {code} > scala> ds.write.parquet("testfile") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > scala> val ds2 = spark.read.parquet("testfile") > ds2: org.apache.spark.sql.DataFrame = [value: boolean] > scala> ds2.filter('value === "true").show > +-+ > |value| > +-+ > | true| > |false| > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18753) Inconsistent behavior after writing to parquet files
[ https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18753: - Description: Found an inconsistent behavior when using parquet. {code} scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: java.lang.Boolean, new java.lang.Boolean(false)).toDS ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] scala> ds.filter('value === "true").show +-+ |value| +-+ +-+ {code} In the above example, `ds.filter('value === "true")` returns nothing as "true" will be converted to null and the filter expression will be always null, so it drops all rows. However, if I store `ds` to a parquet file and read it back, `filter('value === "true")` will return non null values. {code} scala> ds.write.parquet("testfile") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. scala> val ds2 = spark.read.parquet("testfile") ds2: org.apache.spark.sql.DataFrame = [value: boolean] scala> ds2.filter('value === "true").show +-+ |value| +-+ | true| |false| +-+ {code} was: Found an inconsistent behavior when using parquet. {code} scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: java.lang.Boolean, new java.lang.Boolean(false)).toDS ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] scala> ds.filter('value === "true").show +-+ |value| +-+ +-+ {code} In the above example, `ds.filter('value === "true")` returns nothing as "true" will be converted to null and the filter expression will be always null. However, if I store `ds` to a parquet file and read it back, `filter('value === "true")` will return non null values. {code} scala> ds.write.parquet("testfile") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 
SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. scala> val ds2 = spark.read.parquet("testfile") ds2: org.apache.spark.sql.DataFrame = [value: boolean] scala> ds2.filter('value === "true").show +-+ |value| +-+ | true| |false| +-+ {code} > Inconsistent behavior after writing to parquet files > > > Key: SPARK-18753 > URL: https://issues.apache.org/jira/browse/SPARK-18753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu > > Found an inconsistent behavior when using parquet. > {code} > scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: > java.lang.Boolean, new java.lang.Boolean(false)).toDS > ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] > scala> ds.filter('value === "true").show > +-+ > |value| > +-+ > +-+ > {code} > In the above example, `ds.filter('value === "true")` returns nothing as > "true" will be converted to null and the filter expression will be always > null, so it drops all rows. > However, if I store `ds` to a parquet file and read it back, `filter('value > === "true")` will return non null values. > {code} > scala> ds.write.parquet("testfile") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > scala> val ds2 = spark.read.parquet("testfile") > ds2: org.apache.spark.sql.DataFrame = [value: boolean] > scala> ds2.filter('value === "true").show > +-+ > |value| > +-+ > | true| > |false| > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18753) Inconsistent behavior after writing to parquet files
[ https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18753: - Description: Found an inconsistent behavior when using parquet. {code} scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: java.lang.Boolean, new java.lang.Boolean(false)).toDS ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] scala> ds.filter('value === "true").show +-+ |value| +-+ +-+ {code} In the above example, `ds.filter('value === "true")` returns nothing as "true" will be converted to null and the filter expression will be always null. However, if I store `ds` to a parquet file and read it back, `filter('value === "true")` will return non null values. {code} scala> ds.write.parquet("testfile") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. scala> val ds2 = spark.read.parquet("testfile") ds2: org.apache.spark.sql.DataFrame = [value: boolean] scala> ds2.filter('value === "true").show +-+ |value| +-+ | true| |false| +-+ {code} was: Found an inconsistent behavior when using parquet. {code} scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: java.lang.Boolean, new java.lang.Boolean(false)).toDS ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] scala> ds.filter('value === "true").show +-+ |value| +-+ +-+ {code} In the above example, `ds.filter('value === "true")` returns nothing as "true" will be converted to null and the filter expression will always null. However, if I store `ds` to a parquet file and read it back, `filter('value === "true")` will return non null values. {code} scala> ds.write.parquet("testfile") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 
SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. scala> val ds2 = spark.read.parquet("testfile") ds2: org.apache.spark.sql.DataFrame = [value: boolean] scala> ds2.filter('value === "true").show +-+ |value| +-+ | true| |false| +-+ {code} > Inconsistent behavior after writing to parquet files > > > Key: SPARK-18753 > URL: https://issues.apache.org/jira/browse/SPARK-18753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu > > Found an inconsistent behavior when using parquet. > {code} > scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: > java.lang.Boolean, new java.lang.Boolean(false)).toDS > ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] > scala> ds.filter('value === "true").show > +-+ > |value| > +-+ > +-+ > {code} > In the above example, `ds.filter('value === "true")` returns nothing as > "true" will be converted to null and the filter expression will be always > null. > However, if I store `ds` to a parquet file and read it back, `filter('value > === "true")` will return non null values. > {code} > scala> ds.write.parquet("testfile") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > scala> val ds2 = spark.read.parquet("testfile") > ds2: org.apache.spark.sql.DataFrame = [value: boolean] > scala> ds2.filter('value === "true").show > +-+ > |value| > +-+ > | true| > |false| > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18753) Inconsistent behavior after writing to parquet files
[ https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18753: - Description: Found an inconsistent behavior when using parquet. {code} scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: java.lang.Boolean, new java.lang.Boolean(false)).toDS ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] scala> ds.filter('value === "true").show +-+ |value| +-+ +-+ {code} In the above example, `ds.filter('value === "true")` returns nothing as "true" will be converted to null and the filter expression will always null. However, if I store `ds` to a parquet file and read it back, `filter('value === "true")` will return non null values. {code} scala> ds.write.parquet("testfile") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. scala> val ds2 = spark.read.parquet("testfile") ds2: org.apache.spark.sql.DataFrame = [value: boolean] scala> ds2.filter('value === "true").show +-+ |value| +-+ | true| |false| +-+ {code} was: Found an inconsistent behavior when using parquet. {code} scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: java.lang.Boolean, new java.lang.Boolean(false)).toDS ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] scala> ds.filter('value === "true").show +-+ |value| +-+ +-+ {code} In the avoid example, `ds.filter('value === "true")` returns nothing as "true" will be converted to null and the filter expression will always null. However, if I store `ds` to a parquet file and read it back, `filter('value === "true")` will return non null values. {code} scala> ds.write.parquet("testfile") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 
SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. scala> val ds2 = spark.read.parquet("testfile") ds2: org.apache.spark.sql.DataFrame = [value: boolean] scala> ds2.filter('value === "true").show +-+ |value| +-+ | true| |false| +-+ {code} > Inconsistent behavior after writing to parquet files > > > Key: SPARK-18753 > URL: https://issues.apache.org/jira/browse/SPARK-18753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu > > Found an inconsistent behavior when using parquet. > {code} > scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: > java.lang.Boolean, new java.lang.Boolean(false)).toDS > ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean] > scala> ds.filter('value === "true").show > +-+ > |value| > +-+ > +-+ > {code} > In the above example, `ds.filter('value === "true")` returns nothing as > "true" will be converted to null and the filter expression will always null. > However, if I store `ds` to a parquet file and read it back, `filter('value > === "true")` will return non null values. > {code} > scala> ds.write.parquet("testfile") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > scala> val ds2 = spark.read.parquet("testfile") > ds2: org.apache.spark.sql.DataFrame = [value: boolean] > scala> ds2.filter('value === "true").show > +-+ > |value| > +-+ > | true| > |false| > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18753) Inconsistent behavior after writing to parquet files
Shixiong Zhu created SPARK-18753:
------------------------------------

             Summary: Inconsistent behavior after writing to parquet files
                 Key: SPARK-18753
                 URL: https://issues.apache.org/jira/browse/SPARK-18753
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.2, 2.1.0
            Reporter: Shixiong Zhu

Found an inconsistent behavior when using parquet.

{code}
scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: java.lang.Boolean, new java.lang.Boolean(false)).toDS
ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]

scala> ds.filter('value === "true").show
+-----+
|value|
+-----+
+-----+
{code}

In the above example, `ds.filter('value === "true")` returns nothing, as "true" will be converted to null and the filter expression will therefore always be null.

However, if I store `ds` to a parquet file and read it back, `filter('value === "true")` will return non-null values.

{code}
scala> ds.write.parquet("testfile")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

scala> val ds2 = spark.read.parquet("testfile")
ds2: org.apache.spark.sql.DataFrame = [value: boolean]

scala> ds2.filter('value === "true").show
+-----+
|value|
+-----+
| true|
|false|
+-----+
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
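To make the reported asymmetry concrete, the following is a small, hypothetical Python model of the two code paths described in the report. It does not use Spark at all; it only mimics the observed behavior, where the in-memory path turns the string literal into a null predicate (dropping every row), while the Parquet read path ends up keeping every non-null row:

```python
# Hypothetical model of SPARK-18753 (not Spark code): the same filter
# behaves differently before and after a parquet round-trip.
rows = [True, None, False]

def filter_in_memory(rows):
    # In-memory path, as described in the report: the string literal
    # "true" fails to coerce, the predicate is null for every row,
    # and a null predicate filters the row out.
    literal = None  # failed coercion of "true"
    return [v for v in rows if literal is not None and v == literal]

def filter_after_parquet(rows):
    # Parquet read path, as shown in the report's output: the filter
    # effectively keeps every non-null row (both true and false).
    return [v for v in rows if v is not None]

assert filter_in_memory(rows) == []
assert filter_after_parquet(rows) == [True, False]
```

The point of the model is only that the two paths disagree on identical input, which is the inconsistency the issue is about.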
[jira] [Resolved] (SPARK-18662) Move cluster managers into their own sub-directory
[ https://issues.apache.org/jira/browse/SPARK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-18662. Resolution: Fixed Assignee: Anirudh Ramanathan Fix Version/s: 2.2.0 > Move cluster managers into their own sub-directory > -- > > Key: SPARK-18662 > URL: https://issues.apache.org/jira/browse/SPARK-18662 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Anirudh Ramanathan >Assignee: Anirudh Ramanathan >Priority: Minor > Fix For: 2.2.0 > > > As we move to support Kubernetes in addition to Yarn and Mesos > (https://issues.apache.org/jira/browse/SPARK-18278), we should move all the > cluster managers into a "resource-managers/" sub-directory. This is simply a > reorganization. > Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17838) Strict type checking for arguments with a better messages across APIs.
[ https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reopened SPARK-17838: -- Assignee: (was: Hyukjin Kwon) Re-open as per discussion in PR. > Strict type checking for arguments with a better messages across APIs. > -- > > Key: SPARK-17838 > URL: https://issues.apache.org/jira/browse/SPARK-17838 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Hyukjin Kwon > Fix For: 2.2.0 > > > It seems there should be more strict type checking for arguments in SparkR > APIs. This was discussed in several PRs. > https://github.com/apache/spark/pull/15239#discussion_r82445435 > Roughly it seems there are three cases as below: > The first case below was described in > https://github.com/apache/spark/pull/15239#discussion_r82445435 > - Check for {{zero-length variable name}} > Some of other cases below were handled in > https://github.com/apache/spark/pull/15231#discussion_r80417904 > - Catch the exception from JVM and format it as pretty > - Check strictly types before calling JVM in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used
[ https://issues.apache.org/jira/browse/SPARK-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18171. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15684 [https://github.com/apache/spark/pull/15684] > Show correct framework address in mesos master web ui when the advertised > address is used > - > > Key: SPARK-18171 > URL: https://issues.apache.org/jira/browse/SPARK-18171 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Shuai Lin >Assignee: Shuai Lin >Priority: Minor > Fix For: 2.2.0 > > > In [[SPARK-4563]] we added the support for the driver to advertise a > different hostname/ip ({{spark.driver.host}} to the executors other than the > hostname/ip the driver actually binds to ({{spark.driver.bindAddress}}). But > in the mesos webui's frameworks page, it still shows the driver's binds > hostname/ip (though the web ui link is correct). We should fix it to make > them consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used
[ https://issues.apache.org/jira/browse/SPARK-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18171: -- Assignee: Shuai Lin > Show correct framework address in mesos master web ui when the advertised > address is used > - > > Key: SPARK-18171 > URL: https://issues.apache.org/jira/browse/SPARK-18171 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Shuai Lin >Assignee: Shuai Lin >Priority: Minor > Fix For: 2.2.0 > > > In [[SPARK-4563]] we added the support for the driver to advertise a > different hostname/ip ({{spark.driver.host}} to the executors other than the > hostname/ip the driver actually binds to ({{spark.driver.bindAddress}}). But > in the mesos webui's frameworks page, it still shows the driver's binds > hostname/ip (though the web ui link is correct). We should fix it to make > them consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
Marcelo Vanzin created SPARK-18752:
--------------------------------------

             Summary: "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
                 Key: SPARK-18752
                 URL: https://issues.apache.org/jira/browse/SPARK-18752
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Marcelo Vanzin
            Priority: Minor

We ran into an issue with the HiveShim code that calls "loadTable" and "loadPartition" while testing with some recent changes in upstream Hive.

The semantics in Hive changed slightly, and if you provide the wrong value for "isSrcLocal" you now can end up with an invalid table: the Hive code will move the temp directory to the final destination instead of moving its children.

The problem in Spark is that HiveShim.scala tries to figure out the value of "isSrcLocal" based on where the source and target directories are; that's not correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA LOCAL" would set it to "true"). So we need to propagate that information from the user query down to HiveShim.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
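The direction proposed above ("set the flag from the user query, not from path inspection") can be sketched in a few lines. This is a hypothetical illustration, not the actual Spark parser or HiveShim code; the function name and the token check are made up for the example:

```python
def is_src_local(load_statement: str) -> bool:
    """Hypothetical: decide isSrcLocal from the user's query text
    (the LOCAL keyword in LOAD DATA LOCAL ...) rather than by
    inspecting where the source and target directories live."""
    tokens = load_statement.strip().upper().split()
    return tokens[:3] == ["LOAD", "DATA", "LOCAL"]

# LOCAL keyword present -> isSrcLocal should be true
assert is_src_local("LOAD DATA LOCAL INPATH '/tmp/part' INTO TABLE t")
# No LOCAL keyword -> isSrcLocal should be false
assert not is_src_local("LOAD DATA INPATH 'hdfs:///tmp/part' INTO TABLE t")
```

The value computed at parse time would then be threaded through to the loadTable / loadPartition calls, instead of being re-derived from filesystem locations.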
[jira] [Closed] (SPARK-18741) Reuse/Explicitly clean-up SparkContext in Streaming tests
[ https://issues.apache.org/jira/browse/SPARK-18741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell closed SPARK-18741.
-------------------------------------
    Resolution: Not A Problem

> Reuse/Explicitly clean-up SparkContext in Streaming tests
> ---------------------------------------------------------
>
>                 Key: SPARK-18741
>                 URL: https://issues.apache.org/jira/browse/SPARK-18741
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Herman van Hovell
>
> Tests in SparkStreaming currently create a SparkContext for each test, and sometimes do not clean up afterwards. This is resource-intensive and it can lead to unneeded test failures (flakiness) when {{spark.driver.allowMultipleContexts}} is disabled.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator
[ https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726986#comment-15726986 ] Alex Levenson commented on SPARK-18728: --- I think my comment above lists some concrete benefits. Algebird is a very light dependency, and if you see anything wrong with its (small) set of transitive dependencies I think we'd be open to figuring out how to fix those sorts of issues. > Consider using Algebird's Aggregator instead of > org.apache.spark.sql.expressions.Aggregator > --- > > Key: SPARK-18728 > URL: https://issues.apache.org/jira/browse/SPARK-18728 > Project: Spark > Issue Type: Improvement >Reporter: Alex Levenson >Priority: Minor > > Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in > Spark's Aggregator here: > "Based loosely on Aggregator from Algebird: > https://github.com/twitter/algebird" > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46 > Which got a few of us wondering: given that this API is still experimental, > would you consider using Algebird's Aggregator API directly instead? > The Algebird API is not coupled with any implementation details, and > shouldn't have any extra dependencies. > Are there any blockers to doing that? > Thanks! > Alex
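The Aggregator shape under discussion can be sketched abstractly: map each input to an intermediate value, merge intermediates with an associative operation, then post-process the result. This is a minimal Python sketch of that shape, not the real Algebird or Spark API (those are Scala, with different method names and typeclass machinery):

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, TypeVar

A = TypeVar("A")  # input type
B = TypeVar("B")  # intermediate (mergeable) type
C = TypeVar("C")  # result type

@dataclass
class Aggregator(Generic[A, B, C]):
    """Sketch of the prepare/combine/present decomposition: 'combine'
    must be associative so partial results can be merged in any order
    (which is what lets such an aggregator run distributed)."""
    prepare: Callable[[A], B]
    combine: Callable[[B, B], B]
    present: Callable[[B], C]

    def apply(self, xs):
        return self.present(reduce(self.combine, (self.prepare(x) for x in xs)))

# Example: averaging as (sum, count) pairs merged pairwise.
avg = Aggregator(
    prepare=lambda x: (x, 1),
    combine=lambda a, b: (a[0] + b[0], a[1] + b[1]),
    present=lambda acc: acc[0] / acc[1],
)
print(avg.apply([1, 2, 3, 4]))  # 2.5
```

Because the decomposition carries no execution details, the same three functions describe both a local fold and a distributed merge, which is the decoupling the comment argues for.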
[jira] [Assigned] (SPARK-18751) Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext
[ https://issues.apache.org/jira/browse/SPARK-18751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18751: Assignee: Shixiong Zhu (was: Apache Spark) > Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext > > > Key: SPARK-18751 > URL: https://issues.apache.org/jira/browse/SPARK-18751 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.0.2 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > When SparkContext.stop is called in Utils.tryOrStopSparkContext (the > following three places), it will cause a deadlock because the stop method needs > to wait for the thread running stop to exit. > - ContextCleaner.keepCleaning > - LiveListenerBus.listenerThread.run > - TaskSchedulerImpl.start
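The hazard described above is a self-join: stop() waits for a worker thread to exit, so stop() called *from* that worker thread waits on itself. A toy Python model (not Spark code; `Bus` and its members are hypothetical) shows the guard that avoids it. Note that CPython raises RuntimeError on a self-join rather than hanging, but the guard illustrates the same pattern:

```python
import threading

class Bus:
    """Toy model of the shutdown hazard: stop() joins the dispatch
    thread, so calling stop() from inside that thread must not join.
    The current-thread check skips the join in that case."""
    def __init__(self):
        self._stop_evt = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        # Simulate the dispatch loop hitting a fatal error and
        # stopping the bus from inside its own thread.
        self.stop()

    def stop(self):
        self._stop_evt.set()
        if threading.current_thread() is not self._thread:
            # Safe: some *other* thread is stopping us, so waiting
            # for the dispatch thread cannot wait on ourselves.
            self._thread.join()

bus = Bus()
bus._thread.join(timeout=2)
print(bus._thread.is_alive())  # False: shutdown completed without self-join
```

An alternative fix with the same effect is to hand the stop() call off to a fresh thread, so the waiting thread is never the one being waited for.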
[jira] [Commented] (SPARK-18751) Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext
[ https://issues.apache.org/jira/browse/SPARK-18751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726963#comment-15726963 ] Apache Spark commented on SPARK-18751: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16178 > Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext > > > Key: SPARK-18751 > URL: https://issues.apache.org/jira/browse/SPARK-18751 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.0.2 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > When SparkContext.stop is called in Utils.tryOrStopSparkContext (the > following three places), it will cause a deadlock because the stop method needs > to wait for the thread running stop to exit. > - ContextCleaner.keepCleaning > - LiveListenerBus.listenerThread.run > - TaskSchedulerImpl.start
[jira] [Assigned] (SPARK-18751) Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext
[ https://issues.apache.org/jira/browse/SPARK-18751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18751: Assignee: Apache Spark (was: Shixiong Zhu) > Deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext > > > Key: SPARK-18751 > URL: https://issues.apache.org/jira/browse/SPARK-18751 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.0.2 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > When SparkContext.stop is called in Utils.tryOrStopSparkContext (the > following three places), it will cause a deadlock because the stop method needs > to wait for the thread running stop to exit. > - ContextCleaner.keepCleaning > - LiveListenerBus.listenerThread.run > - TaskSchedulerImpl.start