[jira] [Closed] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"
[ https://issues.apache.org/jira/browse/SPARK-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian closed SPARK-3738.
-----------------------------
    Resolution: Invalid

False alarm; it's caused by Hive's default SerDe, which uses '\n' as the record delimiter.

> InsertIntoHiveTable can't handle strings with "\n"
> --------------------------------------------------
>
> Key: SPARK-3738
> URL: https://issues.apache.org/jira/browse/SPARK-3738
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: Cheng Lian
> Priority: Blocker
>
> Try the following snippet in {{sbt/sbt hive/console}} to reproduce:
> {code}
> sql("drop table if exists z")
> case class Str(s: String)
> sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z")
> table("z").count()
> {code}
> The expected result is 1, but 2 is returned instead.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
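The "invalid" resolution can be illustrated without Spark or Hive at all: a line-oriented reader splits stored text on '\n', so one logical row containing a newline reads back as two records. A minimal plain-Scala sketch (the object and method names are illustrative, not Spark/Hive API):

```scala
// Hive's default SerDe delimits records with '\n', so a stored string that
// itself contains a newline is read back as two rows. Plain-Scala model:
object NewlineDelimiterDemo {
  // Model of a line-oriented reader: split the stored bytes on '\n'.
  def recordsOf(storedValue: String): Seq[String] =
    storedValue.split("\n", -1).toSeq

  def main(args: Array[String]): Unit = {
    val written  = "a\nb"               // one logical row
    val readBack = recordsOf(written)   // what the text-file reader sees
    assert(readBack.size == 2)          // hence count() returns 2, not 1
    println(readBack)
  }
}
```

This is why the reproduction's `table("z").count()` returns 2: the write succeeded, but the storage format cannot distinguish an embedded newline from a record boundary.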
[jira] [Reopened] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment
[ https://issues.apache.org/jira/browse/SPARK-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or reopened SPARK-3734:
------------------------------

> DriverRunner should not read SPARK_HOME from submitter's environment
> --------------------------------------------------------------------
>
> Key: SPARK-3734
> URL: https://issues.apache.org/jira/browse/SPARK-3734
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.1.0, 1.2.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Fix For: 1.1.1, 1.2.0
>
> If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the submitting machine, then DriverRunner will attempt to use the _submitter's_ JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), which can cause the job to fail unless the submitter and worker have Java installed in the same location.
> This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of {{command.environment}}; PR pending shortly.
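The fix described above (prefer the worker's own environment over the environment captured in the submitted command) can be sketched in a few lines. This is a simplified model, not DriverRunner's actual code; the names are illustrative:

```scala
// Sketch of the SPARK-3734 fix: the submitter's env vars travel with the
// driver command, but the worker must resolve JAVA_HOME locally.
object JavaHomeResolution {
  // Before the fix: JAVA_HOME was looked up in commandEnv (submitter's env).
  // After the fix: only the worker's local environment (sys.env) is consulted.
  def resolveJavaHome(commandEnv: Map[String, String],
                      workerEnv: Map[String, String]): Option[String] =
    workerEnv.get("JAVA_HOME")

  def main(args: Array[String]): Unit = {
    val submitterEnv = Map("JAVA_HOME" -> "/submitter/jdk") // captured at submit time
    val workerEnv    = Map("JAVA_HOME" -> "/worker/jdk")    // the worker's sys.env
    // The driver must be launched with the worker's JDK, not the submitter's.
    assert(resolveJavaHome(submitterEnv, workerEnv).contains("/worker/jdk"))
  }
}
```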
[jira] [Closed] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment
[ https://issues.apache.org/jira/browse/SPARK-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-3734.
----------------------------
    Resolution: Fixed
    Fix Version/s: 1.2.0
                   1.1.1
    Target Version/s: 1.1.1, 1.2.0

> DriverRunner should not read SPARK_HOME from submitter's environment
> --------------------------------------------------------------------
>
> Key: SPARK-3734
> URL: https://issues.apache.org/jira/browse/SPARK-3734
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.1.0, 1.2.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Fix For: 1.1.1, 1.2.0
>
> If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the submitting machine, then DriverRunner will attempt to use the _submitter's_ JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), which can cause the job to fail unless the submitter and worker have Java installed in the same location.
> This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of {{command.environment}}; PR pending shortly.
[jira] [Closed] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment
[ https://issues.apache.org/jira/browse/SPARK-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-3734.
----------------------------
    Resolution: Fixed

> DriverRunner should not read SPARK_HOME from submitter's environment
> --------------------------------------------------------------------
>
> Key: SPARK-3734
> URL: https://issues.apache.org/jira/browse/SPARK-3734
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.1.0, 1.2.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
>
> If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the submitting machine, then DriverRunner will attempt to use the _submitter's_ JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), which can cause the job to fail unless the submitter and worker have Java installed in the same location.
> This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of {{command.environment}}; PR pending shortly.
[jira] [Commented] (SPARK-3709) Executors don't always report broadcast block removal properly back to the driver
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152845#comment-14152845 ]

Apache Spark commented on SPARK-3709:
-------------------------------------

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2591

> Executors don't always report broadcast block removal properly back to the driver
> ---------------------------------------------------------------------------------
>
> Key: SPARK-3709
> URL: https://issues.apache.org/jira/browse/SPARK-3709
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Reynold Xin
> Priority: Blocker
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152839#comment-14152839 ]

Aaron Davidson commented on SPARK-1860:
---------------------------------------

The Executor could clean up its own jars when it terminates normally; that seems fine. The impact of this seems limited, though, and it's a good idea to limit the scope of shutdown hooks as much as possible. There are three classes of things to delete:

1. Shuffle files / block manager blocks -- large -- deleted by graceful Executor termination. Can be deleted immediately.
2. Uploaded jars / files -- usually small -- deleted by Worker cleanup. Can be deleted immediately.
3. Logs -- small to medium -- deleted by Worker cleanup. Should not be deleted immediately.

Number 1 is most critical in terms of impact on the system. Numbers 2 and 3 are of the same order of magnitude in size, so cleaning up 2 and not 3 is not expected to improve the system's stability by more than a factor of roughly 2x.

Note that the intentions of this particular JIRA are very simple: clean up 2 and 3 for all executors several days after they have terminated, rather than after they have started. If you wish to expand the scope of the Worker or Executor cleanup, that should be covered in a separate JIRA (which is welcome -- I just want to make sure we're on the same page about this particular issue!).

> Standalone Worker cleanup should not clean up running executors
> ---------------------------------------------------------------
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.0.0
> Reporter: Aaron Davidson
> Priority: Blocker
>
> The default values of the standalone worker cleanup code clean up all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard.
> Executors' log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default.
[jira] [Updated] (SPARK-3740) Use a compressed bitmap to track zero sized blocks in HighlyCompressedMapStatus
[ https://issues.apache.org/jira/browse/SPARK-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-3740:
-------------------------------
    Labels: starter (was: )

> Use a compressed bitmap to track zero sized blocks in HighlyCompressedMapStatus
> -------------------------------------------------------------------------------
>
> Key: SPARK-3740
> URL: https://issues.apache.org/jira/browse/SPARK-3740
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Reporter: Reynold Xin
> Labels: starter
>
> HighlyCompressedMapStatus uses a single long to track the average block size. However, if a stage has a lot of zero sized outputs, this leads to inefficiency because executors would need to send requests to fetch zero sized blocks.
> We can use a compressed bitmap to track the zero-sized blocks.
> See discussion in https://github.com/apache/spark/pull/2470
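The proposal above can be sketched with a plain bitmap: keep one average size for non-empty blocks, and a bit per reduce partition marking the empty ones so reducers skip those fetches entirely. This is an illustrative model only (the eventual change discussed in PR 2470 concerns Spark's real HighlyCompressedMapStatus; `java.util.BitSet` stands in here for a compressed bitmap):

```scala
import java.util.BitSet

// Sketch of the SPARK-3740 idea: average size for non-empty blocks plus a
// bitmap of empty blocks, instead of either a full size array (O(R) per map)
// or a lone average that makes reducers fetch zero-sized blocks.
object ZeroBlockTracking {
  final case class CompressedStatus(avgSize: Long, emptyBlocks: BitSet) {
    // Reducers consult this before fetching: 0 means "skip the request".
    def sizeOf(reduceId: Int): Long =
      if (emptyBlocks.get(reduceId)) 0L else avgSize
  }

  def compress(sizes: Array[Long]): CompressedStatus = {
    val empty = new BitSet(sizes.length)
    sizes.zipWithIndex.foreach { case (s, i) => if (s == 0L) empty.set(i) }
    val nonEmpty = sizes.filter(_ > 0L)
    val avg = if (nonEmpty.isEmpty) 0L else nonEmpty.sum / nonEmpty.length
    CompressedStatus(avg, empty)
  }

  def main(args: Array[String]): Unit = {
    val status = compress(Array(100L, 0L, 300L, 0L))
    assert(status.sizeOf(1) == 0L)    // empty block: no fetch request sent
    assert(status.sizeOf(0) == 200L)  // non-empty: average of 100 and 300
  }
}
```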
[jira] [Created] (SPARK-3740) Use a compressed bitmap to track zero sized blocks in HighlyCompressedMapStatus
Reynold Xin created SPARK-3740:
-------------------------------

Summary: Use a compressed bitmap to track zero sized blocks in HighlyCompressedMapStatus
Key: SPARK-3740
URL: https://issues.apache.org/jira/browse/SPARK-3740
Project: Spark
Issue Type: Improvement
Components: Shuffle, Spark Core
Reporter: Reynold Xin

HighlyCompressedMapStatus uses a single long to track the average block size. However, if a stage has a lot of zero sized outputs, this leads to inefficiency because executors would need to send requests to fetch zero sized blocks.

We can use a compressed bitmap to track the zero-sized blocks.

See discussion in https://github.com/apache/spark/pull/2470
[jira] [Resolved] (SPARK-3613) Don't record the size of each shuffle block for large jobs
[ https://issues.apache.org/jira/browse/SPARK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-3613.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.2.0

> Don't record the size of each shuffle block for large jobs
> ----------------------------------------------------------
>
> Key: SPARK-3613
> URL: https://issues.apache.org/jira/browse/SPARK-3613
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Fix For: 1.2.0
>
> MapStatus saves the size of each block (1 byte per block) for a particular map task. This actually means the shuffle metadata is O(M*R), where M = num maps and R = num reduces.
> If M is greater than a certain size, we should probably just send an average size instead of a whole array.
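The O(M*R) argument above is easy to make concrete with arithmetic: per-block sizes cost M*R bytes of driver-side metadata (1 byte per block), while an average-only status costs on the order of one long per map task. A small sketch under those stated assumptions (illustrative constants, not Spark's exact encoding):

```scala
// Back-of-the-envelope metadata cost for MapStatus, per SPARK-3613:
// per-block sizes are 1 byte per block => M*R bytes total, while an
// average-only status is roughly one 8-byte long per map task.
object ShuffleMetadataCost {
  def perBlockBytes(maps: Int, reduces: Int): Long = maps.toLong * reduces
  def averagedBytes(maps: Int): Long = maps.toLong * 8L

  def main(args: Array[String]): Unit = {
    val (m, r) = (10000, 10000)
    assert(perBlockBytes(m, r) == 100000000L) // ~100 MB of metadata
    assert(averagedBytes(m) == 80000L)        // vs ~80 KB with averages
  }
}
```

For a 10,000-map, 10,000-reduce stage this is the difference between roughly 100 MB and 80 KB of map-output metadata, which is why the fix switches to an average once M crosses a threshold.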
[jira] [Commented] (SPARK-3654) Implement all extended HiveQL statements/commands with a separate parser combinator
[ https://issues.apache.org/jira/browse/SPARK-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152790#comment-14152790 ]

Apache Spark commented on SPARK-3654:
-------------------------------------

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/2590

> Implement all extended HiveQL statements/commands with a separate parser combinator
> -----------------------------------------------------------------------------------
>
> Key: SPARK-3654
> URL: https://issues.apache.org/jira/browse/SPARK-3654
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: Cheng Lian
>
> Statements and commands like {{SET}}, {{CACHE TABLE}} and {{ADD JAR}} etc. are currently parsed in a quite hacky way, like this:
> {code}
> if (sql.trim.toLowerCase.startsWith("cache table")) {
>   sql.trim.toLowerCase.startsWith("cache table") match {
>     ...
>   }
> }
> {code}
> It would be much better to add an extra parser combinator that parses these syntax extensions first, and then falls back to the normal Hive parser.
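The "try the extension syntax first, then fall back to the normal Hive parser" structure can be sketched in plain Scala. This is a toy model, not the actual PR: regexes stand in for parser combinators, and `HivePlan` stands in for handing the string to Hive's real parser; all names here are hypothetical:

```scala
// Sketch of the SPARK-3654 structure: parse Spark's extension commands
// (SET, CACHE TABLE, ...) first, and fall back to the Hive parser otherwise.
object ExtendedSqlParser {
  sealed trait Plan
  final case class SetCommand(kv: String)   extends Plan
  final case class CacheTable(name: String) extends Plan
  final case class HivePlan(sql: String)    extends Plan // "delegate to Hive"

  // Regex stand-ins for the extension grammar; real code would use
  // parser combinators so the grammar composes and reports errors cleanly.
  private val SetPat   = """(?i)\s*SET\s+(.+)""".r
  private val CachePat = """(?i)\s*CACHE\s+TABLE\s+(\w+)\s*""".r

  def parse(sql: String): Plan = sql match {
    case SetPat(kv)     => SetCommand(kv.trim)
    case CachePat(name) => CacheTable(name)
    case other          => HivePlan(other) // fallback: normal Hive parsing
  }

  def main(args: Array[String]): Unit = {
    assert(parse("CACHE TABLE z") == CacheTable("z"))
    assert(parse("SELECT 1") == HivePlan("SELECT 1"))
  }
}
```

The point of the design is that each extension is one declarative production rather than a chain of `startsWith` checks, and anything the extension grammar rejects flows unchanged to Hive.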
[jira] [Commented] (SPARK-3654) Implement all extended HiveQL statements/commands with a separate parser combinator
[ https://issues.apache.org/jira/browse/SPARK-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152789#comment-14152789 ]

Ravindra Pesala commented on SPARK-3654:
----------------------------------------

https://github.com/apache/spark/pull/2590

> Implement all extended HiveQL statements/commands with a separate parser combinator
> -----------------------------------------------------------------------------------
>
> Key: SPARK-3654
> URL: https://issues.apache.org/jira/browse/SPARK-3654
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: Cheng Lian
>
> Statements and commands like {{SET}}, {{CACHE TABLE}} and {{ADD JAR}} etc. are currently parsed in a quite hacky way, like this:
> {code}
> if (sql.trim.toLowerCase.startsWith("cache table")) {
>   sql.trim.toLowerCase.startsWith("cache table") match {
>     ...
>   }
> }
> {code}
> It would be much better to add an extra parser combinator that parses these syntax extensions first, and then falls back to the normal Hive parser.
[jira] [Updated] (SPARK-3709) Executors don't always report broadcast block removal properly back to the driver
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-3709:
-------------------------------
    Target Version/s: 1.1.1, 1.2.0 (was: 1.2.0)

> Executors don't always report broadcast block removal properly back to the driver
> ---------------------------------------------------------------------------------
>
> Key: SPARK-3709
> URL: https://issues.apache.org/jira/browse/SPARK-3709
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Reynold Xin
> Priority: Blocker
[jira] [Updated] (SPARK-3709) Executors don't always report broadcast block removal properly back to the driver
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-3709:
-------------------------------
    Summary: Executors don't always report broadcast block removal properly back to the driver
    (was: BroadcastSuite.Unpersisting rg.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky)

> Executors don't always report broadcast block removal properly back to the driver
> ---------------------------------------------------------------------------------
>
> Key: SPARK-3709
> URL: https://issues.apache.org/jira/browse/SPARK-3709
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Reynold Xin
> Priority: Blocker
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-3568:
---------------------------------
    Priority: Major (was: Minor)
    Target Version/s: 1.2.0
    Shepherd: Xiangrui Meng

> Add metrics for ranking algorithms
> ----------------------------------
>
> Key: SPARK-3568
> URL: https://issues.apache.org/jira/browse/SPARK-3568
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Reporter: Shuo Xiang
> Assignee: Shuo Xiang
>
> Include widely-used metrics for ranking algorithms, including:
> - Mean Average Precision
> - Precision@n: top-n precision
> - Discounted cumulative gain (DCG) and NDCG
> This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will be implemented:
> - *averagePrecision(position: Int): Double* — this is the precision@position
> - *meanAveragePrecision*: the average of precision@n for all values of n
> - *ndcg*
[jira] [Commented] (SPARK-3709) BroadcastSuite.Unpersisting rg.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152774#comment-14152774 ]

Reynold Xin commented on SPARK-3709:
------------------------------------

Hanging driver stack trace:

{code}
"pool-1-thread-1-ScalaTest-running-BroadcastSuite" prio=10 tid=0x7f2114812000 nid=0xc8c in Object.wait() [0x7f20bb8fd000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
        - locked <0x0007a2ff4bb8> (a org.apache.spark.scheduler.JobWaiter)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:512)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1087)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1104)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1118)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1132)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:775)
        at org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:291)
        at org.apache.spark.broadcast.BroadcastSuite.org$apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:232)
        at org.apache.spark.broadcast.BroadcastSuite$$anonfun$13.apply$mcV$sp(BroadcastSuite.scala:112)
        at org.apache.spark.broadcast.BroadcastSuite$$anonfun$13.apply(BroadcastSuite.scala:112)
        at org.apache.spark.broadcast.BroadcastSuite$$anonfun$13.apply(BroadcastSuite.scala:112)
        at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
        at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
        at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
        at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
        at org.scalatest.Transformer.apply(Transformer.scala:22)
        at org.scalatest.Transformer.apply(Transformer.scala:20)
        at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
        at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
        at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
        at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
        at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
        at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
        at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
        at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
        at org.apache.spark.broadcast.BroadcastSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(BroadcastSuite.scala:26)
        at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
        at org.apache.spark.broadcast.BroadcastSuite.runTest(BroadcastSuite.scala:26)
        ...
{code}

Executor log:

{code}
14/09/29 20:35:57.254 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
14/09/29 20:35:57.502 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/09/29 20:35:57.716 INFO SecurityManager: Changing view acls to: root
14/09/29 20:35:57.717 INFO SecurityManager: Changing modify acls to: root
14/09/29 20:35:57.717 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
14/09/29 20:35:58.096 INFO Slf4jLogger: Slf4jLogger started
14/09/29 20:35:58.136 INFO Remoting: Starting remoting
14/09/29 20:35:58.279 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@localhost:42339]
14/09/29 20:35:58.280 INFO Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher@localhost:42339]
14/09/29 20:35:58.287 INFO Utils: Successfully started service 'driverPropsFetcher' on port 42339.
14/09/29 20:35:58.461 INFO SecurityManager: Changing view acls to: root
14/09/29 20:35:58.461 INFO SecurityManager: Changing modify acls to: root
14/09/29 20:35:58.462 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
14/09/29 20:35:58.466 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/09/29 20:35:58.467 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/09/29 20:35:58.493 INFO Slf4jLogger: Slf4jLogger started
14/09/29 20:35:58.499 INFO Remoting: Starting remoting
14/09/29 20:35:58.502 INFO Remoting: Remoting shut down
14/09/29 20:35:58.503 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
14/0
{code}
[jira] [Commented] (SPARK-3739) Too many splits for small source file in table scanning
[ https://issues.apache.org/jira/browse/SPARK-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152762#comment-14152762 ]

Apache Spark commented on SPARK-3739:
-------------------------------------

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2589

> Too many splits for small source file in table scanning
> -------------------------------------------------------
>
> Key: SPARK-3739
> URL: https://issues.apache.org/jira/browse/SPARK-3739
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Cheng Hao
> Priority: Minor
>
> For table scanning, the source file input split is probably better based on HDFS blocks, rather than on the settings of 'mapred.reducer.tasks' or 'taskScheduler.defaultParallelism'; see [http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.3.0-mr1-cdh5.1.0/org/apache/hadoop/mapred/FileInputFormat.java#203].
> Currently, there seem to be too many splits for a small source file.
[jira] [Commented] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"
[ https://issues.apache.org/jira/browse/SPARK-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152765#comment-14152765 ]

Cheng Lian commented on SPARK-3738:
-----------------------------------

False alarm... it's because Hive's default SerDe uses '\n' as the record delimiter.

> InsertIntoHiveTable can't handle strings with "\n"
> --------------------------------------------------
>
> Key: SPARK-3738
> URL: https://issues.apache.org/jira/browse/SPARK-3738
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: Cheng Lian
> Priority: Blocker
>
> Try the following snippet in {{sbt/sbt hive/console}} to reproduce:
> {code}
> sql("drop table if exists z")
> case class Str(s: String)
> sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z")
> table("z").count()
> {code}
> The expected result is 1, but 2 is returned instead.
[jira] [Comment Edited] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152758#comment-14152758 ]

Matt Cheah edited comment on SPARK-1860 at 9/30/14 3:21 AM:
------------------------------------------------------------

I agree we should focus the scope on cleaning up things that have successfully finished. Preserving state is beneficial in erroneous cases. However, should it not be the case that when an Executor shuts down, it cleans up all of the files it created? As you stated, the Worker doesn't know where a particular Executor is storing its data, but the Executor should know where it is storing its own data, and be managing it and cleaning up when completed. This is regardless of the distinction between application data and shuffle data.

The Executor class has a record of the files and jars added through the SparkContext (currentFiles and currentJars fields) for that Executor's use, and these should naturally expire and be cleaned up when the Executor terminates. The top level application directories may still remain for a short time (Executors can only delete the subdirectory they work with) but the Worker can do a pass and remove empty directories that were all cleaned up by the completed Executor task.

was (Author: mcheah):

I agree we should focus the scope on cleaning up things that have successfully finished. Preserving state is beneficial in erroneous cases. However, should it not be the case that when an Executor shuts down, it cleans up all of the files it created? As you stated, the Worker doesn't know where a particular Executor is storing its data, but the Executor should know where it is storing its own data, and be managing it and cleaning up when completed. This is regardless of the distinction between application data and shuffle data.

The Executor class has a record of the files and jars added through the SparkContext (currentFiles and currentJars fields) for that Executor's use, and these should naturally expire and be cleaned up when the Executor terminates.

> Standalone Worker cleanup should not clean up running executors
> ---------------------------------------------------------------
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.0.0
> Reporter: Aaron Davidson
> Priority: Blocker
>
> The default values of the standalone worker cleanup code clean up all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard.
> Executors' log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default.
[jira] [Created] (SPARK-3739) Too many splits for small source file in table scanning
Cheng Hao created SPARK-3739:
-----------------------------

Summary: Too many splits for small source file in table scanning
Key: SPARK-3739
URL: https://issues.apache.org/jira/browse/SPARK-3739
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Cheng Hao
Priority: Minor

For table scanning, the source file input split is probably better based on HDFS blocks, rather than on the settings of 'mapred.reducer.tasks' or 'taskScheduler.defaultParallelism'; see [http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.3.0-mr1-cdh5.1.0/org/apache/hadoop/mapred/FileInputFormat.java#203].

Currently, there seem to be too many splits for a small source file.
[jira] [Comment Edited] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152758#comment-14152758 ]

Matt Cheah edited comment on SPARK-1860 at 9/30/14 3:14 AM:
------------------------------------------------------------

I agree we should focus the scope on cleaning up things that have successfully finished. Preserving state is beneficial in erroneous cases. However, should it not be the case that when an Executor shuts down, it cleans up all of the files it created? As you stated, the Worker doesn't know where a particular Executor is storing its data, but the Executor should know where it is storing its own data, and be managing it and cleaning up when completed. This is regardless of the distinction between application data and shuffle data.

The Executor class has a record of the files and jars added through the SparkContext (currentFiles and currentJars fields) for that Executor's use, and these should naturally expire and be cleaned up when the Executor terminates.

was (Author: mcheah):

I agree we should focus the scope on cleaning up things that have successfully finished. However, should it not be the case that when an Executor shuts down, it cleans up all of the files it created? As you stated, the Worker doesn't know where a particular Executor is storing its data, but the Executor should know where it is storing its own data, and be managing it and cleaning up when completed. This is regardless of the distinction between application data and shuffle data.

The Executor class has a record of the files and jars added through the SparkContext (currentFiles and currentJars fields) for that Executor's use, and these should naturally expire and be cleaned up when the Executor terminates.

> Standalone Worker cleanup should not clean up running executors
> ---------------------------------------------------------------
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.0.0
> Reporter: Aaron Davidson
> Priority: Blocker
>
> The default values of the standalone worker cleanup code clean up all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard.
> Executors' log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default.
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152758#comment-14152758 ]

Matt Cheah commented on SPARK-1860:
-----------------------------------

I agree we should focus the scope on cleaning up things that have successfully finished. However, should it not be the case that when an Executor shuts down, it cleans up all of the files it created? As you stated, the Worker doesn't know where a particular Executor is storing its data, but the Executor should know where it is storing its own data, and be managing it and cleaning up when completed. This is regardless of the distinction between application data and shuffle data.

The Executor class has a record of the files and jars added through the SparkContext (currentFiles and currentJars fields) for that Executor's use, and these should naturally expire and be cleaned up when the Executor terminates.

> Standalone Worker cleanup should not clean up running executors
> ---------------------------------------------------------------
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.0.0
> Reporter: Aaron Davidson
> Priority: Blocker
>
> The default values of the standalone worker cleanup code clean up all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard.
> Executors' log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default.
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152744#comment-14152744 ] Aaron Davidson commented on SPARK-1860: --- Note that there are two separate forms of cleanup: application data cleanup (jars and logs) and shuffle data cleanup. Standalone Worker cleanup deals with the former, Executor termination handlers deal with the latter. The purpose is not to deal with executors that have terminated ungracefully, but to actually clean up old application directories. Here the idea is that a Worker may be running for a very long time (weeks, months) and over time accumulates hundreds of application directories. We want to delete these directories after several days of them being terminated (today we'll clean them up whether or not they're terminated, which loses their jars and logs), after which we presumably don't care anymore. We do not want to clean them up immediately after application termination. The Worker performing shuffle data cleanup for ungracefully terminated Executors is not a bad idea, but is a (smallish) feature onto itself, as the Worker does not currently know where a particular Executor is storing its data. > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. 
[jira] [Commented] (SPARK-3504) KMeans optimization: track distances and unmoved cluster centers across iterations
[ https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152730#comment-14152730 ] Patrick Wendell commented on SPARK-3504: I just updated the title to make it more descriptive. Not an expert on this part of the code! > KMeans optimization: track distances and unmoved cluster centers across > iterations > -- > > Key: SPARK-3504 > URL: https://issues.apache.org/jira/browse/SPARK-3504 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.2 >Reporter: Derrick Burns > > The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because > recomputes all distances to all cluster centers on each iteration. In later > iterations of Lloyd's algorithm, points don't change clusters and clusters > don't move. > By 1) tracking which clusters move and 2) tracking for each point which > cluster it belongs to and the distance to that cluster, one can avoid > recomputing distances in many cases with very little increase in memory > requirements. > I implemented this new algorithm and the results were fantastic. Using 16 > c3.8xlarge machines on EC2, the clusterer converged in 13 iterations on > 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here > are the running times for the first 7 rounds: > 6 minutes and 42 second > 7 minutes and 7 seconds > 7 minutes 13 seconds > 1 minutes 18 seconds > 30 seconds > 18 seconds > 12 seconds > Without this improvement, all rounds would have taken roughly 7 minutes, > resulting in Lloyd's iterations taking 7 * 13 = 91 minutes. In other words, > this improvement resulting in a reduction of roughly 75% in running time with > no loss of accuracy. > My implementation is a rewrite of the existing 1.0.2 implementation. It is > not a simple modification of the existing implementation. Please let me know > if you are interested in this new implementation. 
[jira] [Updated] (SPARK-3504) KMeans optimization: track distances and unmoved cluster centers across iterations
[ https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3504: --- Summary: KMeans optimization: track distances and unmoved cluster centers across iterations (was: KMeans clusterer is slow, can be sped up by 75%) > KMeans optimization: track distances and unmoved cluster centers across > iterations > -- > > Key: SPARK-3504 > URL: https://issues.apache.org/jira/browse/SPARK-3504 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.2 >Reporter: Derrick Burns > > The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because > recomputes all distances to all cluster centers on each iteration. In later > iterations of Lloyd's algorithm, points don't change clusters and clusters > don't move. > By 1) tracking which clusters move and 2) tracking for each point which > cluster it belongs to and the distance to that cluster, one can avoid > recomputing distances in many cases with very little increase in memory > requirements. > I implemented this new algorithm and the results were fantastic. Using 16 > c3.8xlarge machines on EC2, the clusterer converged in 13 iterations on > 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here > are the running times for the first 7 rounds: > 6 minutes and 42 second > 7 minutes and 7 seconds > 7 minutes 13 seconds > 1 minutes 18 seconds > 30 seconds > 18 seconds > 12 seconds > Without this improvement, all rounds would have taken roughly 7 minutes, > resulting in Lloyd's iterations taking 7 * 13 = 91 minutes. In other words, > this improvement resulting in a reduction of roughly 75% in running time with > no loss of accuracy. > My implementation is a rewrite of the existing 1.0.2 implementation. It is > not a simple modification of the existing implementation. Please let me know > if you are interested in this new implementation. 
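The caching strategy described in the ticket — track which centers moved, and cache each point's current cluster and distance — can be sketched in a few lines. This is an illustrative Python reconstruction of the idea under stated assumptions, not Derrick Burns' actual rewrite:

```python
import math

def lloyd_iteration(points, centers, assign, dist, moved):
    """One Lloyd's step with the caching described in the ticket.

    assign[i] / dist[i] cache point i's cluster index and distance to it;
    `moved` holds the indices of centers that changed in the last update.
    If a point's own center did not move, its cached distance is still
    exact, and unmoved centers cannot have become closer, so only the
    moved centers need to be checked.
    """
    for i, p in enumerate(points):
        own = assign[i]
        if own is None or own in moved:
            candidates = range(len(centers))  # cache stale: full scan
            dist[i] = math.inf
        else:
            candidates = moved                # cache exact: check movers only
        for j in candidates:
            d = math.dist(p, centers[j])
            if d < dist[i]:
                assign[i], dist[i] = j, d

def update_centers(points, centers, assign, tol=1e-9):
    """Recompute centroids; return indices of centers that actually moved."""
    moved = set()
    for j in range(len(centers)):
        members = [points[i] for i in range(len(points)) if assign[i] == j]
        if not members:
            continue
        new = tuple(sum(coords) / len(members) for coords in zip(*members))
        if math.dist(new, centers[j]) > tol:
            centers[j] = new
            moved.add(j)
    return moved
```

In late, mostly-converged iterations `moved` is small, so almost all distance computations are skipped — consistent with the reported drop from ~7-minute rounds to seconds.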
[jira] [Updated] (SPARK-2548) JavaRecoverableWordCount is missing
[ https://issues.apache.org/jira/browse/SPARK-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2548: --- Labels: starter (was: ) > JavaRecoverableWordCount is missing > --- > > Key: SPARK-2548 > URL: https://issues.apache.org/jira/browse/SPARK-2548 > Project: Spark > Issue Type: Bug > Components: Documentation, Streaming >Affects Versions: 0.9.2, 1.0.1 >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > > JavaRecoverableWordCount was mentioned in the doc but not in the codebase. We > need to rewrite the example because the code was lost during the migration > from spark/spark-incubating to apache/spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3736) Workers should reconnect to Master if disconnected
[ https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3736: --- Priority: Critical (was: Major) Target Version/s: 1.2.0 > Workers should reconnect to Master if disconnected > -- > > Key: SPARK-3736 > URL: https://issues.apache.org/jira/browse/SPARK-3736 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Andrew Ash >Priority: Critical > > In standalone mode, when a worker gets disconnected from the master for some > reason it never attempts to reconnect. In this situation you have to bounce > the worker before it will reconnect to the master. > The preferred alternative is to follow what Hadoop does -- when there's a > disconnect, attempt to reconnect at a particular interval until successful (I > think it repeats indefinitely every 10sec). > This has been observed by: > - [~pkolaczk] in > http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html > - [~romi-totango] in > http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html > - [~aash] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3738) InsertIntoHiveTable can't handle strings with "\n"
Cheng Lian created SPARK-3738: - Summary: InsertIntoHiveTable can't handle strings with "\n" Key: SPARK-3738 URL: https://issues.apache.org/jira/browse/SPARK-3738 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Priority: Blocker Try the following snippet in {{sbt/sbt hive/console}} to reproduce: {code} sql("drop table if exists z") case class Str(s: String) sparkContext.parallelize(Str("a\nb") :: Nil, 1).saveAsTable("z") table("z").count() {code} Expected result should be 1, but 2 is returned instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
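Per the resolution above, this turned out to be a property of Hive's default text SerDe rather than a Spark bug: '\n' is the record delimiter, so a stored value containing a newline is read back as two rows. A toy Python emulation of the round trip shows why the count comes back as 2:

```python
# A text-format Hive table stores one record per line; the default SerDe
# uses '\n' as the record delimiter. This toy emulation shows how a single
# logical row whose value contains a newline is read back as two records.

def write_rows(rows):
    # serialize: one record per line, newline-terminated
    return "".join(r + "\n" for r in rows)

def read_rows(data):
    # deserialize: split the stored file back into records on '\n'
    return data.rstrip("\n").split("\n")

stored = write_rows(["a\nb"])   # one logical row containing a newline
print(len(read_rows(stored)))   # 2 -- the row comes back as two records
```

Escaping the value, or using a non-text storage format, avoids the ambiguity.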
[jira] [Closed] (SPARK-3737) [Docs] Broken Link - Minor typo
[ https://issues.apache.org/jira/browse/SPARK-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincent Ohprecio closed SPARK-3737. --- Resolution: Duplicate Closed the PR; this issue can be closed as a duplicate of PR #2558 ('fix the "Building Spark" url'), which was opened 2 days ago and is the original. > [Docs] Broken Link - Minor typo > > > Key: SPARK-3737 > URL: https://issues.apache.org/jira/browse/SPARK-3737 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Vincent Ohprecio >Priority: Trivial > Fix For: 1.1.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3737) [Docs] Broken Link - Minor typo
[ https://issues.apache.org/jira/browse/SPARK-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152709#comment-14152709 ] Vincent Ohprecio commented on SPARK-3737: - I have closed the PR. Similar to: fix the "Building Spark" url #2558 > [Docs] Broken Link - Minor typo > > > Key: SPARK-3737 > URL: https://issues.apache.org/jira/browse/SPARK-3737 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Vincent Ohprecio >Priority: Trivial > Fix For: 1.1.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3737) [Docs] Broken Link - Minor typo
[ https://issues.apache.org/jira/browse/SPARK-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152703#comment-14152703 ] Apache Spark commented on SPARK-3737: - User 'bigsnarfdude' has created a pull request for this issue: https://github.com/apache/spark/pull/2587 > [Docs] Broken Link - Minor typo > > > Key: SPARK-3737 > URL: https://issues.apache.org/jira/browse/SPARK-3737 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Vincent Ohprecio >Priority: Trivial > Fix For: 1.1.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3737) [Docs] Broken Link - Minor typo
Vincent Ohprecio created SPARK-3737: --- Summary: [Docs] Broken Link - Minor typo Key: SPARK-3737 URL: https://issues.apache.org/jira/browse/SPARK-3737 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Vincent Ohprecio Priority: Trivial Fix For: 1.1.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present
[ https://issues.apache.org/jira/browse/SPARK-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3729: Target Version/s: 1.2.0 > Null-pointer when constructing a HiveContext when settings are present > -- > > Key: SPARK-3729 > URL: https://issues.apache.org/jira/browse/SPARK-3729 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > > {code} > java.lang.NullPointerException > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242) > at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) > at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:78) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:76) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3635) Find Strongly Connected Components with Graphx has a small bug
[ https://issues.apache.org/jira/browse/SPARK-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-3635. --- Resolution: Fixed Fix Version/s: 1.1.1 1.2.0 Issue resolved by pull request 2486 [https://github.com/apache/spark/pull/2486] > Find Strongly Connected Components with Graphx has a small bug > -- > > Key: SPARK-3635 > URL: https://issues.apache.org/jira/browse/SPARK-3635 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0 > Environment: VMWare, Centos 6.5 >Reporter: Oded Zimerman >Priority: Trivial > Fix For: 1.2.0, 1.1.1 > > Original Estimate: 0h > Remaining Estimate: 0h > > The strongly connected components function (spark / graphx / src / main / > scala / org / apache / spark / graphx / lib / > StronglyConnectedComponents.scala) has a typo in the condition on line 78. > I think the condition should be "if (e.srcAttr._1 < e.dstAttr._1)" instead of > "if (e.srcId < e.dstId)" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152636#comment-14152636 ] Xiangrui Meng commented on SPARK-3434: -- [~shivaram] Could you post the design of the partitioning strategy for block matrices? I think we should have a 2D partitioner, which consists of the row partitioner and column partitioner. A matrix with partitioner (p1, p2) can multiply a matrix with partitioner (p2, p3), resulting in a matrix with partitioner (p1, p3). > Distributed block matrix > > > Key: SPARK-3434 > URL: https://issues.apache.org/jira/browse/SPARK-3434 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng > > This JIRA is for discussing distributed matrices stored in block > sub-matrices. The main challenge is the partitioning scheme to allow adding > linear algebra operations in the future, e.g.: > 1. matrix multiplication > 2. matrix factorization (QR, LU, ...) > Let's discuss the partitioning and storage and how they fit into the above > use cases. > Questions: > 1. Should it be backed by a single RDD that contains all of the sub-matrices > or many RDDs with each containing only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
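The 2D partitioner Xiangrui sketches can be illustrated with a toy Python class. The names (`GridPartitioner`, `can_multiply`, and so on) are hypothetical for this sketch, not an existing MLlib API:

```python
class GridPartitioner:
    """Toy 2D block-matrix partitioner: a row partitioner with `rows`
    partitions crossed with a column partitioner with `cols` partitions."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols

    def partition(self, block_row, block_col):
        # map block coordinates to a flat partition id
        return (block_row % self.rows) * self.cols + (block_col % self.cols)

    def can_multiply(self, other):
        # A with partitioner (p1, p2) can multiply B with (p2, p3):
        # A's column partitioning must line up with B's row partitioning
        return self.cols == other.rows

    def multiply_partitioner(self, other):
        # the product of (p1, p2) x (p2, p3) inherits partitioner (p1, p3)
        assert self.can_multiply(other)
        return GridPartitioner(self.rows, other.cols)
```

The payoff of factoring the partitioner this way is that multiplication compatibility becomes a cheap structural check rather than a data-dependent one.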
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152621#comment-14152621 ] Matt Cheah commented on SPARK-1860: --- ExecutorRunner seems to have various cases corresponding to how the Executor exited. ExecutorRunner also creates the directory in fetchAndRunExecutor(). We can catch all of the exit cases there and delete the directory in any case. In the case that the executor failed to exit, however, it would be best to preserve the logs instead of blindly killing the whole directory. On that note, one other thought is that perhaps we actually want to preserve the directory entirely upon crash, since preserving the state will allow us to better understand what happened, i.e. what jars and files were present and so on. > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected
[ https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152594#comment-14152594 ] Andrew Ash commented on SPARK-3736: --- I can't tell for sure but this is possibly related to SPARK-704 or SPARK-1771 > Workers should reconnect to Master if disconnected > -- > > Key: SPARK-3736 > URL: https://issues.apache.org/jira/browse/SPARK-3736 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Andrew Ash > > In standalone mode, when a worker gets disconnected from the master for some > reason it never attempts to reconnect. In this situation you have to bounce > the worker before it will reconnect to the master. > The preferred alternative is to follow what Hadoop does -- when there's a > disconnect, attempt to reconnect at a particular interval until successful (I > think it repeats indefinitely every 10sec). > This has been observed by: > - [~pkolaczk] in > http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html > - [~romi-totango] in > http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html > - [~aash] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3736) Workers should reconnect to Master if disconnected
[ https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3736: -- Affects Version/s: 1.1.0 > Workers should reconnect to Master if disconnected > -- > > Key: SPARK-3736 > URL: https://issues.apache.org/jira/browse/SPARK-3736 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Andrew Ash > > In standalone mode, when a worker gets disconnected from the master for some > reason it never attempts to reconnect. In this situation you have to bounce > the worker before it will reconnect to the master. > The preferred alternative is to follow what Hadoop does -- when there's a > disconnect, attempt to reconnect at a particular interval until successful (I > think it repeats indefinitely every 10sec). > This has been observed by: > - [~pkolaczk] in > http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html > - [~romi-totango] in > http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html > - [~aash] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3736) Workers should reconnect to Master if disconnected
Andrew Ash created SPARK-3736: - Summary: Workers should reconnect to Master if disconnected Key: SPARK-3736 URL: https://issues.apache.org/jira/browse/SPARK-3736 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Andrew Ash In standalone mode, when a worker gets disconnected from the master for some reason it never attempts to reconnect. In this situation you have to bounce the worker before it will reconnect to the master. The preferred alternative is to follow what Hadoop does -- when there's a disconnect, attempt to reconnect at a particular interval until successful (I think it repeats indefinitely every 10sec). This has been observed by: - [~pkolaczk] in http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html - [~romi-totango] in http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html - [~aash] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
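The Hadoop-style behavior proposed here — retry at a fixed interval until the connection succeeds, indefinitely by default — amounts to a small loop. A hedged Python sketch (function and parameter names are made up for illustration; Spark's Worker is Scala/Akka code):

```python
import time

def reconnect_with_retry(try_connect, interval_s=10, max_attempts=None,
                         sleep=time.sleep):
    """Keep retrying the connection at a fixed interval until it succeeds.

    `try_connect` returns True on success; max_attempts=None retries
    indefinitely, mirroring the 'repeats every 10sec' behavior described
    in the issue. `sleep` is injectable so tests need not wait.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        if try_connect():
            return True
        attempts += 1
        sleep(interval_s)
    return False
```

A production version would likely also cap or jitter the interval, but the key property is simply that disconnection is never terminal without operator intervention.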
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152552#comment-14152552 ] Andrew Ash commented on SPARK-1860: --- Cleanup on executor shutdown is part of the solution (and should be done IMO) but not all of it. In particular, it won't cover cases where an executor dies from an OOM, a kill -9, or any other unclean shutdown. The perfect solution would do the event-based cleanup on executor shutdown itself, and also run a periodic cleaner to get rid of directories that were shut down uncleanly. > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152526#comment-14152526 ] Matthew Farrellee commented on SPARK-3685: -- if you're going to go down this path the best (i'd say correct) way to implement it is to have support from yarn - a way to tell yarn "i'm only going to need X,Y,Z resources from now on" without giving up the execution container. i bet there's a way to re-exec the jvm into a smaller form factor. > Spark's local dir should accept only local paths > > > Key: SPARK-3685 > URL: https://issues.apache.org/jira/browse/SPARK-3685 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.1.0 >Reporter: Andrew Or > > When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it > will try to do is create a folder called "hdfs:" and put "tmp" inside it. > This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead > of Hadoop's file system to parse this path. We also need to resolve the path > appropriately. > This may not have an urgent use case, but it fails silently and does what is > least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
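The failure mode in the issue description — java.io.File-style parsing treating "hdfs:/tmp/foo" as a relative path and creating a folder literally named "hdfs:" — suggests validating the URI scheme up front and failing loudly instead of silently. A hedged Python sketch of such a check (the function name is made up):

```python
from urllib.parse import urlparse

def validate_local_dir(path):
    """Reject non-local URIs for a local-dir setting.

    Naive file-path parsing of 'hdfs:/tmp/foo' would create a directory
    named 'hdfs:' and put 'tmp' inside it; raising here surfaces the
    misconfiguration instead of failing silently.
    """
    scheme = urlparse(path).scheme
    if scheme not in ("", "file"):
        raise ValueError(
            "local dir must be a local path, got scheme %r: %s" % (scheme, path))
    return path
```

The same idea applies regardless of language: parse the setting as a URI first, then decide whether a plain filesystem path is a valid interpretation.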
[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3694: --- Labels: starter (was: ) > Allow printing object graph of tasks/RDD's with a debug flag > > > Key: SPARK-3694 > URL: https://issues.apache.org/jira/browse/SPARK-3694 > Project: Spark > Issue Type: Bug >Reporter: Patrick Wendell >Assignee: Patrick Wendell > Labels: starter > > This would be useful for debugging extra references inside of RDD's > Here is an example for inspiration: > http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html > We'd want to print this trace for both the RDD serialization inside of the > DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
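In the spirit of the linked Ehcache ObjectGraphWalker, a minimal reference-graph dump can be built with ordinary attribute reflection. This Python sketch only illustrates the shape of the idea; Spark's version would walk JVM objects during RDD serialization in the DAGScheduler and task serialization in the TaskSetManager:

```python
def walk_object_graph(root, max_depth=5):
    """Depth-first dump of the reference graph hanging off `root`.

    Visits each object once (cycle-safe via the `seen` set) and records
    one indented line per object, giving a quick picture of what extra
    references a closure or task is dragging along.
    """
    seen = set()
    lines = []

    def visit(obj, depth):
        if depth > max_depth or id(obj) in seen:
            return
        seen.add(id(obj))
        lines.append("  " * depth + type(obj).__name__)
        # follow instance attributes; builtins without __dict__ are leaves
        children = vars(obj).values() if hasattr(obj, "__dict__") else []
        for child in children:
            visit(child, depth + 1)

    visit(root, 0)
    return "\n".join(lines)
```

Behind a debug flag, printing this trace before serialization makes unexpected captured references (the usual cause of bloated or unserializable tasks) immediately visible.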
[jira] [Commented] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment
[ https://issues.apache.org/jira/browse/SPARK-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152505#comment-14152505 ] Apache Spark commented on SPARK-3734: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/2586 > DriverRunner should not read SPARK_HOME from submitter's environment > > > Key: SPARK-3734 > URL: https://issues.apache.org/jira/browse/SPARK-3734 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.1.0, 1.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark > Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the > submitting machine, then DriverRunner will attempt to use the _submitter's_ > JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), > which can cause the job to fail unless the submitter and worker have Java > installed in the same location. > This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of > {{command.environment}}; PR pending shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment
[ https://issues.apache.org/jira/browse/SPARK-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3734: -- Description: If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the submitting machine, then DriverRunner will attempt to use the _submitter's_ JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), which can cause the job to fail unless the submitter and worker have Java installed in the same location. This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of {{command.environment}}; PR pending shortly. was: If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the submitting machine, then DriverRunner will attempt to use the _submitter's_ JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), which can cause the job to fail unless the submitter and worker don't have Java installed in the same location. This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of {{command.environment}}; PR pending shortly. > DriverRunner should not read SPARK_HOME from submitter's environment > > > Key: SPARK-3734 > URL: https://issues.apache.org/jira/browse/SPARK-3734 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.1.0, 1.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark > Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the > submitting machine, then DriverRunner will attempt to use the _submitter's_ > JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), > which can cause the job to fail unless the submitter and worker have Java > installed in the same location. 
> This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of > {{command.environment}}; PR pending shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
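The fix amounts to resolving the JVM from the worker's own environment (Scala's `sys.env`) rather than from the environment forwarded with the submitted command. A hedged Python analogue of that lookup (the function name is made up):

```python
import os

def java_executable(worker_env=os.environ):
    """Resolve the java binary from the *worker's* environment.

    The bug was reading JAVA_HOME out of the submitter-supplied command
    environment; the worker must consult its own environment, falling
    back to whatever 'java' is on its PATH when JAVA_HOME is unset.
    """
    java_home = worker_env.get("JAVA_HOME")
    return os.path.join(java_home, "bin", "java") if java_home else "java"
```

The injectable `worker_env` parameter stands in for the distinction between `sys.env` and `command.environment` in the actual DriverRunner code.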
[jira] [Created] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS
Xiangrui Meng created SPARK-3735: Summary: Sending the factor directly or AtA based on the cost in ALS Key: SPARK-3735 URL: https://issues.apache.org/jira/browse/SPARK-3735 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Xiangrui Meng It is common to have some super popular products in the dataset. In this case, sending many user factors to the target product block could be more expensive than sending the normal equation `\sum_i u_i u_i^T` and `\sum_i u_i r_ij` to the product block. The cost of sending a single factor is `k`, while the cost of sending a normal equation is much more expensive, `k * (k + 3) / 2`. However, if we use normal equation for all products associated with a user, we don't need to send this user factor. Determining the optimal assignment is hard. But we could use a simple heuristic. Inside any rating block, 1) order the product ids by the number of user ids associated with them in desc order 2) starting from the most popular product, mark popular products as "use normal eq" and calculate the cost Remember the best assignment that comes with the lowest cost and use it for computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
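The greedy heuristic in the ticket can be sketched directly from its cost model: sending one user factor costs k floats, sending one normal equation costs k * (k + 3) / 2 (a triangular AtA plus Atb). This Python sketch illustrates the prefix search over popularity-ordered products; the names are made up and it is not the eventual Spark implementation:

```python
def best_assignment(product_users, k):
    """Pick which products in a rating block use the normal-equation route.

    Order products by popularity (descending), try marking each prefix of
    popular products as 'use normal equation', and keep the prefix with
    the lowest total send cost. A user's factor need not be sent at all
    if every product of that user in this block takes the normal-equation
    route.
    """
    ne_cost = k * (k + 3) / 2           # floats per normal-equation message
    by_pop = sorted(product_users, key=lambda p: len(product_users[p]),
                    reverse=True)
    best = (float("inf"), set())
    for cut in range(len(by_pop) + 1):
        # users still needing their factor sent: those attached to any
        # product outside the normal-equation prefix
        factor_users = set().union(*(product_users[p] for p in by_pop[cut:]))
        cost = cut * ne_cost + len(factor_users) * k
        if cost < best[0]:
            best = (cost, set(by_pop[:cut]))
    return best
```

With one super-popular product and a large user set, the prefix containing just that product wins; with high-dimensional factors (large k), the quadratic normal-equation cost pushes the optimum back toward sending factors.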
[jira] [Created] (SPARK-3734) DriverRunner should not read SPARK_HOME from submitter's environment
Josh Rosen created SPARK-3734: - Summary: DriverRunner should not read SPARK_HOME from submitter's environment Key: SPARK-3734 URL: https://issues.apache.org/jira/browse/SPARK-3734 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.0, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen If you use {{spark-submit}} in {{cluster}} mode to submit a job to a Spark Standalone cluster, and the {{JAVA_HOME}} environment variable is set on the submitting machine, then DriverRunner will attempt to use the _submitter's_ JAVA_HOME to launch the driver process (instead of the worker's JAVA_HOME), which can cause the job to fail unless the submitter and worker have Java installed in the same location. This has a pretty simple fix: read JAVA_HOME from {{sys.env}} instead of {{command.environment}}; PR pending shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3007) Add "Dynamic Partition" support to Spark Sql hive
[ https://issues.apache.org/jira/browse/SPARK-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-3007. --- Resolution: Fixed Fix Version/s: 1.2.0 https://github.com/apache/spark/commit/0bbe7faeffa17577ae8a33dfcd8c4c783db5c909 > Add "Dynamic Partition" support to Spark Sql hive > --- > > Key: SPARK-3007 > URL: https://issues.apache.org/jira/browse/SPARK-3007 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: baishuo > Fix For: 1.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152421#comment-14152421 ] Matt Cheah commented on SPARK-1860: --- Apologies for any naivety - this will be the first issue I tackle as a Spark contributor. Mingyu and I had a short chat and we thought it would be reasonable for the Executor to simply clean up its own state when it shuts down. Is there anything preventing Executor.stop() from cleaning up the app directory it was using? > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default settings of the standalone worker cleanup code clean up all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executors' log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152409#comment-14152409 ] Thomas Graves edited comment on SPARK-3732 at 9/29/14 10:19 PM: I understand your usecase and need for it, but I think at this point I don't think we want to say we support it without properly addressing the bigger picture. The only public supported non-deprecated api is via spark-submit script. This means that we won't guarantee backwards compatibility on it. Note that we specifically discussed the Client being public on another PR and it was decided that it isn't officially supported. The only reason the object was left public was for backwards compatibility with how you used to start spark on yarn with the spark-class script. was (Author: tgraves): I understand your usecase and need for it, but I think at this point I don't think we want to say we support it without properly addressing the bigger picture. The only public supported non-deprecated api is via spark-submit script. This means that we won't guarantee backwards compatibility on it. Note that we specifically discussed the Client beyond public on another PR and it was decided that it isn't officially supported. The only reason the object was left public was for backwards compatibility with how you used to start spark on yarn with the spark-class script. > Yarn Client: Add option to NOT System.exit() at end of main() > - > > Key: SPARK-3732 > URL: https://issues.apache.org/jira/browse/SPARK-3732 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Sotos Matzanas > Original Estimate: 1h > Remaining Estimate: 1h > > We would like to add the ability to create and submit Spark jobs > programmatically via Scala/Java. We have found a way to hack this and submit > jobs via Yarn, but since > org.apache.spark.deploy.yarn.Client.main() > exits with either 0 or 1 in the end, this will mean exit of our own program. 
> We would like to add an optional spark conf param to NOT exit at the end of > the main -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152409#comment-14152409 ] Thomas Graves commented on SPARK-3732: -- I understand your use case and need for it, but at this point I don't think we want to say we support it without properly addressing the bigger picture. The only publicly supported, non-deprecated API is via the spark-submit script. This means that we won't guarantee backwards compatibility on it. Note that we specifically discussed the Client being public on another PR and it was decided that it isn't officially supported. The only reason the object was left public was for backwards compatibility with how you used to start spark on yarn with the spark-class script. > Yarn Client: Add option to NOT System.exit() at end of main() > - > > Key: SPARK-3732 > URL: https://issues.apache.org/jira/browse/SPARK-3732 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Sotos Matzanas > Original Estimate: 1h > Remaining Estimate: 1h > > We would like to add the ability to create and submit Spark jobs > programmatically via Scala/Java. We have found a way to hack this and submit > jobs via Yarn, but since > org.apache.spark.deploy.yarn.Client.main() > exits with either 0 or 1 in the end, this will mean exit of our own program. > We would like to add an optional spark conf param to NOT exit at the end of > the main -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3709) BroadcastSuite.Unpersisting org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152377#comment-14152377 ] Apache Spark commented on SPARK-3709: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2585 > BroadcastSuite.Unpersisting > org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is > flaky > > > Key: SPARK-3709 > URL: https://issues.apache.org/jira/browse/SPARK-3709 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Reynold Xin >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3709) BroadcastSuite.Unpersisting org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3709: --- Assignee: Reynold Xin (was: Cheng Lian) > BroadcastSuite.Unpersisting > org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is > flaky > > > Key: SPARK-3709 > URL: https://issues.apache.org/jira/browse/SPARK-3709 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Reynold Xin >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152351#comment-14152351 ] Arun Ahuja commented on SPARK-3630: --- We have seen this issue as well: {code} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) at com.esotericsoftware.kryo.io.Input.fill(Input.java:142) at com.esotericsoftware.kryo.io.Input.require(Input.java:155) at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337) at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) ... Caused by: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362) at org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) at com.esotericsoftware.kryo.io.Input.fill(Input.java:140) {code} This is with Spark 1.1, running on a Yarn cluster. The issue seems to be fairly frequent but does not happen on every run. > Identify cause of Kryo+Snappy PARSING_ERROR > --- > > Key: SPARK-3630 > URL: https://issues.apache.org/jira/browse/SPARK-3630 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Ankur Dave > > A recent GraphX commit caused non-deterministic exceptions in unit tests so > it was reverted (see SPARK-3400). 
> Separately, [~aash] observed the same exception stacktrace in an > application-specific Kryo registrator: > {noformat} > com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to > uncompress the chunk: PARSING_ERROR(2) > com.esotericsoftware.kryo.io.Input.fill(Input.java:142) > com.esotericsoftware.kryo.io.Input.require(Input.java:169) > com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) > com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) > > com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) > > com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) > > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > ... > {noformat} > This ticket is to identify the cause of the exception in the GraphX commit so > the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152347#comment-14152347 ] Sung Chung commented on SPARK-3717: --- I think that this would be great as an alternative option. 1. Partitioning by rows (as currently implemented) scales in # of rows. 2. Partitioning by features scales in # of features. With good modularization, I think a lot of tree logic (splitting, building trees) could be shared among the different partitioning schemes. > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. > * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. 
> ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. > * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. > * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle data and train subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
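The communication estimates in the example above are easy to double-check with a few lines. This is back-of-the-envelope arithmetic mirroring the issue's formulas, not Spark code; the function names are illustrative.

```python
def feature_partition_cost(levels, m, n):
    # each of the `levels` iterations reduces M ~8-byte values
    # and broadcasts N one-bit values
    return levels * (m * 8 + n)

def instance_partition_cost(depth, m, bins, classes):
    # roughly 2^depth nodes in the tree; reduce M x B x C
    # 8-byte values per node over the course of training
    return 2 ** depth * m * bins * classes * 8

# example from the issue: 2M instances, 3500 features,
# 100 bins, 5 classes, 6 levels (depth = 5)
by_feature = feature_partition_cost(6, 3500, 2_000_000)   # ~1.2 * 10^7
by_instance = instance_partition_cost(5, 3500, 100, 5)    # ~4.5 * 10^8
```

For these dimensions the feature-partitioned cost is roughly 40x lower, matching the issue's conclusion that partitioning by feature wins for deep trees on wide datasets.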
[jira] [Created] (SPARK-3733) Support for programmatically submitting Spark jobs
Sotos Matzanas created SPARK-3733: - Summary: Support for programmatically submitting Spark jobs Key: SPARK-3733 URL: https://issues.apache.org/jira/browse/SPARK-3733 Project: Spark Issue Type: New Feature Affects Versions: 1.1.0 Reporter: Sotos Matzanas Right now it's impossible to programmatically submit Spark jobs via a Scala (or Java) API. We would like to see that in a future version of Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152323#comment-14152323 ] Sotos Matzanas commented on SPARK-3732: --- [~tgraves] this jira is the first step for us to move forward with our own (very limited for now) version of programmatic job submits. I can add another jira to address the big issue, but we would like to see this one resolved first. Once we are confident with our solution we can contribute to the second jira. Let us know what you think > Yarn Client: Add option to NOT System.exit() at end of main() > - > > Key: SPARK-3732 > URL: https://issues.apache.org/jira/browse/SPARK-3732 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Sotos Matzanas > Original Estimate: 1h > Remaining Estimate: 1h > > We would like to add the ability to create and submit Spark jobs > programmatically via Scala/Java. We have found a way to hack this and submit > jobs via Yarn, but since > org.apache.spark.deploy.yarn.Client.main() > exits with either 0 or 1 in the end, this will mean exit of our own program. > We would like to add an optional spark conf param to NOT exit at the end of > the main -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3708) Backticks aren't handled correctly in aliases
[ https://issues.apache.org/jira/browse/SPARK-3708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152321#comment-14152321 ] Ravindra Pesala commented on SPARK-3708: I guess you are referring to HiveContext here, as there is no backtick support in SqlContext. I will work on this issue. Thank you. > Backticks aren't handled correctly in aliases > - > > Key: SPARK-3708 > URL: https://issues.apache.org/jira/browse/SPARK-3708 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Michael Armbrust > > Here's a failing test case: > {code} > sql("SELECT k FROM (SELECT `key` AS `k` FROM src) a") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152318#comment-14152318 ] Marcelo Vanzin commented on SPARK-3732: --- BTW, if the call is removed, it should be possible to do what you want more generically by calling {{SparkSubmit.main}} directly. That's still a little fishy, but it's a lot less fishy than calling Yarn-specific code directly. > Yarn Client: Add option to NOT System.exit() at end of main() > - > > Key: SPARK-3732 > URL: https://issues.apache.org/jira/browse/SPARK-3732 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Sotos Matzanas > Original Estimate: 1h > Remaining Estimate: 1h > > We would like to add the ability to create and submit Spark jobs > programmatically via Scala/Java. We have found a way to hack this and submit > jobs via Yarn, but since > org.apache.spark.deploy.yarn.Client.main() > exits with either 0 or 1 in the end, this will mean exit of our own program. > We would like to add an optional spark conf param to NOT exit at the end of > the main -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152309#comment-14152309 ] Marcelo Vanzin commented on SPARK-3732: --- Removing the call should work regardless; it's redundant, since the code will just exit normally anyway after that. > Yarn Client: Add option to NOT System.exit() at end of main() > - > > Key: SPARK-3732 > URL: https://issues.apache.org/jira/browse/SPARK-3732 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Sotos Matzanas > Original Estimate: 1h > Remaining Estimate: 1h > > We would like to add the ability to create and submit Spark jobs > programmatically via Scala/Java. We have found a way to hack this and submit > jobs via Yarn, but since > org.apache.spark.deploy.yarn.Client.main() > exits with either 0 or 1 in the end, this will mean exit of our own program. > We would like to add an optional spark conf param to NOT exit at the end of > the main -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152308#comment-14152308 ] Thomas Graves commented on SPARK-3732: -- I think you should just change the name of this jira to add support for programmatically calling the spark yarn Client. As you have found this isn't currently supported and I wouldn't want to just remove the exit and say its supported without thinking about other implications. > Yarn Client: Add option to NOT System.exit() at end of main() > - > > Key: SPARK-3732 > URL: https://issues.apache.org/jira/browse/SPARK-3732 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Sotos Matzanas > Original Estimate: 1h > Remaining Estimate: 1h > > We would like to add the ability to create and submit Spark jobs > programmatically via Scala/Java. We have found a way to hack this and submit > jobs via Yarn, but since > org.apache.spark.deploy.yarn.Client.main() > exits with either 0 or 1 in the end, this will mean exit of our own program. > We would like to add an optional spark conf param to NOT exit at the end of > the main -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152302#comment-14152302 ] Sotos Matzanas commented on SPARK-3732: --- we added the option as insurance for backward compatibility; removing the System.exit() call will obviously work unless somebody is checking the exit code from spark-submit explicitly > Yarn Client: Add option to NOT System.exit() at end of main() > - > > Key: SPARK-3732 > URL: https://issues.apache.org/jira/browse/SPARK-3732 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Sotos Matzanas > Original Estimate: 1h > Remaining Estimate: 1h > > We would like to add the ability to create and submit Spark jobs > programmatically via Scala/Java. We have found a way to hack this and submit > jobs via Yarn, but since > org.apache.spark.deploy.yarn.Client.main() > exits with either 0 or 1 in the end, this will mean exit of our own program. > We would like to add an optional spark conf param to NOT exit at the end of > the main -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152294#comment-14152294 ] Marcelo Vanzin commented on SPARK-3732: --- I think that explicit System.exit() could just be removed. Exposing an option for this sounds like overkill. [~tgraves] had some comments about that call in the past, though. > Yarn Client: Add option to NOT System.exit() at end of main() > - > > Key: SPARK-3732 > URL: https://issues.apache.org/jira/browse/SPARK-3732 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Sotos Matzanas > Original Estimate: 1h > Remaining Estimate: 1h > > We would like to add the ability to create and submit Spark jobs > programmatically via Scala/Java. We have found a way to hack this and submit > jobs via Yarn, but since > org.apache.spark.deploy.yarn.Client.main() > exits with either 0 or 1 in the end, this will mean exit of our own program. > We would like to add an optional spark conf param to NOT exit at the end of > the main -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152291#comment-14152291 ] Apache Spark commented on SPARK-3732: - User 'smatzana' has created a pull request for this issue: https://github.com/apache/spark/pull/2584 > Yarn Client: Add option to NOT System.exit() at end of main() > - > > Key: SPARK-3732 > URL: https://issues.apache.org/jira/browse/SPARK-3732 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Sotos Matzanas > Original Estimate: 1h > Remaining Estimate: 1h > > We would like to add the ability to create and submit Spark jobs > programmatically via Scala/Java. We have found a way to hack this and submit > jobs via Yarn, but since > org.apache.spark.deploy.yarn.Client.main() > exits with either 0 or 1 in the end, this will mean exit of our own program. > We would like to add an optional spark conf param to NOT exit at the end of > the main -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
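The problem this thread circles around is language-independent: a `main()` that ends in an unconditional exit call cannot be reused in-process, because it terminates the caller too. A minimal sketch of the usual fix pattern (illustrative names, not Spark's API): do the work in a function that returns a status, and keep a thin exiting wrapper for command-line use.

```python
import sys

def run(args):
    # does the actual work and *returns* an exit status,
    # so an in-process caller keeps control of its own program
    ok = bool(args)  # stand-in for the real submission logic
    return 0 if ok else 1

def main(argv=None):
    # thin CLI wrapper: preserves the exit-code contract for
    # scripts that inspect the status of the launcher process
    sys.exit(run(sys.argv[1:] if argv is None else argv))

# a programmatic caller invokes run() directly instead of main()
status = run(["--class", "MyApp"])
```

This is the same separation Marcelo's comments point toward: keep the exit code visible to shell callers without hard-coding `System.exit()` into the reusable path.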
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152275#comment-14152275 ] Reza Zadeh commented on SPARK-3434: --- It looks like Shivaram Venkataraman from the AMPlab has started work on this. I will be meeting with him to see if we can reuse some of his work. > Distributed block matrix > > > Key: SPARK-3434 > URL: https://issues.apache.org/jira/browse/SPARK-3434 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng > > This JIRA is for discussing distributed matrices stored in block > sub-matrices. The main challenge is the partitioning scheme to allow adding > linear algebra operations in the future, e.g.: > 1. matrix multiplication > 2. matrix factorization (QR, LU, ...) > Let's discuss the partitioning and storage and how they fit into the above > use cases. > Questions: > 1. Should it be backed by a single RDD that contains all of the sub-matrices > or many RDDs with each contains only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
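For concreteness, the single-collection layout raised in question 1 can be sketched locally with NumPy (a toy illustration of the partitioning idea, not a proposed API): sub-matrices keyed by `(block_row, block_col)`, which is already enough to express block multiplication C[i,j] = sum_k A[i,k] * B[k,j].

```python
import numpy as np

def to_blocks(mat, block_size):
    """Split a dense matrix into square sub-matrices keyed by
    (block_row, block_col); in Spark these pairs would be the
    keys of a single RDD holding all blocks."""
    rows, cols = mat.shape
    blocks = {}
    for i in range(0, rows, block_size):
        for j in range(0, cols, block_size):
            blocks[(i // block_size, j // block_size)] = \
                mat[i:i + block_size, j:j + block_size]
    return blocks

def block_multiply(a_blocks, b_blocks, n_blocks):
    """Block matrix product: C[i,j] = sum_k A[i,k] @ B[k,j]."""
    out = {}
    for (i, k), a in a_blocks.items():
        for j in range(n_blocks):
            b = b_blocks.get((k, j))
            if b is not None:
                out[(i, j)] = out.get((i, j), 0) + a @ b
    return out
```

The fact that `block_multiply` is expressible as a join plus reduce over one keyed collection is one argument for backing the matrix with a single RDD rather than one RDD per block.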
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3717: - Description: h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Each worker stores: ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column). ** all labels * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: * Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification. 
3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and training subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. Comparing Partitioning Methods Partitioning features cost < partitioning instances cost when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 was: h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. 
** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: * Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification. 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations,
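The communication estimates in the description above can be checked with a short script. This is only a sanity check of the arithmetic in the "Example: many instances" section (the helper names are illustrative, not part of any MLlib API):

```python
def feature_partition_cost(levels, num_features, num_instances):
    # Per level: reduce M feature stats (~8 bytes each) + broadcast N bits.
    return levels * (num_features * 8 + num_instances)

def instance_partition_cost(depth, num_features, num_bins, num_classes):
    # Summed over the tree: (# nodes ~ 2^D) x M x B x C values, ~8 bytes each.
    return (2 ** depth) * num_features * num_bins * num_classes * 8

# Example from the description: 2M instances, 3500 features, 100 bins,
# 5 classes, 6 levels (depth = 5).
feat = feature_partition_cost(6, 3500, 2_000_000)      # 12_168_000 (~1.2e7)
inst = instance_partition_cost(5, 3500, 100, 5)        # 448_000_000 (~4.5e8)
print(feat, inst, inst / feat)
```

With these parameters, feature partitioning moves roughly 35x less data, matching the JIRA's argument that it wins for deep trees over many instances.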
[jira] [Commented] (SPARK-3730) Any one else having building spark recently
[ https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152269#comment-14152269 ] Anant Daksh Asthana commented on SPARK-3730: Definitely not a Spark issue. Just thought someone on here knew a solution. > Any one else having building spark recently > --- > > Key: SPARK-3730 > URL: https://issues.apache.org/jira/browse/SPARK-3730 > Project: Spark > Issue Type: Question >Reporter: Anant Daksh Asthana >Priority: Minor > > I get an assertion error in > spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to > build. > I am building using > mvn -Pyarn -PHadoop-2.3 -DskipTests -Phive clean package > Here is the error i get http://pastebin.com/Shi43r53 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152251#comment-14152251 ] Manish Amde commented on SPARK-1547: Sure. I like your naming suggestion. I will rebase from the latest master now that the RF PR has been accepted. I will create a WIP PR soon after (with tests and docs) so that we can discuss the code in greater detail. > Add gradient boosting algorithm to MLlib > > > Key: SPARK-1547 > URL: https://issues.apache.org/jira/browse/SPARK-1547 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde >Assignee: Manish Amde > > This task requires adding the gradient boosting algorithm to Spark MLlib. The > implementation needs to adapt the gradient boosting algorithm to the scalable > tree implementation. > The tasks involves: > - Comparing the various tradeoffs and finalizing the algorithm before > implementation > - Code implementation > - Unit tests > - Functional tests > - Performance tests > - Documentation > [Ensembles design document (Google doc) | > https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
Sotos Matzanas created SPARK-3732: - Summary: Yarn Client: Add option to NOT System.exit() at end of main() Key: SPARK-3732 URL: https://issues.apache.org/jira/browse/SPARK-3732 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Sotos Matzanas We would like to add the ability to create and submit Spark jobs programmatically via Scala/Java. We have found a way to hack this and submit jobs via Yarn, but since org.apache.spark.deploy.yarn.Client.main() exits with either 0 or 1 in the end, this will mean exit of our own program. We would like to add an optional spark conf param to NOT exit at the end of the main -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
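The problem described above (a library's main() terminating the embedding program) can be illustrated with a minimal Python analogue. This is only a sketch of the failure mode and workaround, not Spark's actual API; the requested fix in the JIRA is an optional Spark conf param on the Scala side:

```python
import sys

def client_main(args):
    # Stand-in for a main() that unconditionally exits its process at the end,
    # as org.apache.spark.deploy.yarn.Client.main does with System.exit(0|1).
    succeeded = bool(args)
    sys.exit(0 if succeeded else 1)

def submit_programmatically(args):
    """Invoke the client without letting it terminate our own process.

    In Python, sys.exit raises SystemExit, so the caller can intercept it;
    the JVM equivalent requires tricks like a custom SecurityManager, which
    is why an opt-out conf flag is the cleaner fix.
    """
    try:
        client_main(args)
    except SystemExit as e:
        return e.code  # capture the would-be exit status instead of dying
    return 0

status = submit_programmatically(["--jar", "app.jar"])
print("client finished with status", status)  # our program keeps running
```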
[jira] [Commented] (SPARK-3709) BroadcastSuite.Unpersisting rg.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152238#comment-14152238 ] Reynold Xin commented on SPARK-3709: Adding stack trace {code} [info] - Unpersisting TorrentBroadcast on executors only in distributed mode *** FAILED *** [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 17, localhost): java.io.IOException: sendMessageReliably failed with ACK that signalled a remote error [info] org.apache.spark.network.nio.ConnectionManager$$anonfun$14.apply(ConnectionManager.scala:864) [info] org.apache.spark.network.nio.ConnectionManager$$anonfun$14.apply(ConnectionManager.scala:856) [info] org.apache.spark.network.nio.ConnectionManager$MessageStatus.markDone(ConnectionManager.scala:61) [info] org.apache.spark.network.nio.ConnectionManager.org$apache$spark$network$nio$ConnectionManager$$handleMessage(ConnectionManager.scala:655) [info] org.apache.spark.network.nio.ConnectionManager$$anon$10.run(ConnectionManager.scala:515) [info] java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] java.lang.Thread.run(Thread.java:745) [info] Driver stacktrace: [info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1192) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1181) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1180) [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [info] at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) [info] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1180) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:695) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:695) [info] at scala.Option.foreach(Option.scala:236) [info] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:695) [info] ... {code} > BroadcastSuite.Unpersisting > rg.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is > flaky > > > Key: SPARK-3709 > URL: https://issues.apache.org/jira/browse/SPARK-3709 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Cheng Lian >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152228#comment-14152228 ] Andrew Or commented on SPARK-3685: -- Not sure if I fully understand what you mean. If I'm running an executor and I request 30G from the beginning, my application uses all of it to do computation and all is good. After I decommission the executor, I would like to keep 1G just to serve the shuffle files, but this can't be done easily because we need to start a smaller JVM and a smaller container. (Yarn currently doesn't support scaling the size of a container while it's still running.) Either way we need to transfer some state from the bigger JVM to the smaller JVM, and that adds some complexity to the design. The simplest alternative then would be just to write whatever state to an external location and just terminate the executor JVM / container without starting a smaller one, and then have an external long-running service to serve these files. One proposal here then is to write these shuffle files to a special location and have the Yarn NM shuffle service serve the files. This is an alternative to DFS shuffle that is, however, highly specific to Yarn. I am doing some initial prototyping of this (the Yarn shuffle) approach to see how this will pan out. > Spark's local dir should accept only local paths > > > Key: SPARK-3685 > URL: https://issues.apache.org/jira/browse/SPARK-3685 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.1.0 >Reporter: Andrew Or > > When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it > will try to do is create a folder called "hdfs:" and put "tmp" inside it. > This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead > of Hadoop's file system to parse this path. We also need to resolve the path > appropriately. 
> This may not have an urgent use case, but it fails silently and does what is > least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
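The misparse described in the issue (java.io.File treating "hdfs:/tmp/foo" as a relative path whose first component is the literal string "hdfs:") can be illustrated with plain Python string/URI handling. This is only an analogue of the behavior, not the Spark code path:

```python
from urllib.parse import urlparse

local_dir = "hdfs:/tmp/foo"

# A plain file API (like java.io.File) has no notion of a URI scheme:
# the first path component is literally the string "hdfs:", which is why
# a directory named "hdfs:" gets created with "tmp" inside it.
first_component = local_dir.split("/")[0]   # "hdfs:"

# A URI-aware parser separates the scheme from the path, which is what a
# validity check for spark.local.dir would need before accepting a path.
parsed = urlparse(local_dir)
is_local = parsed.scheme in ("", "file")    # False for hdfs:/tmp/foo
print(first_component, parsed.scheme, parsed.path, is_local)
```

Rejecting (or at least warning on) any configured local dir whose scheme is neither empty nor "file" would turn the silent misbehavior into a clear error.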
[jira] [Commented] (SPARK-3652) upgrade spark sql hive version to 0.13.1
[ https://issues.apache.org/jira/browse/SPARK-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152201#comment-14152201 ] Zhan Zhang commented on SPARK-3652: --- Supporting different Hive versions is required. I think you can send another PR to support the thriftserver, which I didn't include in SPARK-2706 to limit the scope. > upgrade spark sql hive version to 0.13.1 > > > Key: SPARK-3652 > URL: https://issues.apache.org/jira/browse/SPARK-3652 > Project: Spark > Issue Type: Dependency upgrade > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > > now spark sql hive version is 0.12.0, compile with 0.13.1 will get errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3166) Custom serialisers can't be shipped in application jars
[ https://issues.apache.org/jira/browse/SPARK-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152194#comment-14152194 ] Paul Wais commented on SPARK-3166: -- +1 this issue is related to: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-find-proto-buffer-class-error-with-RDD-lt-protobuf-gt-td14529.html It looks like this PR places users jars on the executor's root classpath, which should fix the issue investigated in the above thread. > Custom serialisers can't be shipped in application jars > --- > > Key: SPARK-3166 > URL: https://issues.apache.org/jira/browse/SPARK-3166 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.2 >Reporter: Graham Dennis > > Spark cannot currently use a custom serialiser that is shipped with the > application jar. Trying to do this causes a java.lang.ClassNotFoundException > when trying to instantiate the custom serialiser in the Executor processes. > This occurs because Spark attempts to instantiate the custom serialiser > before the application jar has been shipped to the Executor process. A > reproduction of the problem is available here: > https://github.com/GrahamDennis/spark-custom-serialiser > I've verified this problem in Spark 1.0.2, and Spark master and 1.1 branches > as of August 21, 2014. This issue is related to SPARK-2878, and my fix for > that issue (https://github.com/apache/spark/pull/1890) also solves this. My > pull request was not merged because it adds the user jar to the Executor > processes' class path at launch time. Such a significant change was thought > by [~rxin] to require more QA, and should be considered for inclusion in 1.2 > at the earliest. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milan Straka updated SPARK-3731: Attachment: worker.log Sample worker.log showing the problem. For example, consider rdd_1_1. It has size 46.3MB. At the beginning, the caching works, but that stops -- the last time rdd_1_1 fails to fit into the cache, the following is reported: {{14/09/29 21:53:10 WARN CacheManager: Not enough space to cache partition rdd_1_1 in memory! Free memory is 148908945 bytes.}} > RDD caching stops working in pyspark after some time > > > Key: SPARK-3731 > URL: https://issues.apache.org/jira/browse/SPARK-3731 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.1.0 > Environment: Linux, 32bit, standalone mode >Reporter: Milan Straka > Attachments: worker.log > > > Consider a file F which when loaded with sc.textFile and cached takes up > slightly more than half of free memory for RDD cache. > When in PySpark the following is executed: > 1) a = sc.textFile(F) > 2) a.cache().count() > 3) b = sc.textFile(F) > 4) b.cache().count() > and then the following is repeated (for example 10 times): > a) a.unpersist().cache().count() > b) b.unpersist().cache().count() > after some time, there are no RDDs cached in memory. > Also, since that time, no other RDD ever gets cached (the worker always > reports something like "WARN CacheManager: Not enough space to cache > partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if > rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that > all executors have 0MB memory used (which is consistent with the CacheManager > warning). > When doing the same in scala, everything works perfectly. > I understand that this is a vague description, but I do not know how to > describe the problem better. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3731) RDD caching stops working in pyspark after some time
Milan Straka created SPARK-3731: --- Summary: RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, standalone mode Reporter: Milan Straka Consider a file F which when loaded with sc.textFile and cached takes up slightly more than half of free memory for RDD cache. When in PySpark the following is executed: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time, there are no RDDs cached in memory. Also, since that time, no other RDD ever gets cached (the worker always reports something like "WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in scala, everything works perfectly. I understand that this is a vague description, but I do not know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2693) Support for UDAF Hive Aggregates like PERCENTILE
[ https://issues.apache.org/jira/browse/SPARK-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2693: Assignee: Ravindra Pesala > Support for UDAF Hive Aggregates like PERCENTILE > > > Key: SPARK-2693 > URL: https://issues.apache.org/jira/browse/SPARK-2693 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Ravindra Pesala > > {code} > SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), > year,month,day FROM raw_data_table GROUP BY year, month, day > MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an > error as shown below. > Exception in thread "main" java.lang.RuntimeException: No handler for udf > class org.apache.hadoop.hive.ql.udf.UDAFPercentile > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) > {code} > This aggregate extends UDAF, which we don't yet have a wrapper for. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2693) Support for UDAF Hive Aggregates like PERCENTILE
[ https://issues.apache.org/jira/browse/SPARK-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2693: Priority: Critical (was: Major) > Support for UDAF Hive Aggregates like PERCENTILE > > > Key: SPARK-2693 > URL: https://issues.apache.org/jira/browse/SPARK-2693 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Ravindra Pesala >Priority: Critical > > {code} > SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), > year,month,day FROM raw_data_table GROUP BY year, month, day > MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an > error as shown below. > Exception in thread "main" java.lang.RuntimeException: No handler for udf > class org.apache.hadoop.hive.ql.udf.UDAFPercentile > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) > {code} > This aggregate extends UDAF, which we don't yet have a wrapper for. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152152#comment-14152152 ] Matthew Farrellee commented on SPARK-3685: -- the root of the resource problem is how they're handed out. yarn is giving you a whole cpu, some amount of memory, some amount of network and some amount of disk to work with. your executor (like any program) uses different amounts of resources throughout its execution. at points in the execution the resource profile changes, call the demarcated regions "phases". so an executor may transition from a high resource phase to a low resource phase. in a low resource phase, you may want to free up resources for other executors, but maintain enough to do basic operations (say: serve a shuffle file). this is a problem that should be solved by the resource manager. in my opinion, a solution w/i spark that isn't facilitated by the RM is a workaround/hack and should be avoided. an example of an RM-facilitated solution might be a message the executor can send to yarn to indicate its resources can be free'd, except for some minimum amount. > Spark's local dir should accept only local paths > > > Key: SPARK-3685 > URL: https://issues.apache.org/jira/browse/SPARK-3685 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.1.0 >Reporter: Andrew Or > > When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it > will try to do is create a folder called "hdfs:" and put "tmp" inside it. > This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead > of Hadoop's file system to parse this path. We also need to resolve the path > appropriately. > This may not have an urgent use case, but it fails silently and does what is > least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anant Daksh Asthana closed SPARK-3725. -- Resolution: Not a Problem > Link to building spark returns a 404 > > > Key: SPARK-3725 > URL: https://issues.apache.org/jira/browse/SPARK-3725 > Project: Spark > Issue Type: Documentation >Reporter: Anant Daksh Asthana >Priority: Minor > Original Estimate: 1m > Remaining Estimate: 1m > > The README.md link to "Building Spark" returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3730) Any one else having building spark recently
[ https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152140#comment-14152140 ] Sean Owen commented on SPARK-3730: -- (The profile is "hadoop-2.3" but that's not the issue.) I have seen this too and it's a {{scalac}} bug as far as I can tell, as you can see from the stack trace. It's not a Spark issue. > Any one else having building spark recently > --- > > Key: SPARK-3730 > URL: https://issues.apache.org/jira/browse/SPARK-3730 > Project: Spark > Issue Type: Question >Reporter: Anant Daksh Asthana >Priority: Minor > > I get an assertion error in > spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to > build. > I am building using > mvn -Pyarn -PHadoop-2.3 -DskipTests -Phive clean package > Here is the error i get http://pastebin.com/Shi43r53 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152138#comment-14152138 ] Sean Owen commented on SPARK-3725: -- No, that links to the raw markdown. Truly, the fix is to rebuild the site. The source is fine. > Link to building spark returns a 404 > > > Key: SPARK-3725 > URL: https://issues.apache.org/jira/browse/SPARK-3725 > Project: Spark > Issue Type: Documentation >Reporter: Anant Daksh Asthana >Priority: Minor > Original Estimate: 1m > Remaining Estimate: 1m > > The README.md link to "Building Spark" returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3730) Any one else having building spark recently
Anant Daksh Asthana created SPARK-3730: -- Summary: Any one else having building spark recently Key: SPARK-3730 URL: https://issues.apache.org/jira/browse/SPARK-3730 Project: Spark Issue Type: Question Reporter: Anant Daksh Asthana Priority: Minor I get an assertion error in spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to build. I am building using mvn -Pyarn -PHadoop-2.3 -DskipTests -Phive clean package Here is the error i get http://pastebin.com/Shi43r53 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present
[ https://issues.apache.org/jira/browse/SPARK-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152113#comment-14152113 ] Apache Spark commented on SPARK-3729: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/2583 > Null-pointer when constructing a HiveContext when settings are present > -- > > Key: SPARK-3729 > URL: https://issues.apache.org/jira/browse/SPARK-3729 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > > {code} > java.lang.NullPointerException > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242) > at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) > at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:78) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:76) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152101#comment-14152101 ] Joseph K. Bradley commented on SPARK-1547: -- This will be great to have! The WIP code and the list of to-do items look good to me. Small comment: For the losses, it would be good to rename "residual" to either "pseudoresidual" (following Friedman's paper) or to "lossGradient" (which is more literal/accurate). It would also be nice to have the loss classes compute the loss itself, so that we can compute that at the end (and later track it along the way). > Add gradient boosting algorithm to MLlib > > > Key: SPARK-1547 > URL: https://issues.apache.org/jira/browse/SPARK-1547 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde >Assignee: Manish Amde > > This task requires adding the gradient boosting algorithm to Spark MLlib. The > implementation needs to adapt the gradient boosting algorithm to the scalable > tree implementation. > The tasks involves: > - Comparing the various tradeoffs and finalizing the algorithm before > implementation > - Code implementation > - Unit tests > - Functional tests > - Performance tests > - Documentation > [Ensembles design document (Google doc) | > https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
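The renaming suggested in the comment above can be made concrete with a small sketch. Class and method names here are illustrative, not MLlib's actual API: a loss object exposes both the loss value and the pseudoresidual (Friedman's term for the negative gradient of the loss with respect to the prediction), which for squared error coincides with the plain residual:

```python
class SquaredErrorLoss:
    """Illustrative loss class for gradient boosting (hypothetical names)."""

    @staticmethod
    def loss(label, prediction):
        # Loss value itself, so progress can be tracked during training.
        return 0.5 * (label - prediction) ** 2

    @staticmethod
    def pseudoresidual(label, prediction):
        # Negative gradient of the loss w.r.t. the prediction; for squared
        # error this happens to equal the ordinary residual, which is why
        # "residual" reads as ambiguous for other losses.
        return label - prediction

r = SquaredErrorLoss.pseudoresidual(3.0, 2.5)   # 0.5
l = SquaredErrorLoss.loss(3.0, 2.5)             # 0.125
```

For losses other than squared error (e.g. absolute error or log loss) the pseudoresidual is not the residual, which is the point of preferring the more precise name.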
[jira] [Assigned] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present
[ https://issues.apache.org/jira/browse/SPARK-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-3729: --- Assignee: Michael Armbrust > Null-pointer when constructing a HiveContext when settings are present > -- > > Key: SPARK-3729 > URL: https://issues.apache.org/jira/browse/SPARK-3729 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > > {code} > java.lang.NullPointerException > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242) > at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) > at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:78) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:76) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present
Michael Armbrust created SPARK-3729: --- Summary: Null-pointer when constructing a HiveContext when settings are present Key: SPARK-3729 URL: https://issues.apache.org/jira/browse/SPARK-3729 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Blocker {code} java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270) at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.sql.SQLContext.(SQLContext.scala:78) at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:76) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
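The stack trace above suggests an initialization-order bug: the SQLContext constructor applies the configured settings via setConf, which HiveContext overrides, so the overridden method runs before HiveContext's own fields exist. A minimal Python analogue of that failure mode (class names are illustrative, not Spark's actual classes; Python raises AttributeError where the JVM raises NullPointerException):

```python
class SQLContextLike:
    def __init__(self, settings):
        # Applies settings during construction, calling a method the
        # subclass overrides -- before the subclass __init__ has run.
        for key, value in settings.items():
            self.set_conf(key, value)

    def set_conf(self, key, value):
        pass

class HiveContextLike(SQLContextLike):
    def __init__(self, settings):
        super().__init__(settings)      # set_conf is invoked in here...
        self.session_state = object()   # ...but this field isn't set yet

    def set_conf(self, key, value):
        # Analogous to HiveContext.setConf -> runSqlHive dereferencing
        # not-yet-initialized state.
        self.session_state  # AttributeError if called during super().__init__

try:
    HiveContextLike({"spark.sql.shuffle.partitions": "10"})
    failed = False
except AttributeError:
    failed = True

print("construction failed with settings present:", failed)
```

This also explains why the bug only appears "when settings are present": with an empty settings map the loop never calls the overridden method, and construction succeeds.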
[jira] [Comment Edited] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150695#comment-14150695 ] Andrew Davidson edited comment on SPARK-922 at 9/29/14 7:05 PM: here is how I am launching iPython notebook. I am running as the ec2-user IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark Below are all the upgrade commands I ran. I ran into a small problem: the ipython magic %matplotlib inline creates an error; you can work around this by commenting it out. Andy yum install -y pssh yum install -y python27 python27-devel pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27 pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27" easy_install-2.7 pip pssh -h /root/spark-ec2/slaves easy_install-2.7 pip pip2.7 install numpy pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy pip2.7 install ipython[all] printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh source /root/spark/conf/spark-env.sh was (Author: aedwip): here is how I am launching iPython notebook. I am running as the ec2-user IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark Below are all the upgrade commands I ran Any idea what I missed? 
Andy yum install -y pssh yum install -y python27 python27-devel pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27 pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27" easy_install-2.7 pip pssh -h /root/spark-ec2/slaves easy_install-2.7 pip pip2.7 install numpy pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy pip2.7 install ipython[all] printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh source /root/spark/conf/spark-env.sh > Update Spark AMI to Python 2.7 > -- > > Key: SPARK-922 > URL: https://issues.apache.org/jira/browse/SPARK-922 > Project: Spark > Issue Type: Task > Components: EC2, PySpark >Affects Versions: 0.9.0, 0.9.1, 1.0.0 >Reporter: Josh Rosen > Fix For: 1.2.0 > > > Many Python libraries only support Python 2.7+, so we should make Python 2.7 > the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150695#comment-14150695 ] Andrew Davidson edited comment on SPARK-922 at 9/29/14 7:03 PM: Here is how I am launching the IPython notebook. I am running as the ec2-user:
{code}
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark
{code}
Below are all the upgrade commands I ran. Any idea what I missed? Andy
{code}
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy
pip2.7 install ipython[all]
printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh
{code}
was (Author: aedwip): I must have missed something. I am running the IPython notebook over an ssh tunnel. I am still running the old version. I made sure to export PYSPARK_PYTHON=python2.7 and I also tried export PYSPARK_PYTHON=/usr/bin/python2.7
{code}
import IPython
print IPython.sys_info()
{'commit_hash': '858d539', 'commit_source': 'installation', 'default_encoding': 'UTF-8', 'ipython_path': '/usr/lib/python2.6/site-packages/ipython-0.13.2-py2.6.egg/IPython', 'ipython_version': '0.13.2', 'os_name': 'posix', 'platform': 'Linux-3.4.37-40.44.amzn1.x86_64-x86_64-with-glibc2.2.5', 'sys_executable': '/usr/bin/python2.6', 'sys_platform': 'linux2', 'sys_version': '2.6.9 (unknown, Sep 13 2014, 00:25:11) \n[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]'}
{code}
Here is how I am launching the IPython notebook. I am running as the ec2-user:
{code}
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark
{code}
Below are all the upgrade commands I ran. Any idea what I missed? Andy
{code}
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy
pip2.7 install ipython[all]
printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh
{code}
> Update Spark AMI to Python 2.7 > -- > > Key: SPARK-922 > URL: https://issues.apache.org/jira/browse/SPARK-922 > Project: Spark > Issue Type: Task > Components: EC2, PySpark >Affects Versions: 0.9.0, 0.9.1, 1.0.0 >Reporter: Josh Rosen > Fix For: 1.2.0 > > > Many Python libraries only support Python 2.7+, so we should make Python 2.7 > the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152070#comment-14152070 ] Andrew Or commented on SPARK-3685: -- Yeah, there will be a design doc soon on the possible solutions for dealing with shuffles. Note that one of the main motivations of doing this is to free up containers in Yarn when an application is not using them, so maintaining a pool of executor containers does not achieve what we want. Also, DFS shuffle is only one of the solutions we will consider, but we probably won't end up relying on it because of the overhead it adds (i.e. we'll probably need a different solution down the road either way). It could be a warning, but I think an exception is appropriate here because the user clearly thinks that their shuffle files are going into HDFS when they're not. Also, the fact that it fails fast means the user knows Spark won't do what they want before even a single shuffle file is written. Either way I don't feel strongly about this. > Spark's local dir should accept only local paths > > > Key: SPARK-3685 > URL: https://issues.apache.org/jira/browse/SPARK-3685 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.1.0 >Reporter: Andrew Or > > When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it > will try to do is create a folder called "hdfs:" and put "tmp" inside it. > This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead > of Hadoop's file system to parse this path. We also need to resolve the path > appropriately. > This may not have an urgent use case, but it fails silently and does what is > least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
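The fail-fast check the issue and comment describe is simple to sketch: reject any configured local dir whose path carries a non-local URI scheme. Below is a hypothetical Python illustration (the function name is invented; Spark's real check would live in Scala, around Util#getOrCreateLocalRootDirs):

```python
from urllib.parse import urlparse

def validate_local_dir(path):
    # A path like "hdfs:/tmp/foo" parses with scheme "hdfs"; a plain local
    # path such as "/tmp/foo" parses with an empty scheme. Fail fast on any
    # scheme other than "file" instead of silently creating an "hdfs:" folder.
    scheme = urlparse(path).scheme
    if scheme and scheme != "file":
        raise ValueError(
            "spark.local.dir must be a local path, got scheme %r in %r"
            % (scheme, path))
    return path
```

Raising here (rather than logging a warning) matches the comment's argument: the user learns before a single shuffle file is written that the files will not go to HDFS.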
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152061#comment-14152061 ] Anant Daksh Asthana commented on SPARK-3725: Would this pull request be a good idea? https://github.com/apache/spark/pull/2582 > Link to building spark returns a 404 > > > Key: SPARK-3725 > URL: https://issues.apache.org/jira/browse/SPARK-3725 > Project: Spark > Issue Type: Documentation >Reporter: Anant Daksh Asthana >Priority: Minor > Original Estimate: 1m > Remaining Estimate: 1m > > The README.md link to "Building Spark" returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152050#comment-14152050 ] Sean Owen commented on SPARK-3725: -- Yes of course, it's already in the repo and has been for a while. It was just renamed with a redirect from the old URL. But, that update hasn't hit the public site yet. > Link to building spark returns a 404 > > > Key: SPARK-3725 > URL: https://issues.apache.org/jira/browse/SPARK-3725 > Project: Spark > Issue Type: Documentation >Reporter: Anant Daksh Asthana >Priority: Minor > Original Estimate: 1m > Remaining Estimate: 1m > > The README.md link to "Building Spark" returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152047#comment-14152047 ] Anant Daksh Asthana commented on SPARK-3725: Would it make sense to add a "Building Spark" document to the repo? This would make the documentation easier to find, and anyone who has the source would have the docs for it as well. > Link to building spark returns a 404 > > > Key: SPARK-3725 > URL: https://issues.apache.org/jira/browse/SPARK-3725 > Project: Spark > Issue Type: Documentation >Reporter: Anant Daksh Asthana >Priority: Minor > Original Estimate: 1m > Remaining Estimate: 1m > > The README.md link to "Building Spark" returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152045#comment-14152045 ] Sean Owen commented on SPARK-3725: -- The link is correct in the doc source, but the public site needs to be rebuilt to get the new page referenced by README.md. > Link to building spark returns a 404 > > > Key: SPARK-3725 > URL: https://issues.apache.org/jira/browse/SPARK-3725 > Project: Spark > Issue Type: Documentation >Reporter: Anant Daksh Asthana >Priority: Minor > Original Estimate: 1m > Remaining Estimate: 1m > > The README.md link to "Building Spark" returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3728) RandomForest: Learn models too large to store in memory
Joseph K. Bradley created SPARK-3728: Summary: RandomForest: Learn models too large to store in memory Key: SPARK-3728 URL: https://issues.apache.org/jira/browse/SPARK-3728 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Proposal: Write trees to disk as they are learned. RandomForest currently uses a FIFO queue, which means training all trees at once via breadth-first search. Using a FILO queue would encourage the code to finish one tree before moving on to new ones. This would allow the code to write trees to disk as they are learned. Note: It would also be possible to write nodes to disk as they are learned using a FIFO queue, once the example-to-node mapping is cached [JIRA]. The [Sequoia Forest package]() does this. However, it could be useful to learn trees progressively, so that future functionality such as early stopping (training fewer trees than expected) could be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
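The effect of swapping the node queue from FIFO to FILO (i.e. a stack) can be seen with a toy model. This hypothetical Python sketch (the function name is invented, and real RandomForest training is far more involved) records the step at which each tree finishes; with a stack, the first tree completes early and could be flushed to disk immediately, while with a FIFO queue all trees advance in lockstep and only complete at the very end:

```python
from collections import deque

def completion_steps(num_trees, depth, lifo):
    # Toy model: each queue entry is (tree_id, level); processing a node at
    # a non-final level enqueues the next level of the same tree.
    queue = deque((t, 0) for t in range(num_trees))
    step = 0
    done = {}
    while queue:
        tree, level = queue.pop() if lifo else queue.popleft()
        step += 1
        if level + 1 < depth:
            queue.append((tree, level + 1))
        else:
            done[tree] = step  # tree fully trained; could be written to disk
    return done
```

With 3 trees of depth 4, the FILO order finishes trees at steps 4, 8, and 12, whereas the FIFO order finishes all three only at steps 10, 11, and 12.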
[jira] [Created] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
Joseph K. Bradley created SPARK-3727: Summary: DecisionTree, RandomForest: More prediction functionality Key: SPARK-3727 URL: https://issues.apache.org/jira/browse/SPARK-3727 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful:
* For classification: estimated probability of each possible label
* For regression: variance of the estimate
RandomForest could also create aggregate predictions in multiple ways:
* Predict mean or median value for regression.
* Compute variance of estimates (across all trees) for both classification and regression.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
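Both proposed statistics are cheap to derive once each tree's prediction for an example is available. A minimal Python sketch under that assumption (helper names are invented; the eventual MLlib API would look different):

```python
def class_probabilities(votes):
    # votes: the label each tree predicted for one example.
    # Estimated probability of a label = fraction of trees voting for it.
    n = len(votes)
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    return {label: c / n for label, c in counts.items()}

def prediction_variance(preds):
    # preds: the real-valued prediction of each tree for one example.
    # Variance across trees indicates the ensemble's uncertainty.
    n = len(preds)
    mean = sum(preds) / n
    return sum((p - mean) ** 2 for p in preds) / n
```

The same per-tree predictions also support the aggregate modes listed above, e.g. a median instead of a mean for regression.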
[jira] [Created] (SPARK-3726) RandomForest: Support for bootstrap options
Joseph K. Bradley created SPARK-3726: Summary: RandomForest: Support for bootstrap options Key: SPARK-3726 URL: https://issues.apache.org/jira/browse/SPARK-3726 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. The expected size of each sample is the same as the original data (sampling rate = 1.0), and sampling is done with replacement. Adding support for other sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
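The issue notes that RandomForest uses BaggedPoint to simulate bootstrapped samples rather than materializing them; the sketch below materializes samples purely to illustrate the two proposed knobs (the function name is invented):

```python
import random

def bagged_sample(data, subsample_rate=1.0, with_replacement=True, seed=None):
    # subsample_rate controls the expected sample size relative to the data;
    # the current behavior corresponds to rate=1.0 with replacement.
    rng = random.Random(seed)
    k = int(round(subsample_rate * len(data)))
    if with_replacement:
        return [rng.choice(data) for _ in range(k)]  # classic bootstrap
    return rng.sample(data, k)  # subsampling without replacement
```

Sampling without replacement at a rate below 1.0 yields a duplicate-free subset, which is the main behavioral difference from the bootstrap.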
[jira] [Created] (SPARK-3725) Link to building spark returns a 404
Anant Daksh Asthana created SPARK-3725: -- Summary: Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor The README.md link to "Building Spark" returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152018#comment-14152018 ] Patrick Wendell edited comment on SPARK-2331 at 9/29/14 6:34 PM: - Yeah, we could have made this a wider type in the public signature. However, it is not possible to do that while maintaining compatibility (others may be relying on this returning an EmptyRDD). This issue is not related to covariance, because here the type parameter is always String. So the compiler actually does understand that EmptyRDD[String] is a subtype of RDD[String]. The issue with the original example is that the Scala compiler always infers the narrowest type it can. So in a foldLeft expression it will by default assume the resulting type is EmptyRDD unless you upcast it to a more general type, as you are doing. And the union operation requires an exact type match on the two RDDs, including the type parameter. was (Author: pwendell): Yeah we could have made this a wider type in the public signature. However, it is not possible to do that while maintaining compatibility (others may be relying on this returning an EmptyRDD). For now though you can safely cast it to work around this: {code} scala> sc.emptyRDD[String].asInstanceOf[RDD[String]] res7: org.apache.spark.rdd.RDD[String] = EmptyRDD[3] at emptyRDD at :14 {code} > SparkContext.emptyRDD has wrong return type > --- > > Key: SPARK-2331 > URL: https://issues.apache.org/jira/browse/SPARK-2331 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Ian Hummel > > The return type for SparkContext.emptyRDD is EmptyRDD[T]. > It should be RDD[T]. 
That means you have to add extra type annotations on > code like the below (which creates a union of RDDs over some subset of paths > in a folder) > val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { > (rdd, path) ⇒ > rdd.union(sc.textFile(path)) > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3724) RandomForest: More options for feature subset size
Joseph K. Bradley created SPARK-3724: Summary: RandomForest: More options for feature subset size Key: SPARK-3724 URL: https://issues.apache.org/jira/browse/SPARK-3724 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor RandomForest currently supports using a few values for the number of features to sample per node: all, sqrt, log2, etc. It should support any given value (to allow model search). Proposal: If the parameter for specifying the number of features per node is not recognized (as “all”, “sqrt”, etc.), then it will be parsed as a numerical value. The value should be either (a) a real value in [0,1] specifying the fraction of features in each subset or (b) an integer value specifying the number of features in each subset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
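The fallback parsing the proposal describes fits in a few lines. A hypothetical Python illustration (function name invented; the real parameter handling would live in MLlib's Scala code, and the treatment of an ambiguous value like "1" is a design choice, here resolved by requiring a decimal point for fractions):

```python
import math

def parse_feature_subset(param, num_features):
    # Recognized named strategies keep their current meaning.
    named = {
        "all": num_features,
        "sqrt": int(math.sqrt(num_features)),
        "log2": int(math.log2(num_features)),
    }
    if param in named:
        return max(1, named[param])
    if "." in param:
        # (a) real value in (0, 1]: fraction of features per subset
        frac = float(param)
        if not 0 < frac <= 1:
            raise ValueError("fraction out of range: %r" % param)
        return max(1, int(frac * num_features))
    # (b) integer value: absolute number of features per subset
    n = int(param)
    if not 1 <= n <= num_features:
        raise ValueError("count out of range: %r" % param)
    return n
```

This keeps existing strategy names working while allowing arbitrary sizes for model search.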
[jira] [Resolved] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2331. Resolution: Won't Fix > SparkContext.emptyRDD has wrong return type > --- > > Key: SPARK-2331 > URL: https://issues.apache.org/jira/browse/SPARK-2331 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Ian Hummel > > The return type for SparkContext.emptyRDD is EmptyRDD[T]. > It should be RDD[T]. That means you have to add extra type annotations on > code like the below (which creates a union of RDDs over some subset of paths > in a folder) > val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { > (rdd, path) ⇒ > rdd.union(sc.textFile(path)) > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation
Joseph K. Bradley created SPARK-3723: Summary: DecisionTree, RandomForest: Add more instrumentation Key: SPARK-3723 URL: https://issues.apache.org/jira/browse/SPARK-3723 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Some simple instrumentation would help advanced users understand performance and check whether parameters (such as maxMemoryInMB) need to be tuned. Most important instrumentation (simple):
* min, avg, max nodes per group
* number of groups (passes over data)
More advanced instrumentation:
* For each tree (or averaged over trees), training set accuracy after training each level. This would be useful for visualizing learning behavior (to convince oneself that model selection was being done correctly).
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
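The "simple" statistics listed above amount to summarizing the node counts across groups. A minimal sketch, assuming the trainer records how many nodes each group (pass over the data) processed (names are hypothetical):

```python
def group_stats(nodes_per_group):
    # nodes_per_group: number of tree nodes trained in each group, where one
    # group corresponds to one pass over the training data.
    n = len(nodes_per_group)
    return {
        "num_groups": n,  # total passes over the data
        "min_nodes": min(nodes_per_group),
        "avg_nodes": sum(nodes_per_group) / n,
        "max_nodes": max(nodes_per_group),
    }
```

A consistently low average relative to the maximum would suggest maxMemoryInMB is limiting group sizes and may need tuning.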
[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152018#comment-14152018 ] Patrick Wendell commented on SPARK-2331: Yeah we could have made this a wider type in the public signature. However, it is not possible to do that while maintaining compatibility (others may be relying on this returning an EmptyRDD). For now though you can safely cast it to work around this: {code} scala> sc.emptyRDD[String].asInstanceOf[RDD[String]] res7: org.apache.spark.rdd.RDD[String] = EmptyRDD[3] at emptyRDD at :14 {code} > SparkContext.emptyRDD has wrong return type > --- > > Key: SPARK-2331 > URL: https://issues.apache.org/jira/browse/SPARK-2331 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Ian Hummel > > The return type for SparkContext.emptyRDD is EmptyRDD[T]. > It should be RDD[T]. That means you have to add extra type annotations on > code like the below (which creates a union of RDDs over some subset of paths > in a folder) > val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { > (rdd, path) ⇒ > rdd.union(sc.textFile(path)) > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org