[jira] [Updated] (SPARK-5914) Enable spark-submit to run with only user permission on windows
[ https://issues.apache.org/jira/browse/SPARK-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Judy Nash updated SPARK-5914: - Summary: Enable spark-submit to run with only user permission on windows (was: Spark-submit cannot execute without machine admin permission on windows) > Enable spark-submit to run with only user permission on windows > --- > > Key: SPARK-5914 > URL: https://issues.apache.org/jira/browse/SPARK-5914 > Project: Spark > Issue Type: Bug > Components: Spark Submit, Windows > Environment: Windows >Reporter: Judy Nash >Priority: Minor > > On the Windows platform only. > If the slave is executed with user permission, spark-submit fails with > java.lang.ClassNotFoundException when attempting to read the cached jar from > the spark_home\work folder. > This is because the jars do not have read permission set by default on > Windows. The fix is to add read permission explicitly for the owner of the file. > Having the service account running as admin (equivalent of sudo in Linux) is a > major security risk for production clusters. This makes it easy for hackers to > compromise the cluster by taking over the service account. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
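A minimal sketch of the fix described in SPARK-5914 (explicitly granting the file owner read access after a jar is cached under the work folder). Where this call would live in the worker code, and the helper name, are assumptions for illustration only:
{code}
import java.io.File

// Illustrative only: after an application jar is copied into spark_home\work,
// grant the owner read access so a non-admin service account can load classes
// from it on Windows, where cached jars do not get this permission by default.
def grantOwnerRead(jar: File): Unit = {
  // ownerOnly = true limits the change to the file owner's permissions
  if (!jar.setReadable(true, true)) {
    throw new java.io.IOException("Could not set owner read permission on " + jar.getAbsolutePath)
  }
}
{code}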
[jira] [Created] (SPARK-5968) Parquet warning in spark-shell
Michael Armbrust created SPARK-5968: --- Summary: Parquet warning in spark-shell Key: SPARK-5968 URL: https://issues.apache.org/jira/browse/SPARK-5968 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical {code} 15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file for rankings parquet.io.ParquetEncodingException: file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet invalid: all the files must be contained in the root rankings at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422) at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398) at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51) {code} it is only a warning, but kind of scary for the user. We should try to suppress this through the logger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
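One way to suppress the warning through the logger, as the last sentence suggests; this assumes the Parquet committer's output is routed through log4j in the deployment at hand, and note that it also hides any other warnings from that class:
{code}
import org.apache.log4j.{Level, Logger}

// Raise the threshold for the committer's logger so the harmless
// "could not write summary file" warning is no longer surfaced to users.
Logger.getLogger("parquet.hadoop.ParquetOutputCommitter").setLevel(Level.ERROR)
{code}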
[jira] [Commented] (SPARK-5967) JobProgressListener.stageIdToActiveJobIds never cleared
[ https://issues.apache.org/jira/browse/SPARK-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334534#comment-14334534 ] Apache Spark commented on SPARK-5967: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/4741 > JobProgressListener.stageIdToActiveJobIds never cleared > --- > > Key: SPARK-5967 > URL: https://issues.apache.org/jira/browse/SPARK-5967 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0, 1.3.0, 1.2.1 >Reporter: Tathagata Das >Assignee: Josh Rosen >Priority: Blocker > > The hashmap stageIdToActiveJobIds in JobProgressListener is never cleared. So > the hashmap keep increasing in size indefinitely. This is a blocker for 24/7 > running applications like Spark Streaming apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5967) JobProgressListener.stageIdToActiveJobIds never cleared
[ https://issues.apache.org/jira/browse/SPARK-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5967: - Affects Version/s: 1.2.0 1.2.1 > JobProgressListener.stageIdToActiveJobIds never cleared > --- > > Key: SPARK-5967 > URL: https://issues.apache.org/jira/browse/SPARK-5967 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0, 1.3.0, 1.2.1 >Reporter: Tathagata Das >Assignee: Josh Rosen >Priority: Blocker > > The hashmap stageIdToActiveJobIds in JobProgressListener is never cleared. So > the hashmap keep increasing in size indefinitely. This is a blocker for 24/7 > running applications like Spark Streaming apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5967) JobProgressListener.stageIdToActiveJobIds never cleared
Tathagata Das created SPARK-5967: Summary: JobProgressListener.stageIdToActiveJobIds never cleared Key: SPARK-5967 URL: https://issues.apache.org/jira/browse/SPARK-5967 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Tathagata Das Assignee: Josh Rosen Priority: Blocker The hashmap stageIdToActiveJobIds in JobProgressListener is never cleared. So the hashmap keeps increasing in size indefinitely. This is a blocker for 24/7 running applications like Spark Streaming apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
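A sketch of the kind of cleanup that would bound the map: when a job ends, remove its id from every stage entry and drop entries that become empty. The callback signature is simplified and the rest of the listener's bookkeeping is omitted, so this is illustrative rather than the actual JobProgressListener code:
{code}
import scala.collection.mutable

val stageIdToActiveJobIds = new mutable.HashMap[Int, mutable.HashSet[Int]]

// Simplified callback: the real listener receives a SparkListenerJobEnd event.
def onJobEnd(jobId: Int, stageIds: Seq[Int]): Unit = {
  stageIds.foreach { stageId =>
    stageIdToActiveJobIds.get(stageId).foreach { jobs =>
      jobs -= jobId
      // Drop the stage entry once no active job references it, so the map
      // cannot grow without bound in long-running (24/7) applications.
      if (jobs.isEmpty) stageIdToActiveJobIds -= stageId
    }
  }
}
{code}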
[jira] [Commented] (SPARK-4705) Driver retries in cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334499#comment-14334499 ] Twinkle Sachdeva commented on SPARK-4705: - Hi [~vanzin], Please have a look at the Updated UI - II attachment. Thanks, Twinkle > Driver retries in cluster mode always fail if event logging is enabled > -- > > Key: SPARK-4705 > URL: https://issues.apache.org/jira/browse/SPARK-4705 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.2.0 >Reporter: Marcelo Vanzin > Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, Updated UI - > II.png > > > yarn-cluster mode will retry to run the driver in certain failure modes. If > event logging is enabled, this will most probably fail, because: > {noformat} > Exception in thread "Driver" java.io.IOException: Log directory > hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 > already exists! > at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129) > at org.apache.spark.util.FileLogger.start(FileLogger.scala:115) > at > org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74) > at org.apache.spark.SparkContext.(SparkContext.scala:353) > {noformat} > The event log path should be "more unique". Or perhaps retries of the same app > should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
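A sketch of the first remedy the description mentions (a "more unique" event log path), suffixing the directory with the attempt id so a retried driver does not collide with the directory left by an earlier attempt. How the attempt id is obtained in yarn-cluster mode is an assumption here:
{code}
// Illustrative only: build a per-attempt event log directory.
def eventLogDir(baseDir: String, appId: String, attemptId: Option[String]): String = {
  val suffix = attemptId.map(id => "_" + id).getOrElse("")
  s"$baseDir/$appId$suffix"
}
{code}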
[jira] [Commented] (SPARK-5965) Spark UI does not show main class when running app in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334493#comment-14334493 ] Apache Spark commented on SPARK-5965: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/4739 > Spark UI does not show main class when running app in standalone cluster mode > - > > Key: SPARK-5965 > URL: https://issues.apache.org/jira/browse/SPARK-5965 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.3.0 >Reporter: Tathagata Das >Assignee: Andrew Or >Priority: Blocker > Attachments: screenshot-1.png > > > I tried this. > bin/spark-submit --verbose --supervise --master spark://Zion.local:7077 > --deploy-mode cluster --class test.MemoryTest --driver-memory 1G > /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar > 100 0.5 > The app got launched correctly but the Spark web UI showed the following > screenshot -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5965) Spark UI does not show main class when running app in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5965: - Target Version/s: 1.3.0 > Spark UI does not show main class when running app in standalone cluster mode > - > > Key: SPARK-5965 > URL: https://issues.apache.org/jira/browse/SPARK-5965 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.3.0 >Reporter: Tathagata Das >Assignee: Andrew Or >Priority: Blocker > Attachments: screenshot-1.png > > > I tried this. > bin/spark-submit --verbose --supervise --master spark://Zion.local:7077 > --deploy-mode cluster --class test.MemoryTest --driver-memory 1G > /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar > 100 0.5 > The app got launched correctly but the Spark web UI showed the following > screenshot -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4705) Driver retries in cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Twinkle Sachdeva updated SPARK-4705: Attachment: Updated UI - II.png > Driver retries in cluster mode always fail if event logging is enabled > -- > > Key: SPARK-4705 > URL: https://issues.apache.org/jira/browse/SPARK-4705 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.2.0 >Reporter: Marcelo Vanzin > Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, Updated UI - > II.png > > > yarn-cluster mode will retry to run the driver in certain failure modes. If > even logging is enabled, this will most probably fail, because: > {noformat} > Exception in thread "Driver" java.io.IOException: Log directory > hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 > already exists! > at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129) > at org.apache.spark.util.FileLogger.start(FileLogger.scala:115) > at > org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74) > at org.apache.spark.SparkContext.(SparkContext.scala:353) > {noformat} > The even log path should be "more unique". Or perhaps retries of the same app > should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5953) NoSuchMethodException with a Kafka input stream and custom decoder in Scala
[ https://issues.apache.org/jira/browse/SPARK-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334469#comment-14334469 ] Saisai Shao commented on SPARK-5953: I think the decoder is initialized in the executor side, so your class should be found in executor side. One way is to add the jar with customized decoder class included using SparkContext#addJar. > NoSuchMethodException with a Kafka input stream and custom decoder in Scala > --- > > Key: SPARK-5953 > URL: https://issues.apache.org/jira/browse/SPARK-5953 > Project: Spark > Issue Type: Bug > Components: Input/Output, Spark Core >Affects Versions: 1.2.0, 1.2.1 > Environment: Xubuntu 14.04, Kafka 0.8.2, Scala 2.10.4, Scala 2.11.5 >Reporter: Aleksandar Stojadinovic > > When using a Kafka input stream, and setting a custom Kafka Decoder, Spark > throws an exception upon starting: > {noformat} > ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting > receiver 0 - java.lang.NoSuchMethodException: > UserLocationEventDecoder.(kafka.utils.VerifiableProperties) > at java.lang.Class.getConstructor0(Class.java:2971) > at java.lang.Class.getConstructor(Class.java:1812) > at > org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 15/02/23 21:37:31 ERROR ReceiverSupervisorImpl: Stopped executor with error: > java.lang.NoSuchMethodException: > UserLocationEventDecoder.(kafka.utils.VerifiableProperties) > 15/02/23 21:37:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.NoSuchMethodException: > UserLocationEventDecoder.(kafka.utils.VerifiableProperties) > at java.lang.Class.getConstructor0(Class.java:2971) > at java.lang.Class.getConstructor(Class.java:1812) > at > org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The stream is initialized with: > {code:title=Main.scala|borderStyle=solid} > val locationsAndKeys = KafkaUtils.createStream[String, Array[Byte], > kafka.serializer.StringDecoder, UserLocationEventDecoder] (ssc, kafkaParams, > topicMap, StorageLevel.MEMORY_AND_DISK); > {code} > The decoder: > {code:title=UserLocationEventDecoder.scala|borderStyle=solid} > import kafka.serializer.Decoder > class UserLocationEventDecoder extends Decoder[UserLocationEvent] { > val kryo = new Kryo() > override def fromBytes(bytes: Array[Byte]): UserLocationEvent = { > val input: Input = new Input(new ByteArrayInputStream(bytes)) > val userLocationEvent: UserLo
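Two things follow from the stack trace and the comment above: the Kafka receiver instantiates the decoder reflectively through a constructor that takes kafka.utils.VerifiableProperties (that is the failing getConstructor call), and the decoder class must be on the executors' classpath, for example via SparkContext#addJar. A simplified sketch of a decoder shape that satisfies the reflective lookup; the payload type is reduced to raw bytes here for brevity, and the jar path is illustrative:
{code}
import kafka.serializer.Decoder
import kafka.utils.VerifiableProperties

// The (props: VerifiableProperties) constructor is what
// KafkaInputDStream's getConstructor call is looking for.
class RawBytesDecoder(props: VerifiableProperties = null) extends Decoder[Array[Byte]] {
  override def fromBytes(bytes: Array[Byte]): Array[Byte] = bytes
}

// Ship the jar containing the custom decoder to the executors:
// sc.addJar("/path/to/user-decoders.jar")
{code}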
[jira] [Resolved] (SPARK-5958) Update block matrix user guide
[ https://issues.apache.org/jira/browse/SPARK-5958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5958. -- Resolution: Fixed Fix Version/s: 1.3.0 > Update block matrix user guide > -- > > Key: SPARK-5958 > URL: https://issues.apache.org/jira/browse/SPARK-5958 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > Fix For: 1.3.0 > > > There are some minor issues (see the PR) with the current block matrix user > guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3850) Scala style: disallow trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334351#comment-14334351 ] Nicholas Chammas edited comment on SPARK-3850 at 2/24/15 5:16 AM: -- {quote} enabled="false" {quote} Per the parent issue SPARK-3849, I believe this issue is about enabling this rule in a non-intrusive way. So I think we still need this issue. was (Author: nchammas): {quote} enabled="false" {quote} Per the parent issue SPARK-3849, I believe this issue about enabling this rule in a non-intrusive way. So I think we still need this issue. > Scala style: disallow trailing spaces > - > > Key: SPARK-3850 > URL: https://issues.apache.org/jira/browse/SPARK-3850 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Reporter: Nicholas Chammas > > [Ted Yu on the dev > list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] > suggested using {{WhitespaceEndOfLineChecker}} here: > http://www.scalastyle.org/rules-0.1.0.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]
[ https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5966: - Description: {code} [tdas @ Zion spark] bin/spark-submit --master local[10] --class test.MemoryTest --driver-memory 1G /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G] App arguments: Usage: MemoryTest [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster --class test.MemoryTest --driver-memory 1G /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar java.lang.ClassNotFoundException: at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} was: {{ [tdas @ Zion spark] bin/spark-submit --master local[10] --class test.MemoryTest --driver-memory 1G /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G] App arguments: Usage: MemoryTest [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster --class test.MemoryTest --driver-memory 1G /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar java.lang.ClassNotFoundException: at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) }} > Spark-submit deploy-mode incorrectly affecting submission when master = > local[4] > - > > Key: SPARK-5966 > URL: https://issues.apache.org/jira/browse/SPARK-5966 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.0 >Reporter: Tathagata Das >Assignee: Andrew Or >Priority: Critical > > {code} > [tdas @ Zion spark] bin/spark-submit --master local[10] --class > test.MemoryTest --driver-memory 1G > /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar > JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G] > App arguments: > Usage: MemoryTest > [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster > --class test.MemoryTest --driver-memory 1G > /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar > java.lang.ClassNotFoundException: > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) > at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]
[ https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5966: - Description: {{ [tdas @ Zion spark] bin/spark-submit --master local[10] --class test.MemoryTest --driver-memory 1G /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G] App arguments: Usage: MemoryTest [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster --class test.MemoryTest --driver-memory 1G /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar java.lang.ClassNotFoundException: at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) }} was: {{code}} [tdas @ Zion spark] bin/spark-submit --master local[10] --class test.MemoryTest --driver-memory 1G /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G] App arguments: Usage: MemoryTest [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster --class test.MemoryTest --driver-memory 1G /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar java.lang.ClassNotFoundException: at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {{code}} > Spark-submit deploy-mode incorrectly affecting submission when master = > local[4] > - > > Key: SPARK-5966 > URL: https://issues.apache.org/jira/browse/SPARK-5966 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.0 >Reporter: Tathagata Das >Assignee: Andrew Or >Priority: Critical > > {{ > [tdas @ Zion spark] bin/spark-submit --master local[10] --class > test.MemoryTest --driver-memory 1G > /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar > JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G] > App arguments: > Usage: MemoryTest > [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster > --class test.MemoryTest --driver-memory 1G > /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar > java.lang.ClassNotFoundException: > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) > at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > }} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]
Tathagata Das created SPARK-5966: Summary: Spark-submit deploy-mode incorrectly affecting submission when master = local[4] Key: SPARK-5966 URL: https://issues.apache.org/jira/browse/SPARK-5966 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.3.0 Reporter: Tathagata Das Assignee: Andrew Or Priority: Critical {{code}} [tdas @ Zion spark] bin/spark-submit --master local[10] --class test.MemoryTest --driver-memory 1G /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G] App arguments: Usage: MemoryTest [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster --class test.MemoryTest --driver-memory 1G /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar java.lang.ClassNotFoundException: at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {{code}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
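A sketch of the kind of up-front validation that would replace the confusing ClassNotFoundException: when the master is local there is no cluster to submit the driver to, so SparkSubmit could either reject the combination with a clear message or fall back to client mode. The check below is illustrative, not the actual SparkSubmit code:
{code}
// Illustrative argument check for spark-submit option handling.
def checkDeployMode(master: String, deployMode: String): Unit = {
  if (master.startsWith("local") && deployMode == "cluster") {
    sys.error(s"Cluster deploy mode is not applicable to master '$master'")
  }
}
{code}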
[jira] [Updated] (SPARK-5965) Spark UI does not show main class when running app in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5965: - Attachment: screenshot-1.png > Spark UI does not show main class when running app in standalone cluster mode > - > > Key: SPARK-5965 > URL: https://issues.apache.org/jira/browse/SPARK-5965 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.3.0 >Reporter: Tathagata Das >Assignee: Andrew Or >Priority: Blocker > Attachments: screenshot-1.png > > > I tried this. > bin/spark-submit --verbose --supervise --master spark://Zion.local:7077 > --deploy-mode cluster --class test.MemoryTest --driver-memory 1G > /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar > 100 0.5 > The app got launched correctly but the Spark web UI showed the following > screenshot -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5965) Spark UI does not show main class when running app in standalone cluster mode
Tathagata Das created SPARK-5965: Summary: Spark UI does not show main class when running app in standalone cluster mode Key: SPARK-5965 URL: https://issues.apache.org/jira/browse/SPARK-5965 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.0 Reporter: Tathagata Das Assignee: Andrew Or Priority: Blocker I tried this. bin/spark-submit --verbose --supervise --master spark://Zion.local:7077 --deploy-mode cluster --class test.MemoryTest --driver-memory 1G /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar 100 0.5 The app got launched correctly but the Spark web UI showed the following screenshot -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1182) Sort the configuration parameters in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334361#comment-14334361 ] Reynold Xin edited comment on SPARK-1182 at 2/24/15 4:02 AM: - Actually when I filed this ticket, I don't think there was extensive high level grouping of config options (there was maybe two or three). I took a look at the latest grouping and they looked reasonable. A few problems with the existing grouping: 1. Networking section actually contains some shuffle settings. Those should be moved into shuffle. 2. Application Properties contains some serialization settings. Those should be moved into Compression and Serialization. 3. Runtime Environment contains some Mesos specific setting. All the way down on the page we link to each resource manager's own page for resource manager specific settings. 4. Scheduling section contains a mesos specific setting (spark.mesos.coarse). 5. It is sort of arbitrary that "Dynamic allocation" has its own section (at the very least, "a" should be upper case.) 6. "spark.ui.view.acls" is in security, but it also belongs in UI. Where do we put stuff like this? Within each section, config options should probably be sorted alphabetically. was (Author: rxin): Actually when I filed this ticket, I don't think there was extensive high level grouping of config options (there was maybe two or three). I took a look at the latest grouping and they looked reasonable. A few problems with the existing grouping: 1. Networking section actually contains some shuffle settings. Those should be moved into shuffle. 2. Application Properties contains some serialization settings. Those should be moved into Compression and Serialization. 3. Runtime Environment contains some Mesos specific setting. All the way down on the page we link to each resource manager's own page for resource manager specific settings. 4. Scheduling section contains a mesos specific setting (spark.mesos.coarse). 5. It is sort of arbitrary that "Dynamic allocation" has its own section (at the very least, "a" should be upper case.) 6. "spark.ui.view.acls" is in security, but it also belongs in UI. Where do we put stuff like this? > Sort the configuration parameters in configuration.md > - > > Key: SPARK-1182 > URL: https://issues.apache.org/jira/browse/SPARK-1182 > Project: Spark > Issue Type: Task > Components: Documentation >Reporter: Reynold Xin >Assignee: prashant >Priority: Minor > > It is a little bit confusing right now since the config options are all over > the place in some arbitrarily sorted order. > https://github.com/apache/spark/blob/master/docs/configuration.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1182) Sort the configuration parameters in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334361#comment-14334361 ] Reynold Xin edited comment on SPARK-1182 at 2/24/15 4:01 AM: - Actually when I filed this ticket, I don't think there was any extensive high level grouping of config options (there was maybe two or three). I took a look at the latest grouping and they looked reasonable. A few problems with the existing grouping: 1. Networking section actually contains some shuffle settings. Those should be moved into shuffle. 2. Application Properties contains some serialization settings. Those should be moved into Compression and Serialization. 3. Runtime Environment contains some Mesos specific setting. All the way down on the page we link to each resource manager's own page for resource manager specific settings. 4. Scheduling section contains a mesos specific setting (spark.mesos.coarse). 5. It is sort of arbitrary that "Dynamic allocation" has its own section (at the very least, "a" should be upper case.) 6. "spark.ui.view.acls" is in security, but it also belongs in UI. Where do we put stuff like this? was (Author: rxin): Actually when I filed this ticket, I don't think there was any extensive high level grouping of config options (there was maybe two or three). I took a look at the latest grouping and they looked reasonable. A few problems with the existing grouping: 1. Networking section actually contains some shuffle settings. Those should be moved into shuffle. 2. Application Properties contains some serialization settings. Those should be moved into Compression and Serialization. 3. Runtime Environment contains some Mesos specific setting. All the way down on the page we link to each resource manager's own page for resource manager specific settings. 4. Scheduling section contains a mesos specific setting (spark.mesos.coarse). 5. It is sort of arbitrary that "Dynamic allocation" has its own section (at the very least, "a" should be upper case.) 6. "spark.ui.view.acls" is in security, but it should also belong in UI. Where do we put stuff like this? > Sort the configuration parameters in configuration.md > - > > Key: SPARK-1182 > URL: https://issues.apache.org/jira/browse/SPARK-1182 > Project: Spark > Issue Type: Task > Components: Documentation >Reporter: Reynold Xin >Assignee: prashant >Priority: Minor > > It is a little bit confusing right now since the config options are all over > the place in some arbitrarily sorted order. > https://github.com/apache/spark/blob/master/docs/configuration.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1182) Sort the configuration parameters in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334361#comment-14334361 ] Reynold Xin edited comment on SPARK-1182 at 2/24/15 4:01 AM: - Actually when I filed this ticket, I don't think there was extensive high level grouping of config options (there was maybe two or three). I took a look at the latest grouping and they looked reasonable. A few problems with the existing grouping: 1. Networking section actually contains some shuffle settings. Those should be moved into shuffle. 2. Application Properties contains some serialization settings. Those should be moved into Compression and Serialization. 3. Runtime Environment contains some Mesos specific setting. All the way down on the page we link to each resource manager's own page for resource manager specific settings. 4. Scheduling section contains a mesos specific setting (spark.mesos.coarse). 5. It is sort of arbitrary that "Dynamic allocation" has its own section (at the very least, "a" should be upper case.) 6. "spark.ui.view.acls" is in security, but it also belongs in UI. Where do we put stuff like this? was (Author: rxin): Actually when I filed this ticket, I don't think there was any extensive high level grouping of config options (there was maybe two or three). I took a look at the latest grouping and they looked reasonable. A few problems with the existing grouping: 1. Networking section actually contains some shuffle settings. Those should be moved into shuffle. 2. Application Properties contains some serialization settings. Those should be moved into Compression and Serialization. 3. Runtime Environment contains some Mesos specific setting. All the way down on the page we link to each resource manager's own page for resource manager specific settings. 4. Scheduling section contains a mesos specific setting (spark.mesos.coarse). 5. It is sort of arbitrary that "Dynamic allocation" has its own section (at the very least, "a" should be upper case.) 6. "spark.ui.view.acls" is in security, but it also belongs in UI. Where do we put stuff like this? > Sort the configuration parameters in configuration.md > - > > Key: SPARK-1182 > URL: https://issues.apache.org/jira/browse/SPARK-1182 > Project: Spark > Issue Type: Task > Components: Documentation >Reporter: Reynold Xin >Assignee: prashant >Priority: Minor > > It is a little bit confusing right now since the config options are all over > the place in some arbitrarily sorted order. > https://github.com/apache/spark/blob/master/docs/configuration.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1182) Sort the configuration parameters in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334361#comment-14334361 ] Reynold Xin commented on SPARK-1182: Actually when I filed this ticket, I don't think there was any extensive high level grouping of config options (there was maybe two or three). I took a look at the latest grouping and they looked reasonable. A few problems with the existing grouping: 1. Networking section actually contains some shuffle settings. Those should be moved into shuffle. 2. Application Properties contains some serialization settings. Those should be moved into Compression and Serialization. 3. Runtime Environment contains some Mesos specific setting. All the way down on the page we link to each resource manager's own page for resource manager specific settings. 4. Scheduling section contains a mesos specific setting (spark.mesos.coarse). 5. It is sort of arbitrary that "Dynamic allocation" has its own section (at the very least, "a" should be upper case.) 6. "spark.ui.view.acls" is in security, but it should also belong in UI. Where do we put stuff like this? > Sort the configuration parameters in configuration.md > - > > Key: SPARK-1182 > URL: https://issues.apache.org/jira/browse/SPARK-1182 > Project: Spark > Issue Type: Task > Components: Documentation >Reporter: Reynold Xin >Assignee: prashant >Priority: Minor > > It is a little bit confusing right now since the config options are all over > the place in some arbitrarily sorted order. > https://github.com/apache/spark/blob/master/docs/configuration.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5964) Allow spark-daemon.sh to support foreground operation
[ https://issues.apache.org/jira/browse/SPARK-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334359#comment-14334359 ] Apache Spark commented on SPARK-5964: - User 'hellertime' has created a pull request for this issue: https://github.com/apache/spark/pull/3881 > Allow spark-daemon.sh to support foreground operation > - > > Key: SPARK-5964 > URL: https://issues.apache.org/jira/browse/SPARK-5964 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Chris Heller >Priority: Minor > > Add --foreground option to spark-daemon.sh to prevent the process from > daemonizing itself. Useful if running under a watchdog which waits on its > child process. > https://github.com/apache/spark/pull/3881 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5964) Allow spark-daemon.sh to support foreground operation
Chris Heller created SPARK-5964: --- Summary: Allow spark-daemon.sh to support foreground operation Key: SPARK-5964 URL: https://issues.apache.org/jira/browse/SPARK-5964 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Chris Heller Priority: Minor Add --foreground option to spark-daemon.sh to prevent the process from daemonizing itself. Useful if running under a watchdog which waits on its child process. https://github.com/apache/spark/pull/3881 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334355#comment-14334355 ] Nicholas Chammas commented on SPARK-5312: - Yeah, this is not a priority really. I looked into sbt and agree it's probably not suited to the task. I found something else that looks interesting: http://software.clapper.org/classutil/ But I don't have time to look into it. > Use sbt to detect new or changed public classes in PRs > -- > > Key: SPARK-5312 > URL: https://issues.apache.org/jira/browse/SPARK-5312 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > We currently use an [unwieldy grep/sed > contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] > to detect new public classes in PRs. > Apparently, sbt lets you get a list of public classes [much more > directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via > {{show compile:discoveredMainClasses}}. We should use that instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
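Setting sbt and classutil aside, the output wanted here is just "the list of public classes in the assembly jar", which two builds can then diff. A rough plain-JVM illustration of producing that list (it loads classes via Class.forName, so it needs the jar's dependencies on the classpath and is slow; a sketch of the idea rather than a drop-in replacement for the grep/sed contraption):
{code}
import java.lang.reflect.Modifier
import java.util.jar.JarFile
import scala.collection.JavaConverters._

// List public, non-inner classes found in a jar, sorted for diffing.
def publicClasses(jarPath: String): Seq[String] = {
  val names = new JarFile(jarPath).entries().asScala
    .map(_.getName)
    .filter(n => n.endsWith(".class") && !n.contains("$"))
    .map(_.stripSuffix(".class").replace('/', '.'))
    .toSeq
  names.filter { name =>
    val loader = Thread.currentThread().getContextClassLoader
    try Modifier.isPublic(Class.forName(name, false, loader).getModifiers)
    catch { case _: Throwable => false }
  }.sorted
}
{code}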
[jira] [Commented] (SPARK-4123) Show new dependencies added in pull requests
[ https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334352#comment-14334352 ] Nicholas Chammas commented on SPARK-4123: - Go ahead! I haven't done anything for this yet. > Show new dependencies added in pull requests > > > Key: SPARK-4123 > URL: https://issues.apache.org/jira/browse/SPARK-4123 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Priority: Critical > > We should inspect the classpath of Spark's assembly jar for every pull > request. This only takes a few seconds in Maven and it will help weed out > dependency changes from the master branch. Ideally we'd post any dependency > changes in the pull request message. > {code} > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath > $ git checkout apache/master > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath > $ diff my-classpath master-classpath > < chill-java-0.3.6.jar > < chill_2.10-0.3.6.jar > --- > > chill-java-0.5.0.jar > > chill_2.10-0.5.0.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334351#comment-14334351 ] Nicholas Chammas commented on SPARK-3850: - {quote} enabled="false" {quote} Per the parent issue SPARK-3849, I believe this issue about enabling this rule in a non-intrusive way. So I think we still need this issue. > Scala style: disallow trailing spaces > - > > Key: SPARK-3850 > URL: https://issues.apache.org/jira/browse/SPARK-3850 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Reporter: Nicholas Chammas > > [Ted Yu on the dev > list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] > suggested using {{WhitespaceEndOfLineChecker}} here: > http://www.scalastyle.org/rules-0.1.0.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1182) Sort the configuration parameters in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334338#comment-14334338 ] Reynold Xin commented on SPARK-1182: cc [~pwendell] what do you think? I think it is probably still worth it to do that, since the config page is sort of a mess. The backporting documentation fix thing should be small. I don't think we backport documentation fixes that often. > Sort the configuration parameters in configuration.md > - > > Key: SPARK-1182 > URL: https://issues.apache.org/jira/browse/SPARK-1182 > Project: Spark > Issue Type: Task > Components: Documentation >Reporter: Reynold Xin >Assignee: prashant >Priority: Minor > > It is a little bit confusing right now since the config options are all over > the place in some arbitrarily sorted order. > https://github.com/apache/spark/blob/master/docs/configuration.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5960: - Assignee: Chris Fregly > Allow AWS credentials to be passed to KinesisUtils.createStream() > - > > Key: SPARK-5960 > URL: https://issues.apache.org/jira/browse/SPARK-5960 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > > While IAM roles are preferable, we're seeing a lot of cases where we need to > pass AWS credentials when creating the KinesisReceiver. > Notes: > * Make sure we don't log the credentials anywhere > * Maintain compatibility with existing KinesisReceiver-based code. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
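The credential-handling part of the change could be as small as preferring explicit keys when the caller supplies them and otherwise keeping today's behaviour, i.e. the default provider chain that IAM roles already satisfy. Parameter names and placement are illustrative, not the final KinesisUtils API, and per the note above the resolved credentials must never be logged:
{code}
import com.amazonaws.auth.{AWSCredentials, BasicAWSCredentials, DefaultAWSCredentialsProviderChain}

// Illustrative helper: explicit keys win, otherwise fall back to the provider chain.
def resolveCredentials(accessKeyId: Option[String], secretKey: Option[String]): AWSCredentials =
  (accessKeyId, secretKey) match {
    case (Some(id), Some(key)) => new BasicAWSCredentials(id, key)
    case _ => new DefaultAWSCredentialsProviderChain().getCredentials
  }
{code}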
[jira] [Updated] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
[ https://issues.apache.org/jira/browse/SPARK-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5959: - Assignee: Chris Fregly > Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver > - > > Key: SPARK-5959 > URL: https://issues.apache.org/jira/browse/SPARK-5959 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > > After each block is stored reliably in the WAL (after the store() call > returns), ACK back to Kinesis. > There is still the issue of the ReliableKinesisReceiver dying before the ACK > back to Kinesis, however no data will be lost. Duplicate data is still > possible. > Notes: > * Make sure we're not overloading the checkpoint control plane which uses > DynamoDB. > * May need to disable auto-checkpointing and remove the checkpoint interval. > * Maintain compatibility with existing KinesisReceiver-based code. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5790) VertexRDD's won't zip properly for `diff` capability
[ https://issues.apache.org/jira/browse/SPARK-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334292#comment-14334292 ] Takeshi Yamamuro commented on SPARK-5790: - Thanks for your work :) > VertexRDD's won't zip properly for `diff` capability > > > Key: SPARK-5790 > URL: https://issues.apache.org/jira/browse/SPARK-5790 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Brennon York >Assignee: Brennon York > > For VertexRDD's with differing partition sizes one cannot run commands like > `diff` as it will thrown an IllegalArgumentException. The code below provides > an example: > {code} > import org.apache.spark.graphx._ > import org.apache.spark.rdd._ > val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 3L).map(id => > (id, id.toInt+1))) > setA.collect.foreach(println(_)) > val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => > (id, id.toInt+2))) > setB.collect.foreach(println(_)) > val diff = setA.diff(setB) > diff.collect.foreach(println(_)) > val setC: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => > (id, id.toInt+2)) ++ sc.parallelize(6L until 8L).map(id => (id, id.toInt+2))) > setA.diff(setC).collect > // java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of > partitions > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5517) Add input types for Java UDFs
[ https://issues.apache.org/jira/browse/SPARK-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334279#comment-14334279 ] Michael Armbrust commented on SPARK-5517: - Given the verbosity of expressing this in Java, I'm thinking we will always want it to be optional. In that case I propose we just add a parallel set of APIs once this information is actually used. Thus, we can bump this to 1.4. > Add input types for Java UDFs > - > > Key: SPARK-5517 > URL: https://issues.apache.org/jira/browse/SPARK-5517 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5517) Add input types for Java UDFs
[ https://issues.apache.org/jira/browse/SPARK-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5517: Target Version/s: 1.4.0 (was: 1.3.0) > Add input types for Java UDFs > - > > Key: SPARK-5517 > URL: https://issues.apache.org/jira/browse/SPARK-5517 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5517) Add input types for Java UDFs
[ https://issues.apache.org/jira/browse/SPARK-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-5517: --- Assignee: Michael Armbrust (was: Reynold Xin) > Add input types for Java UDFs > - > > Key: SPARK-5517 > URL: https://issues.apache.org/jira/browse/SPARK-5517 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5963) [MLLIB] Python support for Power Iteration Clustering
[ https://issues.apache.org/jira/browse/SPARK-5963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5963: - Assignee: Stephen Boesch > [MLLIB] Python support for Power Iteration Clustering > - > > Key: SPARK-5963 > URL: https://issues.apache.org/jira/browse/SPARK-5963 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Stephen Boesch >Assignee: Stephen Boesch > Original Estimate: 168h > Remaining Estimate: 168h > > Add python support for the Power Iteration Clustering feature. Here is a > fragment of the python API as we plan to implement it: > /** >* Java stub for Python mllib PowerIterationClustering.run() >*/ > def trainPowerIterationClusteringModel( > data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)], > k: Int, > maxIterations: Int, > runs: Int, > initializationMode: String, > seed: java.lang.Long): PowerIterationClusteringModel = { > val picAlg = new PowerIterationClustering() > .setK(k) > .setMaxIterations(maxIterations) > try { > picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK)) > } finally { > data.rdd.unpersist(blocking = false) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334265#comment-14334265 ] Apache Spark commented on SPARK-5532: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/4738 > Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT > > > Key: SPARK-5532 > URL: https://issues.apache.org/jira/browse/SPARK-5532 > Project: Spark > Issue Type: Bug > Components: MLlib, SQL >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Michael Armbrust >Priority: Critical > > Deterministic failure: > {code} > import org.apache.spark.mllib.linalg._ > import org.apache.spark.sql.SQLContext > val sqlContext = new SQLContext(sc) > import sqlContext._ > val data = sc.parallelize(Seq((1.0, > Vectors.dense(1,2,3.toDataFrame("label", "features") > data.repartition(1).saveAsParquetFile("blah") > {code} > If you remove the repartition, then this succeeds. > Here's the stack trace: > {code} > 15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, > 192.168.1.230): java.lang.ClassCastException: > org.apache.spark.mllib.linalg.DenseVector cannot be cast to > org.apache.spark.sql.Row > at > org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186) > at > org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177) > at > org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166) > at > org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129) > at > parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) > at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) > at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; > aborting job > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 7, 192.168.1.230): java.lang.ClassCastException: > org.apache.spark.mllib.linalg.DenseVector cannot be cast to > org.apache.spark.sql.Row > at > org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186) > at > org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177) > at > org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166) > at > org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129) > at > parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) > at 
parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) > at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: >
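The repro snippet quoted in the description above appears to have lost several closing parentheses in transit. A plausible reconstruction, assuming a three-element dense vector and the 1.3-era toDataFrame helper used in the original report:
{code}
import org.apache.spark.mllib.linalg._
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

// One (label, features) row; the repartition before the Parquet save is what
// triggers the DenseVector-to-Row ClassCastException in the report.
val data = sc.parallelize(Seq((1.0, Vectors.dense(1, 2, 3)))).toDataFrame("label", "features")
data.repartition(1).saveAsParquetFile("blah")
{code}
Per the description, removing the repartition call makes the save succeed.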
[jira] [Created] (SPARK-5962) [MLLIB] Python support for Power Iteration Clustering
Stephen Boesch created SPARK-5962: - Summary: [MLLIB] Python support for Power Iteration Clustering Key: SPARK-5962 URL: https://issues.apache.org/jira/browse/SPARK-5962 Project: Spark Issue Type: Bug Components: MLlib Reporter: Stephen Boesch Add python support for the Power Iteration Clustering feature. Here is a fragment of the python API as we plan to implement it:
/**
 * Java stub for Python mllib PowerIterationClustering.run()
 */
def trainPowerIterationClusteringModel(
    data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)],
    k: Int,
    maxIterations: Int,
    runs: Int,
    initializationMode: String,
    seed: java.lang.Long): PowerIterationClusteringModel = {
  val picAlg = new PowerIterationClustering()
    .setK(k)
    .setMaxIterations(maxIterations)
  try {
    picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK))
  } finally {
    data.rdd.unpersist(blocking = false)
  }
}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
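For context, a minimal driver-side sketch of the existing Scala API that this stub wraps, assuming the 1.3 MLlib PowerIterationClustering interface (the similarity values and k are arbitrary example inputs):
{code}
import org.apache.spark.mllib.clustering.PowerIterationClustering

// Pairwise similarities as (srcId, dstId, similarity) tuples, non-negative and symmetric.
val similarities = sc.parallelize(Seq((0L, 1L, 0.9), (1L, 2L, 0.9), (0L, 2L, 0.1)))

val model = new PowerIterationClustering()
  .setK(2)
  .setMaxIterations(20)
  .run(similarities)

// Inspect the clustering produced by the power iteration.
println(s"k = ${model.k}")
model.assignments.take(3).foreach(println)
{code}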
[jira] [Created] (SPARK-5963) [MLLIB] Python support for Power Iteration Clustering
Stephen Boesch created SPARK-5963: - Summary: [MLLIB] Python support for Power Iteration Clustering Key: SPARK-5963 URL: https://issues.apache.org/jira/browse/SPARK-5963 Project: Spark Issue Type: Bug Components: MLlib Reporter: Stephen Boesch Add python support for the Power Iteration Clustering feature. Here is a fragment of the python API as we plan to implement it:
/**
 * Java stub for Python mllib PowerIterationClustering.run()
 */
def trainPowerIterationClusteringModel(
    data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)],
    k: Int,
    maxIterations: Int,
    runs: Int,
    initializationMode: String,
    seed: java.lang.Long): PowerIterationClusteringModel = {
  val picAlg = new PowerIterationClustering()
    .setK(k)
    .setMaxIterations(maxIterations)
  try {
    picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK))
  } finally {
    data.rdd.unpersist(blocking = false)
  }
}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
[ https://issues.apache.org/jira/browse/SPARK-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334256#comment-14334256 ] Chris Fregly commented on SPARK-5959: - Checkpointing at a specific sequence number is supported by the IRecordProcessorCheckpointer interface. > Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver > - > > Key: SPARK-5959 > URL: https://issues.apache.org/jira/browse/SPARK-5959 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly > > After each block is stored reliably in the WAL (after the store() call > returns), ACK back to Kinesis. > There is still the issue of the ReliableKinesisReceiver dying before the ACK > back to Kinesis, however no data will be lost. Duplicate data is still > possible. > Notes: > * Make sure we're not overloading the checkpoint control plane which uses > DynamoDB. > * May need to disable auto-checkpointing and remove the checkpoint interval. > * Maintain compatibility with existing KinesisReceiver-based code. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
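A rough sketch of what ACKing at a specific sequence number could look like, assuming the KCL IRecordProcessorCheckpointer overload that takes a sequence number; this is illustrative only, not the actual Spark receiver code:
{code}
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer

// Hypothetical helper: called only after store() has returned, i.e. after the
// block containing `lastStoredSeqNumber` is safely written to the WAL.
def ackStoredRecords(checkpointer: IRecordProcessorCheckpointer, lastStoredSeqNumber: String): Unit = {
  // Checkpoint up to and including the given sequence number; anything after it
  // is redelivered if the receiver dies, so duplicates remain possible.
  checkpointer.checkpoint(lastStoredSeqNumber)
}
{code}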
[jira] [Created] (SPARK-5961) Allow specific nodes in a Spark Streaming cluster to be dedicated/preferred as Receiver Worker Nodes versus regular Spark Worker Nodes
Chris Fregly created SPARK-5961: --- Summary: Allow specific nodes in a Spark Streaming cluster to be dedicated/preferred as Receiver Worker Nodes versus regular Spark Worker Nodes Key: SPARK-5961 URL: https://issues.apache.org/jira/browse/SPARK-5961 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly This type of configuration has come up a lot where certain nodes should be dedicated as Spark Streaming Receivers and others as regular Spark Workers. The reasons include the following: 1) Different instance types/sizes for Receivers vs regular Workers 2) Different OS tuning params for Receivers vs regular Workers ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
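Until such scheduling support exists, one partial workaround is a custom receiver that overrides the preferredLocation hook already exposed by org.apache.spark.streaming.receiver.Receiver. A minimal sketch, where the host name is a placeholder for a dedicated receiver node:
{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch only: hint the scheduler to place this receiver on a specific host.
class PinnedReceiver(host: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Preferred host for running this receiver.
  override def preferredLocation: Option[String] = Some(host)

  def onStart(): Unit = { /* start the source thread and call store(...) */ }
  def onStop(): Unit = { }
}
{code}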
[jira] [Updated] (SPARK-5535) Add parameter for storage levels.
[ https://issues.apache.org/jira/browse/SPARK-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5535: - Assignee: (was: Xiangrui Meng) > Add parameter for storage levels. > - > > Key: SPARK-5535 > URL: https://issues.apache.org/jira/browse/SPARK-5535 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng > > Add a special parameter type for storage levels that takes both StorageLevels > and their string representation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
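A minimal sketch of how such a parameter could accept either form, assuming StorageLevel.fromString for the string case (the helper name here is hypothetical, not the eventual ML Param API):
{code}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: normalize either a StorageLevel or its string name.
def asStorageLevel(value: Any): StorageLevel = value match {
  case sl: StorageLevel => sl
  case s: String        => StorageLevel.fromString(s) // e.g. "MEMORY_AND_DISK"
  case other            => throw new IllegalArgumentException(s"Cannot convert $other to a StorageLevel")
}
{code}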
[jira] [Closed] (SPARK-2138) The KMeans algorithm in the MLlib can lead to the Serialized Task size become bigger and bigger
[ https://issues.apache.org/jira/browse/SPARK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-2138. Resolution: Not a Problem Target Version/s: (was: 1.4.0) I'm closing this issue with setting a larger akka.frameSize as the workaround. SPARK-3424 might be relevant. > The KMeans algorithm in the MLlib can lead to the Serialized Task size become > bigger and bigger > --- > > Key: SPARK-2138 > URL: https://issues.apache.org/jira/browse/SPARK-2138 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 0.9.0, 0.9.1 >Reporter: DjvuLee >Assignee: Xiangrui Meng > Labels: clustering > > When the algorithm running at certain stage, when running the reduceBykey() > function, It can lead to Executor Lost and Task lost, after several times. > the application exit. > When this error occurred, the size of serialized task is bigger than 10MB, > and the size become larger as the iteration increase. > the data generation file: https://gist.github.com/djvulee/7e3b2c9eb33ff0037622 > the running code: https://gist.github.com/djvulee/6bf00e60885215e3bfd5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
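For reference, the workaround mentioned above is a plain configuration change; a minimal sketch, where 128 MB is only an example value:
{code}
import org.apache.spark.SparkConf

// spark.akka.frameSize is specified in MB; raise it when serialized tasks exceed the default.
val conf = new SparkConf()
  .setAppName("kmeans-job")
  .set("spark.akka.frameSize", "128")
{code}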
[jira] [Created] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
Chris Fregly created SPARK-5960: --- Summary: Allow AWS credentials to be passed to KinesisUtils.createStream() Key: SPARK-5960 URL: https://issues.apache.org/jira/browse/SPARK-5960 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly While IAM roles are preferable, we're seeing a lot of cases where we need to pass AWS credentials when creating the KinesisReceiver. Notes: * Make sure we don't log the credentials anywhere * Maintain compatibility with existing KinesisReceiver-based code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
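Purely to illustrate the request (this is not an existing Spark API), an overload accepting explicit credentials might look like the following; parameter names, ordering, and the return type are assumptions mirroring the current createStream parameters:
{code}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Duration, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream

// Hypothetical signature sketch only.
def createStream(
    ssc: StreamingContext,
    streamName: String,
    endpointUrl: String,
    checkpointInterval: Duration,
    initialPositionInStream: InitialPositionInStream,
    storageLevel: StorageLevel,
    awsAccessKeyId: String,
    awsSecretKey: String): ReceiverInputDStream[Array[Byte]] = ???
{code}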
[jira] [Updated] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
[ https://issues.apache.org/jira/browse/SPARK-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-5959: Component/s: Streaming Description: After each block is stored reliably in the WAL (after the store() call returns), ACK back to Kinesis. There is still the issue of the ReliableKinesisReceiver dying before the ACK back to Kinesis, however no data will be lost. Duplicate data is still possible. Notes: * Make sure we're not overloading the checkpoint control plane which uses DynamoDB. * May need to disable auto-checkpointing and remove the checkpoint interval. * Maintain compatibility with existing KinesisReceiver-based code. was: After each block is stored reliably in the WAL (after the store() call returns), ACK back to Kinesis. There is still the issue of the ReliableKinesisReceiver dying before the ACK back to Kinesis, however no data will be lost. Duplicate data is still possible. Notes * Make sure we're not overloading the checkpoint control plane which uses DynamoDB. * May need to disable auto-checkpointing and remove the checkpoint interval. Target Version/s: 1.4.0 Affects Version/s: 1.1.0 Fix Version/s: (was: 1.4.0) > Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver > - > > Key: SPARK-5959 > URL: https://issues.apache.org/jira/browse/SPARK-5959 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly > > After each block is stored reliably in the WAL (after the store() call > returns), ACK back to Kinesis. > There is still the issue of the ReliableKinesisReceiver dying before the ACK > back to Kinesis, however no data will be lost. Duplicate data is still > possible. > Notes: > * Make sure we're not overloading the checkpoint control plane which uses > DynamoDB. > * May need to disable auto-checkpointing and remove the checkpoint interval. > * Maintain compatibility with existing KinesisReceiver-based code. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
Chris Fregly created SPARK-5959: --- Summary: Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver Key: SPARK-5959 URL: https://issues.apache.org/jira/browse/SPARK-5959 Project: Spark Issue Type: Improvement Reporter: Chris Fregly Fix For: 1.4.0 After each block is stored reliably in the WAL (after the store() call returns), ACK back to Kinesis. There is still the issue of the ReliableKinesisReceiver dying before the ACK back to Kinesis, however no data will be lost. Duplicate data is still possible. Notes * Make sure we're not overloading the checkpoint control plane which uses DynamoDB. * May need to disable auto-checkpointing and remove the checkpoint interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5958) Update block matrix user guide
Xiangrui Meng created SPARK-5958: Summary: Update block matrix user guide Key: SPARK-5958 URL: https://issues.apache.org/jira/browse/SPARK-5958 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor There are some minor issues (see the PR) with the current block matrix user guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5958) Update block matrix user guide
[ https://issues.apache.org/jira/browse/SPARK-5958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334231#comment-14334231 ] Apache Spark commented on SPARK-5958: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4737 > Update block matrix user guide > -- > > Key: SPARK-5958 > URL: https://issues.apache.org/jira/browse/SPARK-5958 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > > There are some minor issues (see the PR) with the current block matrix user > guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5910) DataFrame.selectExpr("col as newName") does not work
[ https://issues.apache.org/jira/browse/SPARK-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334230#comment-14334230 ] Apache Spark commented on SPARK-5910: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/4736 > DataFrame.selectExpr("col as newName") does not work > > > Key: SPARK-5910 > URL: https://issues.apache.org/jira/browse/SPARK-5910 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > {code} > val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}""")) > sqlContext.jsonRDD(rdd).selectExpr("a as newName") > {code} > {code} > java.lang.RuntimeException: [1.3] failure: ``or'' expected but `as' found > a as newName > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45) > {code} > For selectExpr, we need to use projection parser instead of expression parser > (which cannot parse AS). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
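Until the parser handles AS, the same aliasing works through the Column API rather than selectExpr; a minimal sketch against the 1.3 DataFrame API, reusing the names from the report:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
val df = sqlContext.jsonRDD(rdd)

// Equivalent of selectExpr("a as newName") without going through the expression parser.
val renamed = df.select(df("a").as("newName"))
{code}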
[jira] [Assigned] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-5532: --- Assignee: Michael Armbrust (was: Cheng Lian) > Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT > > > Key: SPARK-5532 > URL: https://issues.apache.org/jira/browse/SPARK-5532 > Project: Spark > Issue Type: Bug > Components: MLlib, SQL >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Michael Armbrust >Priority: Critical > > Deterministic failure: > {code} > import org.apache.spark.mllib.linalg._ > import org.apache.spark.sql.SQLContext > val sqlContext = new SQLContext(sc) > import sqlContext._ > val data = sc.parallelize(Seq((1.0, > Vectors.dense(1,2,3.toDataFrame("label", "features") > data.repartition(1).saveAsParquetFile("blah") > {code} > If you remove the repartition, then this succeeds. > Here's the stack trace: > {code} > 15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, > 192.168.1.230): java.lang.ClassCastException: > org.apache.spark.mllib.linalg.DenseVector cannot be cast to > org.apache.spark.sql.Row > at > org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186) > at > org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177) > at > org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166) > at > org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129) > at > parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) > at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) > at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; > aborting job > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 7, 192.168.1.230): java.lang.ClassCastException: > org.apache.spark.mllib.linalg.DenseVector cannot be cast to > org.apache.spark.sql.Row > at > org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186) > at > org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177) > at > org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166) > at > org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129) > at > parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) > at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) > at 
parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) > at > org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$
[jira] [Resolved] (SPARK-5873) Can't see partially analyzed plans
[ https://issues.apache.org/jira/browse/SPARK-5873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5873. - Resolution: Fixed Fix Version/s: 1.3.0 > Can't see partially analyzed plans > -- > > Key: SPARK-5873 > URL: https://issues.apache.org/jira/browse/SPARK-5873 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > Fix For: 1.3.0 > > > Our analysis checks are great for users who make mistakes but make it > impossible to see what is going wrong when there is a bug in the analyzer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5722. - Resolution: Fixed Fix Version/s: 1.2.2 Issue resolved by pull request 4681 [https://github.com/apache/spark/pull/4681] > Infer_schema_type incorrect for Integers in pyspark > --- > > Key: SPARK-5722 > URL: https://issues.apache.org/jira/browse/SPARK-5722 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 >Reporter: Don Drake > Fix For: 1.2.2 > > > The Integers datatype in Python does not match what a Scala/Java integer is > defined as. This causes inference of data types and schemas to fail when > data is larger than 2^32 and it is inferred incorrectly as an Integer. > Since the range of valid Python integers is wider than Java Integers, this > causes problems when inferring Integer vs. Long datatypes. This will cause > problems when attempting to save SchemaRDD as Parquet or JSON. > Here's an example: > {code} > >>> sqlCtx = SQLContext(sc) > >>> from pyspark.sql import Row > >>> rdd = sc.parallelize([Row(f1='a', f2=100)]) > >>> srdd = sqlCtx.inferSchema(rdd) > >>> srdd.schema() > StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true))) > {code} > That number is a LongType in Java, but an Integer in python. We need to > check the value to see if it should really by a LongType when a IntegerType > is initially inferred. > More tests: > {code} > >>> from pyspark.sql import _infer_type > # OK > >>> print _infer_type(1) > IntegerType > # OK > >>> print _infer_type(2**31-1) > IntegerType > #WRONG > >>> print _infer_type(2**31) > #WRONG > IntegerType > >>> print _infer_type(2**61 ) > #OK > IntegerType > >>> print _infer_type(2**71 ) > LongType > {code} > Java Primitive Types defined: > http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html > Python Built-in Types: > https://docs.python.org/2/library/stdtypes.html#typesnumeric -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5935) Accept MapType in the schema provided to a JSON dataset.
[ https://issues.apache.org/jira/browse/SPARK-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5935. - Resolution: Fixed Fix Version/s: 1.3.0 > Accept MapType in the schema provided to a JSON dataset. > > > Key: SPARK-5935 > URL: https://issues.apache.org/jira/browse/SPARK-5935 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > Fix For: 1.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
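A small sketch of the use case, assuming the 1.3 jsonRDD overload that takes a user-provided schema; the field names and types are arbitrary examples:
{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

val sqlContext = new SQLContext(sc)
val json = sc.parallelize(Seq("""{"name":"a","scores":{"math":1,"physics":2}}"""))

// Schema that reads the "scores" field as MapType(StringType, IntegerType).
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("scores", MapType(StringType, IntegerType), nullable = true)))

val df = sqlContext.jsonRDD(json, schema)
{code}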
[jira] [Comment Edited] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334166#comment-14334166 ] Marcelo Vanzin edited comment on SPARK-5124 at 2/24/15 1:08 AM: Hi [~zsxwing], I briefly looked over the design and parts of the implementation, and I have some comments. Overall this looks OK, even if it kinda does look a lot like akka's actor system, but I don't think there would be a problem implementing it over a different RPC library. One thing that I found very confusing is {{RpcEndpoint}} vs. {{NetworkRpcEndpoint}}. The "R" in RPC means "remote", which sort of implies networking. I think Reynold touched on this when he talked about using an event loop for "local actors". Perhaps a more flexible approach would be something like: - A generic {{Endpoint}} that defines the {{receive()}} method and other generic interfaces - {{RpcEnv}} takes an {{Endpoint}} and a name to register endpoints for remote connections. Things like "onConnected" and "onDisassociated" become messages passed to {{receive()}}, kinda like in akka today. This means that there's no need for a specialized {{RpcEndpoint}} interface. - "local" Endpoints could be exposed directly, without the need for an {{RpcEnv}}. For example, as a getter in {{SparkEnv}}. You could have some wrapper to expose {{send()}} and {{ask()}} so that the client interface looks similar for remote and local endpoints. - The default {{Endpoint}} has no thread-safety guarantees. You can wrap an {{Endpoint}} in an {{EventLoop}} if you want messages to be handled using a queue, or synchronize your {{receive()}} method (although that can block the dispatcher thread, which could be bad). But this would easily allow actors to process multiple messages concurrently if desired. What do you think? was (Author: vanzin): Hi [~zsxwing], I briefly looked over the design and parts of the implementation, and I have some comments. Overall this looks OK, even if it kinda does look a lot like akka's actor system, but I don't think there would be a problem implementing it over a different RPC library. One thing that I found very confusing is {{RpcEndpoint}} vs. {{NetworkRpcEndpoint}}. The "R" in RPC means "remote", which sort of implies networking. I think Reynold touched on this when he talked about using an event loop for "local actors". Perhaps a more flexible approach would be something like: - A generic {{Endpoint}} that defines the {{receive()}} method and other generic interfaces - {{RpcEnv}} takes an {{Endpoint}} and a name to register endpoints for remote connections. Things like "onConnected" and "onDisassociated" become messages passed to {{receive()}}, kinda like in akka today. This means that there's no need for a specialized {{RpcEndpoint}} interface. - "local" Endpoints could be exposed directly, without the need for an {{RpcEnv}}. For example, as a getter in {{SparkEnv}}. You could have some wrapper to expose {{send()}} and {{ask()}} so that the client interface looks similar for remote and local endpoints. - The default {{Endpoint}} has no thread-safety guarantees. You can wrap an {{Endpoint}} in an {{EventLoop}} if you want messages to be handled using a queue, or synchronize your {{receive()}} method (although that can block the dispatcher thread, which could be bad). But this would easily allow actors to process multiple messages concurrently is desired. What do you think? 
> Standardize internal RPC interface > -- > > Key: SPARK-5124 > URL: https://issues.apache.org/jira/browse/SPARK-5124 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Shixiong Zhu > Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf > > > In Spark we use Akka as the RPC layer. It would be great if we can > standardize the internal RPC interface to facilitate testing. This will also > provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
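To make the proposal above concrete, a rough Scala sketch of the shapes being discussed; the names and signatures are illustrative, not the eventual Spark API:
{code}
// Illustrative only: a generic endpoint with no thread-safety guarantees of its own.
trait Endpoint {
  def receive: PartialFunction[Any, Unit]
}

// Illustrative wrapper that serializes message handling through a single-threaded queue,
// mirroring the "wrap an Endpoint in an EventLoop" idea from the comment.
class QueuedEndpoint(underlying: Endpoint) extends Endpoint {
  private val queue = new java.util.concurrent.LinkedBlockingQueue[Any]()
  private val worker = new Thread(new Runnable {
    override def run(): Unit = {
      while (!Thread.currentThread().isInterrupted) {
        val msg = queue.take()
        if (underlying.receive.isDefinedAt(msg)) underlying.receive(msg)
      }
    }
  })
  worker.setDaemon(true)
  worker.start()

  // Messages are enqueued and handled one at a time on the worker thread.
  override def receive: PartialFunction[Any, Unit] = { case msg => queue.put(msg) }
}
{code}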
[jira] [Updated] (SPARK-5293) Enable Spark user applications to use different versions of Akka
[ https://issues.apache.org/jira/browse/SPARK-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-5293: -- Issue Type: Umbrella (was: Improvement) > Enable Spark user applications to use different versions of Akka > > > Key: SPARK-5293 > URL: https://issues.apache.org/jira/browse/SPARK-5293 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Reynold Xin > > A lot of Spark user applications are using (or want to use) Akka. Akka as a > whole can contribute great architectural simplicity and uniformity. However, > because Spark depends on Akka, it is not possible for users to rely on > different versions, and we have received many requests in the past asking for > help about this specific issue. For example, Spark Streaming might be used as > the receiver of Akka messages - but our dependency on Akka requires the > upstream Akka actors to also use the identical version of Akka. > Since our usage of Akka is limited (mainly for RPC and single-threaded event > loop), we can replace it with alternative RPC implementations and a common > event loop in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5214) Add EventLoop and change DAGScheduler to an EventLoop
[ https://issues.apache.org/jira/browse/SPARK-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-5214: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-5293 > Add EventLoop and change DAGScheduler to an EventLoop > - > > Key: SPARK-5214 > URL: https://issues.apache.org/jira/browse/SPARK-5214 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.3.0 > > > As per discussion in SPARK-5124, DAGScheduler can simply use a queue & event > loop to process events. It would be great when we want to decouple Akka in > the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5293) Enable Spark user applications to use different versions of Akka
[ https://issues.apache.org/jira/browse/SPARK-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334195#comment-14334195 ] Marcelo Vanzin commented on SPARK-5293: --- I converted this bug to an umbrella so that it's easier to track sub-tasks. Hope that's ok for everybody. > Enable Spark user applications to use different versions of Akka > > > Key: SPARK-5293 > URL: https://issues.apache.org/jira/browse/SPARK-5293 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Reynold Xin > > A lot of Spark user applications are using (or want to use) Akka. Akka as a > whole can contribute great architectural simplicity and uniformity. However, > because Spark depends on Akka, it is not possible for users to rely on > different versions, and we have received many requests in the past asking for > help about this specific issue. For example, Spark Streaming might be used as > the receiver of Akka messages - but our dependency on Akka requires the > upstream Akka actors to also use the identical version of Akka. > Since our usage of Akka is limited (mainly for RPC and single-threaded event > loop), we can replace it with alternative RPC implementations and a common > event loop in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-5124: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-5293 > Standardize internal RPC interface > -- > > Key: SPARK-5124 > URL: https://issues.apache.org/jira/browse/SPARK-5124 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Shixiong Zhu > Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf > > > In Spark we use Akka as the RPC layer. It would be great if we can > standardize the internal RPC interface to facilitate testing. This will also > provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334168#comment-14334168 ] Brennon York commented on SPARK-5312: - I'm not so sure the sbt {{show compile:discoveredMainClasses}} call will give the output we want in this case. For reference, the documentation for that call states that "sbt detects the classes with public, static main methods for use by the run method" and, to replicate the grep/sed code, we'd need an sbt call that gathers *all* {{public}} classes, not just the {{public static main}} classes. If the goal is merely to rework the grep/sed code I could see what I can do, but I'm not sure it's going to get much better. I checked the sbt docs for something resembling what we're looking for and it doesn't seem they have that feature. Thoughts? > Use sbt to detect new or changed public classes in PRs > -- > > Key: SPARK-5312 > URL: https://issues.apache.org/jira/browse/SPARK-5312 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > We currently use an [unwieldy grep/sed > contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] > to detect new public classes in PRs. > Apparently, sbt lets you get a list of public classes [much more > directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via > {{show compile:discoveredMainClasses}}. We should use that instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334166#comment-14334166 ] Marcelo Vanzin commented on SPARK-5124: --- Hi [~zsxwing], I briefly looked over the design and parts of the implementation, and I have some comments. Overall this looks OK, even if it kinda does look a lot like akka's actor system, but I don't think there would be a problem implementing it over a different RPC library. One thing that I found very confusing is {{RpcEndpoint}} vs. {{NetworkRpcEndpoint}}. The "R" in RPC means "remote", which sort of implies networking. I think Reynold touched on this when he talked about using an event loop for "local actors". Perhaps a more flexible approach would be something like: - A generic {{Endpoint}} that defines the {{receive()}} method and other generic interfaces - {{RpcEnv}} takes an {{Endpoint}} and a name to register endpoints for remote connections. Things like "onConnected" and "onDisassociated" become messages passed to {{receive()}}, kinda like in akka today. This means that there's no need for a specialized {{RpcEndpoint}} interface. - "local" Endpoints could be exposed directly, without the need for an {{RpcEnv}}. For example, as a getter in {{SparkEnv}}. You could have some wrapper to expose {{send()}} and {{ask()}} so that the client interface looks similar for remote and local endpoints. - The default {{Endpoint}} has no thread-safety guarantees. You can wrap an {{Endpoint}} in an {{EventLoop}} if you want messages to be handled using a queue, or synchronize your {{receive()}} method (although that can block the dispatcher thread, which could be bad). But this would easily allow actors to process multiple messages concurrently is desired. What do you think? > Standardize internal RPC interface > -- > > Key: SPARK-5124 > URL: https://issues.apache.org/jira/browse/SPARK-5124 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Reynold Xin >Assignee: Shixiong Zhu > Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf > > > In Spark we use Akka as the RPC layer. It would be great if we can > standardize the internal RPC interface to facilitate testing. This will also > provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3436) Streaming SVM
[ https://issues.apache.org/jira/browse/SPARK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334161#comment-14334161 ] Anand Iyer commented on SPARK-3436: --- What is the original Jira (that this duplicates)? > Streaming SVM > -- > > Key: SPARK-3436 > URL: https://issues.apache.org/jira/browse/SPARK-3436 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Liquan Pei > > Implement online learning with kernels according to > http://users.cecs.anu.edu.au/~williams/papers/P172.pdf > The algorithms proposed in the above paper are implemented in R > (http://users.cecs.anu.edu.au/~williams/papers/P172.pdf) and MADlib > (http://doc.madlib.net/latest/group__grp__kernmach.html) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5957) Better handling of default parameter values.
Xiangrui Meng created SPARK-5957: Summary: Better handling of default parameter values. Key: SPARK-5957 URL: https://issues.apache.org/jira/browse/SPARK-5957 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Xiangrui Meng We store the default value of a parameter in the Param instance. In many cases, the default value depends on the algorithm and on other parameters defined in the same algorithm. We need to think through a better approach to handling default parameter values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5956) Transformer/Estimator should be copyable.
Xiangrui Meng created SPARK-5956: Summary: Transformer/Estimator should be copyable. Key: SPARK-5956 URL: https://issues.apache.org/jira/browse/SPARK-5956 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Xiangrui Meng In a pipeline, we don't save additional params specified in `fit()` to transformers, because we should not modify them. The current solution is to store training parameters in the pipeline model and apply those parameters at `transform()`. A better solution would be making transformers copyable. Calling `.copy` on a transformer produces a new transformer with a different UID but same parameters. Then we can use the copied transformers in the pipeline model, with additional params stored. `copy` may not be a good name because it is not an exact copy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5886) Add LabelIndexer
[ https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334137#comment-14334137 ] Apache Spark commented on SPARK-5886: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4735 > Add LabelIndexer > > > Key: SPARK-5886 > URL: https://issues.apache.org/jira/browse/SPARK-5886 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > `LabelIndexer` takes a column of labels (raw categories) and outputs an > integer column with labels indexed by their frequency. > {code} > va li = new LabelIndexer() > .setInputCol("country") > .setOutputCol("countryIndex") > {code} > In the output column, we should store the label to index map as an ML > attribute. The index should be ordered by frequency, where the most frequent > label gets index 0, to enhance sparsity. > We can discuss whether this should index multiple columns at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5886) Add LabelIndexer
[ https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5886: - Assignee: Xiangrui Meng > Add LabelIndexer > > > Key: SPARK-5886 > URL: https://issues.apache.org/jira/browse/SPARK-5886 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > `LabelIndexer` takes a column of labels (raw categories) and outputs an > integer column with labels indexed by their frequency. > {code} > va li = new LabelIndexer() > .setInputCol("country") > .setOutputCol("countryIndex") > {code} > In the output column, we should store the label to index map as an ML > attribute. The index should be ordered by frequency, where the most frequent > label gets index 0, to enhance sparsity. > We can discuss whether this should index multiple columns at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334107#comment-14334107 ] ding commented on SPARK-4105: - I am taking annual leave from 2/14 - 2/26. Sorry to bring the inconvience. Best regards DingDing > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.3.0, 1.2.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > 
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Here's another occurrence of a similar error: > {code} > java.io.IOException: fai
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334059#comment-14334059 ] Yashwanth Rao Dannamaneni commented on SPARK-4105: -- I have a similar issue. I have a job which occasionally fails with this error. I am using Spark 1.2.0 with HDP 2.1.1. > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.3.0, 1.2.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > 
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Here's another occ
[jira] [Commented] (SPARK-4123) Show new dependencies added in pull requests
[ https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334049#comment-14334049 ] Brennon York commented on SPARK-4123: - [~nchammas] have you started this? If not I can take this off your hands, just let me know! > Show new dependencies added in pull requests > > > Key: SPARK-4123 > URL: https://issues.apache.org/jira/browse/SPARK-4123 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Priority: Critical > > We should inspect the classpath of Spark's assembly jar for every pull > request. This only takes a few seconds in Maven and it will help weed out > dependency changes from the master branch. Ideally we'd post any dependency > changes in the pull request message. > {code} > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath > $ git checkout apache/master > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath > $ diff my-classpath master-classpath > < chill-java-0.3.6.jar > < chill_2.10-0.3.6.jar > --- > > chill-java-0.5.0.jar > > chill_2.10-0.5.0.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun
[ https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334039#comment-14334039 ] Sandy Ryza commented on SPARK-5490: --- The relevant JIRA is SPARK-732, but it's marked as "Won't Fix", so it's unclear whether that's the solution. > KMeans costs can be incorrect if tasks need to be rerun > --- > > Key: SPARK-5490 > URL: https://issues.apache.org/jira/browse/SPARK-5490 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Labels: clustering > > KMeans uses accumulators to compute the cost of a clustering at each > iteration. > Each time a ShuffleMapTask completes, it increments the accumulators at the > driver. If a task runs twice because of failures, the accumulators get > incremented twice. > KMeans uses accumulators in ShuffleMapTasks. This means that a task's cost > can end up being double-counted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
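A tiny sketch of the pattern at issue: an accumulator updated inside a stage that feeds a shuffle, so a rerun task re-adds its contribution. This is illustrative only, not the actual KMeans code:
{code}
import org.apache.spark.SparkContext

def sumCosts(sc: SparkContext): Double = {
  // Accumulator updated from inside a ShuffleMapTask: if the task is rerun after a
  // failure, its updates are applied again and the total is double-counted.
  val cost = sc.accumulator(0.0)
  sc.parallelize(1 to 1000)
    .map { x => cost += x * 0.5; (x % 10, x) } // pre-shuffle side effect
    .reduceByKey(_ + _)
    .count()
  cost.value
}
{code}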
[jira] [Updated] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2336: - Labels: clustering features (was: clustering features newbie) > Approximate k-NN Models for MLLib > - > > Key: SPARK-2336 > URL: https://issues.apache.org/jira/browse/SPARK-2336 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Brian Gawalt >Priority: Minor > Labels: clustering, features > > After tackling the general k-Nearest Neighbor model as per > https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to > also offer approximate k-Nearest Neighbor. A promising approach would involve > building a kd-tree variant within from each partition, a la > http://www.autonlab.org/autonweb/14714.html?branch=1&language=2 > This could offer a simple non-linear ML model that can label new data with > much lower latency than the plain-vanilla kNN versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2308: - Labels: clustering (was: ) > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: RJ Nowling >Priority: Minor > Labels: clustering > Attachments: many_small_centers.pdf, uneven_centers.pdf > > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2344: - Labels: clustering (was: ) > Add Fuzzy C-Means algorithm to MLlib > > > Key: SPARK-2344 > URL: https://issues.apache.org/jira/browse/SPARK-2344 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Alex >Priority: Minor > Labels: clustering > Original Estimate: 1m > Remaining Estimate: 1m > > I would like to add a FCM (Fuzzy C-Means) algorithm to MLlib. > FCM is very similar to K - Means which is already implemented, and they > differ only in the degree of relationship each point has with each cluster: > (in FCM the relationship is in a range of [0..1] whether in K - Means its 0/1. > As part of the implementation I would like: > - create a base class for K- Means and FCM > - implement the relationship for each algorithm differently (in its class) > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3261: - Labels: clustering (was: ) > KMeans clusterer can return duplicate cluster centers > - > > Key: SPARK-3261 > URL: https://issues.apache.org/jira/browse/SPARK-3261 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 >Reporter: Derrick Burns >Assignee: Derrick Burns > Labels: clustering > > This is a bad design choice. I think that it is preferable to produce no > duplicate cluster centers. So instead of forcing the number of clusters to be > K, return at most K clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2335: - Labels: clustering features (was: features) > k-Nearest Neighbor classification and regression for MLLib > -- > > Key: SPARK-2335 > URL: https://issues.apache.org/jira/browse/SPARK-2335 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Brian Gawalt >Priority: Minor > Labels: clustering, features > > The k-Nearest Neighbor model for classification and regression problems is a > simple and intuitive approach, offering a straightforward path to creating > non-linear decision/estimation contours. Its downsides -- high variance > (sensitivity to the known training data set) and computational intensity for > estimating new point labels -- both play to Spark's big data strengths: lots > of data mitigates the variance concern; lots of workers mitigate the > computational latency. > We should include kNN models as options in MLLib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
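For reference, the model being proposed is simple enough to state in a few lines. This is a brute-force local sketch of kNN classification by majority vote (plain Scala, no Spark); the distributed version would parallelize the distance computations.
{code}
object KnnSketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Classify a query point by majority vote among its k nearest training points. */
  def classify(train: Seq[(Point, Int)], query: Point, k: Int): Int =
    train.sortBy { case (p, _) => dist(p, query) }
      .take(k)
      .groupBy { case (_, label) => label }
      .maxBy { case (_, group) => group.size }
      ._1
}
{code}
Regression is the same search with an average of the neighbours' values instead of a vote.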
[jira] [Updated] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2336: - Labels: clustering features newbie (was: features newbie) > Approximate k-NN Models for MLLib > - > > Key: SPARK-2336 > URL: https://issues.apache.org/jira/browse/SPARK-2336 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Brian Gawalt >Priority: Minor > Labels: clustering, features > > After tackling the general k-Nearest Neighbor model as per > https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to > also offer approximate k-Nearest Neighbor. A promising approach would involve > building a kd-tree variant within each partition, a la > http://www.autonlab.org/autonweb/14714.html?branch=1&language=2 > This could offer a simple non-linear ML model that can label new data with > much lower latency than the plain-vanilla kNN versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
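The per-partition structure of the proposal looks roughly like the sketch below: each partition performs a local neighbour search and only the local candidates are merged at the driver. Here the local search is brute force, which keeps the result exact for a single query; the kd-tree (or another spatial index) mentioned in the ticket would replace it to make the local step fast and approximate. This is an illustration, not the ticket's design.
{code}
import org.apache.spark.rdd.RDD

object ApproxKnnSketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** k nearest neighbours of `query`: each partition returns its local top-k
    * (brute force here; an index structure would replace this), then the driver
    * merges the small set of candidates. */
  def knn(data: RDD[Point], query: Point, k: Int): Array[Point] = {
    val candidates = data.mapPartitions { iter =>
      iter.toSeq.sortBy(p => dist(p, query)).take(k).iterator
    }.collect()
    candidates.sortBy(p => dist(p, query)).take(k)
  }
}
{code}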
[jira] [Updated] (SPARK-2138) The KMeans algorithm in the MLlib can lead to the Serialized Task size become bigger and bigger
[ https://issues.apache.org/jira/browse/SPARK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2138: - Labels: clustering (was: ) > The KMeans algorithm in the MLlib can lead to the Serialized Task size become > bigger and bigger > --- > > Key: SPARK-2138 > URL: https://issues.apache.org/jira/browse/SPARK-2138 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 0.9.0, 0.9.1 >Reporter: DjvuLee >Assignee: Xiangrui Meng > Labels: clustering > > When the algorithm reaches a certain stage and runs the reduceByKey() > function, executors and tasks can be lost; after several such failures the > application exits. > When this error occurs, the size of the serialized task is bigger than 10MB, > and it grows larger as the iterations increase. > the data generation file: https://gist.github.com/djvulee/7e3b2c9eb33ff0037622 > the running code: https://gist.github.com/djvulee/6bf00e60885215e3bfd5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
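The general Spark remedy for task binaries that grow with closure-captured state is to broadcast the shared data instead of capturing it in each iteration's closure. The sketch below shows that pattern for cluster centers; it is a generic illustration of the technique, not necessarily the fix that was applied in MLlib for this issue.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BroadcastCentersSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  /** Assign each point to its closest center. Broadcasting the centers keeps
    * them out of every serialized task closure, so the task size stays constant
    * across iterations instead of growing with the captured state. */
  def assign(sc: SparkContext, points: RDD[Point], centers: Array[Point]): RDD[Int] = {
    val bcCenters = sc.broadcast(centers)
    points.map { p =>
      val cs = bcCenters.value
      cs.indices.minBy(i => squaredDistance(cs(i), p))
    }
  }
}
{code}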
[jira] [Updated] (SPARK-3504) KMeans optimization: track distances and unmoved cluster centers across iterations
[ https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3504: - Labels: clustering (was: ) > KMeans optimization: track distances and unmoved cluster centers across > iterations > -- > > Key: SPARK-3504 > URL: https://issues.apache.org/jira/browse/SPARK-3504 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.2 >Reporter: Derrick Burns > Labels: clustering > > The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because > it recomputes all distances to all cluster centers on each iteration. In later > iterations of Lloyd's algorithm, points don't change clusters and clusters > don't move. > By 1) tracking which clusters move and 2) tracking for each point which > cluster it belongs to and the distance to that cluster, one can avoid > recomputing distances in many cases with very little increase in memory > requirements. > I implemented this new algorithm and the results were fantastic. Using 16 > c3.8xlarge machines on EC2, the clusterer converged in 13 iterations on > 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here > are the running times for the first 7 rounds: > 6 minutes and 42 seconds > 7 minutes and 7 seconds > 7 minutes and 13 seconds > 1 minute and 18 seconds > 30 seconds > 18 seconds > 12 seconds > Without this improvement, all rounds would have taken roughly 7 minutes, > resulting in Lloyd's iterations taking 7 * 13 = 91 minutes. In other words, > this improvement resulted in a reduction of roughly 75% in running time with > no loss of accuracy. > My implementation is a rewrite of the existing 1.0.2 implementation. It is > not a simple modification of the existing implementation. Please let me know > if you are interested in this new implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
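The bookkeeping described in the ticket can be summarized in a few lines: cache each point's distance to every center and only recompute the distances to centers that moved in the previous round (the distance to an unmoved center cannot have changed). This is a minimal local sketch of that idea, not Derrick Burns's actual implementation; the per-point distance cache is the extra memory the ticket mentions.
{code}
object KMeansTrackingSketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Reassign one point, reusing cached distances to centers that did not move.
    * `cached(i)` holds the last computed distance from the point to center i. */
  def reassign(p: Point, centers: Array[Point], moved: Array[Boolean],
               cached: Array[Double]): Int = {
    var best = 0
    var bestDist = Double.MaxValue
    var i = 0
    while (i < centers.length) {
      if (moved(i)) cached(i) = dist(p, centers(i)) // recompute only for moved centers
      if (cached(i) < bestDist) { bestDist = cached(i); best = i }
      i += 1
    }
    best
  }
}
{code}
In late iterations almost no centers move, which is why the per-round times above collapse from minutes to seconds.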
[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334038#comment-14334038 ] Brennon York commented on SPARK-3850: - This made it into the [master branch|https://github.com/apache/spark/blob/master/scalastyle-config.xml#L54] at some point already. We can close this issue. /cc [~srowen] [~pwendell] > Scala style: disallow trailing spaces > - > > Key: SPARK-3850 > URL: https://issues.apache.org/jira/browse/SPARK-3850 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Reporter: Nicholas Chammas > > [Ted Yu on the dev > list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] > suggested using {{WhitespaceEndOfLineChecker}} here: > http://www.scalastyle.org/rules-0.1.0.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3219: - Labels: clustering (was: ) > K-Means clusterer should support Bregman distance functions > --- > > Key: SPARK-3219 > URL: https://issues.apache.org/jira/browse/SPARK-3219 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Derrick Burns >Assignee: Derrick Burns > Labels: clustering > > The K-Means clusterer supports the Euclidean distance metric. However, it is > rather straightforward to support Bregman > (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) > distance functions which would increase the utility of the clusterer > tremendously. > I have modified the clusterer to support pluggable distance functions. > However, I notice that there are hundreds of outstanding pull requests. If > someone is willing to work with me to sponsor the work through the process, I > will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
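To illustrate what a pluggable distance function could look like, here is a hypothetical sketch: a small divergence trait with two implementations, squared Euclidean (what K-Means minimizes today) and generalized KL divergence (a Bregman divergence suitable for non-negative data). The names and shape are assumptions, not Derrick Burns's design or an MLlib API.
{code}
/** Hypothetical pluggable divergence for a generalized K-Means. */
trait BregmanDivergence extends Serializable {
  def divergence(x: Array[Double], y: Array[Double]): Double
}

/** Squared Euclidean distance, the divergence ordinary K-Means minimizes. */
object SquaredEuclidean extends BregmanDivergence {
  def divergence(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
}

/** Generalized KL divergence, appropriate for non-negative data such as counts. */
object KullbackLeibler extends BregmanDivergence {
  def divergence(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) =>
      if (a == 0.0) b else a * math.log(a / b) - a + b
    }.sum
}
{code}
The clusterer would then take a BregmanDivergence parameter and use it both for assignment and for the cost it reports.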
[jira] [Updated] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2429: - Labels: clustering (was: ) > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Labels: clustering > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
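The first approach suggested above (top-down, recursive application of KMeans) is easy to state as a sketch: repeatedly split the data with a flat 2-means step and recurse on each half. This is illustrative only; `twoMeans` is a hypothetical stand-in for whatever flat clusterer is used, and the real implementation would track the tree structure rather than just the leaves.
{code}
object BisectingSketch {
  type Point = Array[Double]

  /** Top-down divisive clustering: split the data in two, recurse on each half
    * until the requested number of leaf clusters is reached. */
  def bisect(points: Seq[Point], leaves: Int,
             twoMeans: Seq[Point] => (Seq[Point], Seq[Point])): Seq[Seq[Point]] =
    if (leaves <= 1 || points.size < 2) Seq(points)
    else {
      val (left, right) = twoMeans(points)
      bisect(left, leaves / 2, twoMeans) ++ bisect(right, leaves - leaves / 2, twoMeans)
    }
}
{code}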
[jira] [Updated] (SPARK-3439) Add Canopy Clustering Algorithm
[ https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3439: - Labels: clustering (was: ) > Add Canopy Clustering Algorithm > --- > > Key: SPARK-3439 > URL: https://issues.apache.org/jira/browse/SPARK-3439 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yu Ishikawa >Assignee: Muhammad-Ali A'rabi >Priority: Minor > Labels: clustering > > The canopy clustering algorithm is an unsupervised pre-clustering algorithm. > It is often used as a preprocessing step for the K-means algorithm or the > Hierarchical clustering algorithm. It is intended to speed up clustering > operations on large data sets, where using another algorithm directly may be > impractical due to the size of the data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
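For readers unfamiliar with canopy clustering, the classic (local, single-machine) procedure fits in a few lines: pick an arbitrary remaining point as a canopy center, add every point within a loose threshold T1 to that canopy, and remove from further consideration every point within a tight threshold T2 < T1. The sketch below is a plain Scala illustration with a cheap Euclidean distance; the thresholds are user-supplied, and a Spark version would distribute the scans.
{code}
object CanopySketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Classic canopy construction with thresholds T2 < T1. */
  def canopies(points: Seq[Point], t1: Double, t2: Double): Seq[(Point, Seq[Point])] = {
    require(t2 < t1, "T2 must be smaller than T1")
    var remaining = points
    val result = scala.collection.mutable.ArrayBuffer.empty[(Point, Seq[Point])]
    while (remaining.nonEmpty) {
      val center = remaining.head
      val members = remaining.filter(p => dist(center, p) < t1) // loose threshold
      result += ((center, members))
      remaining = remaining.filterNot(p => dist(center, p) < t2) // tight threshold
    }
    result.toSeq
  }
}
{code}
The canopies can then seed K-Means or restrict which point pairs a hierarchical clusterer compares.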
[jira] [Updated] (SPARK-3218) K-Means clusterer can fail on degenerate data
[ https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3218: - Labels: clustering (was: ) > K-Means clusterer can fail on degenerate data > - > > Key: SPARK-3218 > URL: https://issues.apache.org/jira/browse/SPARK-3218 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 >Reporter: Derrick Burns >Assignee: Derrick Burns > Labels: clustering > > The KMeans parallel implementation selects points to be cluster centers with > probability weighted by their distance to cluster centers. However, if there > are fewer than k DISTINCT points in the data set, this approach will fail. > Further, the recent checkin to work around this problem results in selection > of the same point repeatedly as a cluster center. > The fix is to allow fewer than k cluster centers to be selected. This > requires several changes to the code, as the number of cluster centers is > woven into the implementation. > I have a version of the code that addresses this problem, AND generalizes the > distance metric. However, I see that there are literally hundreds of > outstanding pull requests. If someone will commit to working with me to > sponsor the pull request, I will create it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
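The proposed fix ("allow fewer than k cluster centers") amounts to drawing initial centers from the distinct points only and accepting at most min(k, #distinct) of them. A minimal local sketch of that selection is below; it is an illustration of the idea, not the MLlib initialization code.
{code}
import scala.util.Random

object InitCentersSketch {
  type Point = Array[Double]

  /** Choose initial centers from the DISTINCT points only, returning at most
    * min(k, #distinct) centers rather than repeating a point when the data is
    * degenerate. */
  def chooseInitial(points: Seq[Point], k: Int, seed: Long = 0L): Seq[Point] = {
    val distinct = points.map(_.toVector).distinct.map(_.toArray)
    new Random(seed).shuffle(distinct).take(math.min(k, distinct.size))
  }
}
{code}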
[jira] [Updated] (SPARK-3220) K-Means clusterer should perform K-Means initialization in parallel
[ https://issues.apache.org/jira/browse/SPARK-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3220: - Labels: clustering (was: ) > K-Means clusterer should perform K-Means initialization in parallel > --- > > Key: SPARK-3220 > URL: https://issues.apache.org/jira/browse/SPARK-3220 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Derrick Burns > Labels: clustering > > The LocalKMeans method should be replaced with a parallel implementation. As > it stands now, it becomes a bottleneck for large data sets. > I have implemented this functionality in my version of the clusterer. > However, I see that there are hundreds of outstanding pull requests. If > someone on the team wants to sponsor the pull request, I will create one. > Otherwise, I will just maintain my own private fork of the clusterer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4039) KMeans support sparse cluster centers
[ https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4039: - Labels: clustering (was: ) > KMeans support sparse cluster centers > - > > Key: SPARK-4039 > URL: https://issues.apache.org/jira/browse/SPARK-4039 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Antoine Amend > Labels: clustering > > When the number of features is not known, it might be quite helpful to create > sparse vectors using HashingTF.transform. KMeans transforms the center vectors > to dense vectors > (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307), > therefore leading to OutOfMemory (even with small k). > Is there any way to keep the vectors sparse? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
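The workflow the reporter describes looks roughly like the spark-shell sketch below (the file name, feature count, and KMeans parameters are placeholders, and `sc` is assumed to be in scope). HashingTF emits sparse vectors, but because KMeans keeps its centers dense internally, a large numFeatures allocates k dense arrays of that length, which is the OutOfMemory described above.
{code}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Tokenized documents; "docs.txt" is a placeholder input.
val docs: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)

// HashingTF produces sparse vectors of dimension numFeatures (2^20 here).
val hashingTF = new HashingTF(1 << 20)
val tf: RDD[Vector] = hashingTF.transform(docs)

// KMeans densifies its cluster centers internally, so even a small k can
// allocate k dense arrays of length 2^20 -- the reported OOM.
val model = KMeans.train(tf.cache(), 10, 20)
{code}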
[jira] [Updated] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features
[ https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5272: - Labels: clustering (was: ) > Refactor NaiveBayes to support discrete and continuous labels,features > -- > > Key: SPARK-5272 > URL: https://issues.apache.org/jira/browse/SPARK-5272 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > Labels: clustering > > This JIRA is to discuss refactoring NaiveBayes in order to support both > discrete and continuous labels and features. > Currently, NaiveBayes supports only discrete labels and features. > Proposal: Generalize it to support continuous values as well. > Some items to discuss are: > * How commonly are continuous labels/features used in practice? (Is this > necessary?) > * What should the API look like? > ** E.g., should NB have multiple classes for each type of label/feature, or > should it take a general Factor type parameter? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun
[ https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5490: - Labels: clustering (was: ) > KMeans costs can be incorrect if tasks need to be rerun > --- > > Key: SPARK-5490 > URL: https://issues.apache.org/jira/browse/SPARK-5490 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Labels: clustering > > KMeans uses accumulators to compute the cost of a clustering at each > iteration. > Each time a ShuffleMapTask completes, it increments the accumulators at the > driver. If a task runs twice because of failures, the accumulators get > incremented twice. > KMeans uses accumulators in ShuffleMapTasks. This means that a task's cost > can end up being double-counted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
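The hazard described here is generic to accumulators updated inside transformations. The sketch below is a minimal illustration of the pattern (not the actual MLlib code): an accumulator incremented in a map stage that feeds a shuffle. If a ShuffleMapTask from that stage is re-executed, for example after an executor is lost, its accumulator updates are applied again and the reported cost is inflated.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions
import org.apache.spark.rdd.RDD

object AccumulatorHazardSketch {
  /** Sums costs via an accumulator updated inside a map before a shuffle.
    * If any map task is rerun, its updates are counted twice. */
  def sumOfCosts(sc: SparkContext, costs: RDD[Double]): Double = {
    val cost = sc.accumulator(0.0)
    costs.map { c => cost += c; (c > 1.0, c) } // accumulator update in a ShuffleMapTask
      .reduceByKey(_ + _)                      // shuffle boundary
      .count()
    cost.value
  }
}
{code}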
[jira] [Commented] (SPARK-5405) Spark clusterer should support high dimensional data
[ https://issues.apache.org/jira/browse/SPARK-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334037#comment-14334037 ] Derrick Burns commented on SPARK-5405: -- Agreed. > Spark clusterer should support high dimensional data > > > Key: SPARK-5405 > URL: https://issues.apache.org/jira/browse/SPARK-5405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Derrick Burns > Labels: clustering > Original Estimate: 504h > Remaining Estimate: 504h > > The MLLIB clusterer works well for low (<200) dimensional data. However, > performance is linear with the number of dimensions. So, for practical > purposes, it is not very useful for high dimensional data. > Depending on the data type, one can embed the high dimensional data into > lower dimensional spaces in a distance-preserving way. The Spark clusterer > should support such embedding. > An example implementation that supports high dimensional data is here: > https://github.com/derrickburns/generalized-kmeans-clustering -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5016: - Labels: clustering (was: ) > GaussianMixtureEM should distribute matrix inverse for large numFeatures, k > --- > > Key: SPARK-5016 > URL: https://issues.apache.org/jira/browse/SPARK-5016 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > Labels: clustering > > If numFeatures or k are large, GMM EM should distribute the matrix inverse > computation for Gaussian initialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5927) Modify FPGrowth's partition strategy to reduce transactions in partitions
[ https://issues.apache.org/jira/browse/SPARK-5927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-5927. Resolution: Won't Fix I'm closing this JIRA per discussion on the PR page. > Modify FPGrowth's partition strategy to reduce transactions in partitions > - > > Key: SPARK-5927 > URL: https://issues.apache.org/jira/browse/SPARK-5927 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Liang-Chi Hsieh >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5832) Add Affinity Propagation clustering algorithm
[ https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5832: - Labels: clustering (was: ) > Add Affinity Propagation clustering algorithm > - > > Key: SPARK-5832 > URL: https://issues.apache.org/jira/browse/SPARK-5832 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Labels: clustering > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun
[ https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5490: - Target Version/s: 1.4.0 > KMeans costs can be incorrect if tasks need to be rerun > --- > > Key: SPARK-5490 > URL: https://issues.apache.org/jira/browse/SPARK-5490 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > KMeans uses accumulators to compute the cost of a clustering at each > iteration. > Each time a ShuffleMapTask completes, it increments the accumulators at the > driver. If a task runs twice because of failures, the accumulators get > incremented twice. > KMeans uses accumulators in ShuffleMapTasks. This means that a task's cost > can end up being double-counted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun
[ https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334030#comment-14334030 ] Xiangrui Meng commented on SPARK-5490: -- [~sandyr] This is a bug in core. Could you link the JIRA? > KMeans costs can be incorrect if tasks need to be rerun > --- > > Key: SPARK-5490 > URL: https://issues.apache.org/jira/browse/SPARK-5490 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > KMeans uses accumulators to compute the cost of a clustering at each > iteration. > Each time a ShuffleMapTask completes, it increments the accumulators at the > driver. If a task runs twice because of failures, the accumulators get > incremented twice. > KMeans uses accumulators in ShuffleMapTasks. This means that a task's cost > can end up being double-counted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5405) Spark clusterer should support high dimensional data
[ https://issues.apache.org/jira/browse/SPARK-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334027#comment-14334027 ] Xiangrui Meng commented on SPARK-5405: -- Dimension reduction should be separated from the k-means implementation. We can add distance-preserving methods as feature transformers. [~derrickburns] Could you update the JIRA title? > Spark clusterer should support high dimensional data > > > Key: SPARK-5405 > URL: https://issues.apache.org/jira/browse/SPARK-5405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Derrick Burns > Labels: clustering > Original Estimate: 504h > Remaining Estimate: 504h > > The MLLIB clusterer works well for low (<200) dimensional data. However, > performance is linear with the number of dimensions. So, for practical > purposes, it is not very useful for high dimensional data. > Depending on the data type, one can embed the high dimensional data into > lower dimensional spaces in a distance-preserving way. The Spark clusterer > should support such embedding. > An example implementation that supports high dimensional data is here: > https://github.com/derrickburns/generalized-kmeans-clustering -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
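A concrete example of a distance-preserving method that could live as a standalone feature transformer is Gaussian random projection (Johnson-Lindenstrauss style). The class below is a hypothetical local sketch to make the separation-of-concerns point: it reduces dimensionality while approximately preserving pairwise Euclidean distances, and the clusterer never needs to know about it. It is not an MLlib API.
{code}
import scala.util.Random

/** Hypothetical Gaussian random projection from inputDim to outputDim dimensions;
  * approximately distance-preserving for Euclidean distance. Sketch only. */
class RandomProjection(inputDim: Int, outputDim: Int, seed: Long = 42L) extends Serializable {
  private val rng = new Random(seed)
  // outputDim x inputDim matrix with N(0, 1/outputDim) entries.
  private val matrix: Array[Array[Double]] =
    Array.fill(outputDim, inputDim)(rng.nextGaussian() / math.sqrt(outputDim))

  def transform(v: Array[Double]): Array[Double] =
    matrix.map(row => row.zip(v).map { case (m, x) => m * x }.sum)
}
{code}
Points would be projected once up front, then fed to the existing k-means implementation unchanged.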
[jira] [Commented] (SPARK-1182) Sort the configuration parameters in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334025#comment-14334025 ] Brennon York commented on SPARK-1182: - Given [~joshrosen]'s comments on the PR making merge-conflict hell, would it be better just to scratch this as an issue and close everything out? It's either that or dealing with all the merge conflicts for any/all backports moving forward. Thoughts? > Sort the configuration parameters in configuration.md > - > > Key: SPARK-1182 > URL: https://issues.apache.org/jira/browse/SPARK-1182 > Project: Spark > Issue Type: Task > Components: Documentation >Reporter: Reynold Xin >Assignee: prashant >Priority: Minor > > It is a little bit confusing right now since the config options are all over > the place in some arbitrarily sorted order. > https://github.com/apache/spark/blob/master/docs/configuration.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4039) KMeans support sparse cluster centers
[ https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-4039: -- > KMeans support sparse cluster centers > - > > Key: SPARK-4039 > URL: https://issues.apache.org/jira/browse/SPARK-4039 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Antoine Amend > > When the number of features is not known, it might be quite helpful to create > sparse vectors using HashingTF.transform. KMeans transforms the center vectors > to dense vectors > (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307), > therefore leading to OutOfMemory (even with small k). > Is there any way to keep the vectors sparse? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4039) KMeans support sparse cluster centers
[ https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-4039. Resolution: Duplicate > KMeans support sparse cluster centers > - > > Key: SPARK-4039 > URL: https://issues.apache.org/jira/browse/SPARK-4039 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Antoine Amend > > When the number of features is not known, it might be quite helpful to create > sparse vectors using HashingTF.transform. KMeans transforms the center vectors > to dense vectors > (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307), > therefore leading to OutOfMemory (even with small k). > Is there any way to keep the vectors sparse? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5405) Spark clusterer should support high dimensional data
[ https://issues.apache.org/jira/browse/SPARK-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5405: - Labels: clustering (was: ) > Spark clusterer should support high dimensional data > > > Key: SPARK-5405 > URL: https://issues.apache.org/jira/browse/SPARK-5405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Derrick Burns > Labels: clustering > Original Estimate: 504h > Remaining Estimate: 504h > > The MLLIB clusterer works well for low (<200) dimensional data. However, > performance is linear with the number of dimensions. So, for practical > purposes, it is not very useful for high dimensional data. > Depending on the data type, one can embed the high dimensional data into > lower dimensional spaces in a distance-preserving way. The Spark clusterer > should support such embedding. > An example implementation that supports high dimensional data is here: > https://github.com/derrickburns/generalized-kmeans-clustering -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5261) In some cases ,The value of word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334021#comment-14334021 ] Xiangrui Meng commented on SPARK-5261: -- Could you try a larger minCount to reduce the vocabulary size? > In some cases ,The value of word's vector representation is too big > --- > > Key: SPARK-5261 > URL: https://issues.apache.org/jira/browse/SPARK-5261 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Guoqiang Li > > {code} > val word2Vec = new Word2Vec() > word2Vec. > setVectorSize(100). > setSeed(42L). > setNumIterations(5). > setNumPartitions(36) > {code} > The average absolute value of the word's vector representation is 60731.8 > {code} > val word2Vec = new Word2Vec() > word2Vec. > setVectorSize(100). > setSeed(42L). > setNumIterations(5). > setNumPartitions(1) > {code} > The average absolute value of the word's vector representation is 0.13889 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
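In code, the suggestion above would look roughly like the snippet below: raise minCount so that rare words are dropped and the vocabulary (and hence model size) shrinks. This assumes a setMinCount setter on the MLlib Word2Vec, which is available in later releases; the value 25 is only an illustrative guess, not a recommendation from this thread.
{code}
import org.apache.spark.mllib.feature.Word2Vec

val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(25)   // drop words occurring fewer than 25 times (illustrative value)
{code}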
[jira] [Commented] (SPARK-794) Remove sleep() in ClusterScheduler.stop
[ https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334020#comment-14334020 ] Brennon York commented on SPARK-794: [~srowen] [~joshrosen] bump on this. Would assume things are stable with the removal of the sleep method, but want to double check. Thinking we can close this ticket out. > Remove sleep() in ClusterScheduler.stop > --- > > Key: SPARK-794 > URL: https://issues.apache.org/jira/browse/SPARK-794 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 0.9.0 >Reporter: Matei Zaharia > Labels: backport-needed > Fix For: 1.3.0 > > > This temporary change made a while back slows down the unit tests quite a bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5226: - Labels: DBSCAN clustering (was: DBSCAN) > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. The first candidate, I think, is DBSCAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5940) Graph Loader: refactor + add more formats
[ https://issues.apache.org/jira/browse/SPARK-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334013#comment-14334013 ] Magellanea commented on SPARK-5940: --- [~lukovnikov] Thanks a lot for the reply. Do you know whom I can mention/tag from the Spark/GraphX team in this discussion? Thanks > Graph Loader: refactor + add more formats > - > > Key: SPARK-5940 > URL: https://issues.apache.org/jira/browse/SPARK-5940 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: lukovnikov >Priority: Minor > > Currently, the only graph loader is GraphLoader.edgeListFile. [SPARK-5280] > adds an RDF graph loader. > However, as Takeshi Yamamuro suggested on github [SPARK-5280], > https://github.com/apache/spark/pull/4650, it might be interesting to make > GraphLoader an interface with several implementations for different formats. > And maybe it's good to make a façade graph loader that provides a unified > interface to all loaders. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5010) native openblas library doesn't work: undefined symbol: cblas_dscal
[ https://issues.apache.org/jira/browse/SPARK-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-5010. Resolution: Not a Problem I'm closing this JIRA because it is an upstream issue with the native BLAS library. > native openblas library doesn't work: undefined symbol: cblas_dscal > --- > > Key: SPARK-5010 > URL: https://issues.apache.org/jira/browse/SPARK-5010 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 > Environment: standalone >Reporter: Tomas Hudik >Priority: Minor > Labels: mllib, openblas > > 1. compiled and installed the OpenBLAS library > 2. ln -s libopenblas_sandybridgep-r0.2.13.so /usr/lib/libblas.so.3 > 3. compiled and built Spark: > mvn -Pnetlib-lgpl -DskipTests clean compile package > 4. run: bin/run-example mllib.LinearRegression > data/mllib/sample_libsvm_data.txt > 14/12/30 18:39:57 INFO BlockManagerMaster: Trying to register BlockManager > 14/12/30 18:39:57 INFO BlockManagerMasterActor: Registering block manager > localhost:34297 with 265.1 MB RAM, BlockManagerId(, localhost, 34297) > 14/12/30 18:39:57 INFO BlockManagerMaster: Registered BlockManager > 14/12/30 18:39:58 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 14/12/30 18:39:58 WARN LoadSnappy: Snappy native library not loaded > Training: 80, test: 20. > /usr/local/lib/jdk1.8.0//bin/java: symbol lookup error: > /tmp/jniloader1826801168744171087netlib-native_system-linux-x86_64.so: > undefined symbol: cblas_dscal > I followed the guide: https://spark.apache.org/docs/latest/mllib-guide.html > section dependencies. > Am I missing something? > How do I force Spark to use the OpenBLAS library? > Thanks, Tomas -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org