[jira] [Updated] (SPARK-5914) Enable spark-submit to run with only user permission on windows

2015-02-23 Thread Judy Nash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Judy Nash updated SPARK-5914:
-
Summary: Enable spark-submit to run with only user permission on windows  
(was: Spark-submit cannot execute without machine admin permission on windows)

> Enable spark-submit to run with only user permission on windows
> ---
>
> Key: SPARK-5914
> URL: https://issues.apache.org/jira/browse/SPARK-5914
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, Windows
> Environment: Windows
>Reporter: Judy Nash
>Priority: Minor
>
> On the Windows platform only.
> If the slave is executed with only user permission, spark-submit fails with
> java.lang.ClassNotFoundException when attempting to read the cached jar from
> the spark_home\work folder.
> This is because the jars do not have read permission set by default on
> Windows. The fix is to explicitly add read permission for the owner of the file.
> Running the service account as admin (the equivalent of sudo on Linux) is a
> major security risk for production clusters: it makes it easy for attackers to
> compromise the cluster by taking over the service account.
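
A minimal sketch of the kind of fix described above, assuming the permission is
added right after the jar is cached in the work folder (the helper below is
illustrative, not the actual patch):

{code}
import java.io.File

// Hypothetical helper: explicitly grant the file's owner read access to a
// cached jar. On Windows the cached copy may otherwise lack read permission.
def ensureOwnerReadable(jar: File): Unit = {
  // setReadable(true, true) sets the read bit for the owner only.
  if (!jar.canRead && !jar.setReadable(true, /* ownerOnly = */ true)) {
    throw new SecurityException(
      s"Could not add owner read permission to ${jar.getAbsolutePath}")
  }
}
{code}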






[jira] [Created] (SPARK-5968) Parquet warning in spark-shell

2015-02-23 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-5968:
---

 Summary: Parquet warning in spark-shell
 Key: SPARK-5968
 URL: https://issues.apache.org/jira/browse/SPARK-5968
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical


{code}
15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file for 
rankings
parquet.io.ParquetEncodingException: 
file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet invalid: 
all the files must be contained in the root rankings
at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
at 
parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
at 
parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
{code}

It is only a warning, but it looks scary to the user. We should try to
suppress it through the logger.
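
A minimal sketch of one way to do that, assuming log4j is the active logging
backend and that raising the level on the Parquet output committer's logger is
acceptable (where exactly to install this is an assumption):

{code}
import org.apache.log4j.{Level, Logger}

// Only report errors from the Parquet output committer, so the
// "could not write summary file" warning above is no longer printed.
Logger.getLogger("parquet.hadoop.ParquetOutputCommitter").setLevel(Level.ERROR)
{code}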






[jira] [Commented] (SPARK-5967) JobProgressListener.stageIdToActiveJobIds never cleared

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334534#comment-14334534
 ] 

Apache Spark commented on SPARK-5967:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/4741

> JobProgressListener.stageIdToActiveJobIds never cleared
> ---
>
> Key: SPARK-5967
> URL: https://issues.apache.org/jira/browse/SPARK-5967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0, 1.3.0, 1.2.1
>Reporter: Tathagata Das
>Assignee: Josh Rosen
>Priority: Blocker
>
> The hashmap stageIdToActiveJobIds in JobProgressListener is never cleared, so
> the hashmap keeps increasing in size indefinitely. This is a blocker for 24/7
> running applications like Spark Streaming apps.






[jira] [Updated] (SPARK-5967) JobProgressListener.stageIdToActiveJobIds never cleared

2015-02-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5967:
-
Affects Version/s: 1.2.0
   1.2.1

> JobProgressListener.stageIdToActiveJobIds never cleared
> ---
>
> Key: SPARK-5967
> URL: https://issues.apache.org/jira/browse/SPARK-5967
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0, 1.3.0, 1.2.1
>Reporter: Tathagata Das
>Assignee: Josh Rosen
>Priority: Blocker
>
> The hashmap stageIdToActiveJobIds in JobProgressListener is never cleared, so
> the hashmap keeps increasing in size indefinitely. This is a blocker for 24/7
> running applications like Spark Streaming apps.






[jira] [Created] (SPARK-5967) JobProgressListener.stageIdToActiveJobIds never cleared

2015-02-23 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-5967:


 Summary: JobProgressListener.stageIdToActiveJobIds never cleared
 Key: SPARK-5967
 URL: https://issues.apache.org/jira/browse/SPARK-5967
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Tathagata Das
Assignee: Josh Rosen
Priority: Blocker


The hashmap stageIdToActiveJobIds in JobProgressListener is never cleared, so
the hashmap keeps increasing in size indefinitely. This is a blocker for 24/7
running applications like Spark Streaming apps.
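
A minimal sketch of the kind of cleanup that would bound the map, assuming
entries are dropped once the last job referencing a stage ends (the bookkeeping
below is illustrative, not the actual JobProgressListener patch):

{code}
import scala.collection.mutable

// Illustrative bookkeeping only: track which active jobs reference each stage,
// and prune a stage's entry once no active job references it any more.
class StageToJobIndex {
  private val stageIdToActiveJobIds = mutable.HashMap.empty[Int, mutable.HashSet[Int]]

  def onJobStart(jobId: Int, stageIds: Seq[Int]): Unit = synchronized {
    stageIds.foreach { stageId =>
      stageIdToActiveJobIds.getOrElseUpdate(stageId, mutable.HashSet.empty[Int]) += jobId
    }
  }

  def onJobEnd(jobId: Int): Unit = synchronized {
    // Remove the finished job from every stage's set, then drop empty entries
    // so the map cannot grow indefinitely in long-running applications.
    stageIdToActiveJobIds.values.foreach(_ -= jobId)
    stageIdToActiveJobIds.retain { case (_, jobs) => jobs.nonEmpty }
  }
}
{code}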






[jira] [Commented] (SPARK-4705) Driver retries in cluster mode always fail if event logging is enabled

2015-02-23 Thread Twinkle Sachdeva (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334499#comment-14334499
 ] 

Twinkle Sachdeva commented on SPARK-4705:
-

Hi [~vanzin],

Please have a look at the Update UI- II.

Thanks,
Twinkle

> Driver retries in cluster mode always fail if event logging is enabled
> --
>
> Key: SPARK-4705
> URL: https://issues.apache.org/jira/browse/SPARK-4705
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
> Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, Updated UI - 
> II.png
>
>
> yarn-cluster mode will retry running the driver in certain failure modes. If
> event logging is enabled, the retry will most likely fail, because:
> {noformat}
> Exception in thread "Driver" java.io.IOException: Log directory 
> hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
>  already exists!
> at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
> at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
> at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
> {noformat}
> The event log path should be "more unique". Or perhaps retries of the same app
> should clean up the old logs first.
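
A minimal sketch of the "more unique" path idea, assuming the application
attempt ID is available when the log directory name is built (how the attempt
ID is obtained is an assumption, not Spark's actual wiring):

{code}
// Build an event-log directory that is unique per driver attempt, so a retried
// driver does not collide with the directory left behind by an earlier attempt.
def eventLogDir(baseDir: String, appId: String, attemptId: Option[String]): String = {
  val suffix = attemptId.map(id => s"_attempt_$id").getOrElse("")
  s"$baseDir/$appId$suffix"
}

// eventLogDir("hdfs:///user/spark/applicationHistory",
//             "application_1417554558066_0003", Some("2"))
// -> ".../applicationHistory/application_1417554558066_0003_attempt_2"
{code}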






[jira] [Commented] (SPARK-5965) Spark UI does not show main class when running app in standalone cluster mode

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334493#comment-14334493
 ] 

Apache Spark commented on SPARK-5965:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/4739

> Spark UI does not show main class when running app in standalone cluster mode
> -
>
> Key: SPARK-5965
> URL: https://issues.apache.org/jira/browse/SPARK-5965
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> I tried this.
> bin/spark-submit --verbose --supervise --master spark://Zion.local:7077 
> --deploy-mode cluster  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
>  100 0.5
> The app launched correctly, but the Spark web UI showed the attached
> screenshot.






[jira] [Updated] (SPARK-5965) Spark UI does not show main class when running app in standalone cluster mode

2015-02-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5965:
-
Target Version/s: 1.3.0

> Spark UI does not show main class when running app in standalone cluster mode
> -
>
> Key: SPARK-5965
> URL: https://issues.apache.org/jira/browse/SPARK-5965
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> I tried this.
> bin/spark-submit --verbose --supervise --master spark://Zion.local:7077 
> --deploy-mode cluster  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
>  100 0.5
> The app launched correctly, but the Spark web UI showed the attached
> screenshot.






[jira] [Updated] (SPARK-4705) Driver retries in cluster mode always fail if event logging is enabled

2015-02-23 Thread Twinkle Sachdeva (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Twinkle Sachdeva updated SPARK-4705:

Attachment: Updated UI - II.png

> Driver retries in cluster mode always fail if event logging is enabled
> --
>
> Key: SPARK-4705
> URL: https://issues.apache.org/jira/browse/SPARK-4705
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
> Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, Updated UI - 
> II.png
>
>
> yarn-cluster mode will retry running the driver in certain failure modes. If
> event logging is enabled, the retry will most likely fail, because:
> {noformat}
> Exception in thread "Driver" java.io.IOException: Log directory 
> hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
>  already exists!
> at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
> at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
> at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
> {noformat}
> The event log path should be "more unique". Or perhaps retries of the same app
> should clean up the old logs first.






[jira] [Commented] (SPARK-5953) NoSuchMethodException with a Kafka input stream and custom decoder in Scala

2015-02-23 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334469#comment-14334469
 ] 

Saisai Shao commented on SPARK-5953:


I think the decoder is initialized on the executor side, so your class needs to
be visible on the executors. One way is to add the jar that contains the custom
decoder class using SparkContext#addJar.
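
A minimal sketch of that suggestion, assuming the custom decoder is packaged
into its own jar (the jar path and batch interval below are illustrative):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("local[2]", "kafka-decoder-example")
// Ship the jar containing UserLocationEventDecoder to the executors so the
// Kafka receiver can instantiate the decoder reflectively there.
sc.addJar("/path/to/user-location-event-decoder.jar")
val ssc = new StreamingContext(sc, Seconds(2))
{code}

Passing the same jar to spark-submit via --jars should have the equivalent effect.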

> NoSuchMethodException with a Kafka input stream and custom decoder in Scala
> ---
>
> Key: SPARK-5953
> URL: https://issues.apache.org/jira/browse/SPARK-5953
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 1.2.0, 1.2.1
> Environment: Xubuntu 14.04, Kafka 0.8.2, Scala 2.10.4, Scala 2.11.5
>Reporter: Aleksandar Stojadinovic
>
> When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
> throws an exception upon starting:
> {noformat}
> ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
> receiver 0 - java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.<init>(kafka.utils.VerifiableProperties)
>   at java.lang.Class.getConstructor0(Class.java:2971)
>   at java.lang.Class.getConstructor(Class.java:1812)
>   at 
> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/02/23 21:37:31 ERROR ReceiverSupervisorImpl: Stopped executor with error: 
> java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.<init>(kafka.utils.VerifiableProperties)
> 15/02/23 21:37:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.<init>(kafka.utils.VerifiableProperties)
>   at java.lang.Class.getConstructor0(Class.java:2971)
>   at java.lang.Class.getConstructor(Class.java:1812)
>   at 
> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The stream is initialized with:
> {code:title=Main.scala|borderStyle=solid}
>  val locationsAndKeys = KafkaUtils.createStream[String, Array[Byte], 
> kafka.serializer.StringDecoder, UserLocationEventDecoder] (ssc, kafkaParams, 
> topicMap, StorageLevel.MEMORY_AND_DISK);
> {code}
> The decoder:
> {code:title=UserLocationEventDecoder.scala|borderStyle=solid}
> import kafka.serializer.Decoder
> class UserLocationEventDecoder extends Decoder[UserLocationEvent] {
>   val kryo = new Kryo()
>   override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
> val input: Input = new Input(new ByteArrayInputStream(bytes))
> val userLocationEvent: UserLo

[jira] [Resolved] (SPARK-5958) Update block matrix user guide

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5958.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

> Update block matrix user guide
> --
>
> Key: SPARK-5958
> URL: https://issues.apache.org/jira/browse/SPARK-5958
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
> Fix For: 1.3.0
>
>
> There are some minor issues (see the PR) with the current block matrix user 
> guide.






[jira] [Comment Edited] (SPARK-3850) Scala style: disallow trailing spaces

2015-02-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334351#comment-14334351
 ] 

Nicholas Chammas edited comment on SPARK-3850 at 2/24/15 5:16 AM:
--

{quote}
enabled="false"
{quote}

Per the parent issue SPARK-3849, I believe this issue is about enabling this 
rule in a non-intrusive way. So I think we still need this issue.


was (Author: nchammas):
{quote}
enabled="false"
{quote}

Per the parent issue SPARK-3849, I believe this issue about enabling this rule 
in a non-intrusive way. So I think we still need this issue.

> Scala style: disallow trailing spaces
> -
>
> Key: SPARK-3850
> URL: https://issues.apache.org/jira/browse/SPARK-3850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Nicholas Chammas
>
> [Ted Yu on the dev 
> list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
>  suggested using {{WhitespaceEndOfLineChecker}} here: 
> http://www.scalastyle.org/rules-0.1.0.html






[jira] [Updated] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]

2015-02-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5966:
-
Description: 
{code}

[tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
test.MemoryTest --driver-memory 1G 
/Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
App arguments:
Usage: MemoryTest  

[tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster  
--class test.MemoryTest --driver-memory 1G 
/Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
java.lang.ClassNotFoundException:
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

  was:
{{

[tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
test.MemoryTest --driver-memory 1G 
/Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
App arguments:
Usage: MemoryTest  

[tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster  
--class test.MemoryTest --driver-memory 1G 
/Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
java.lang.ClassNotFoundException:
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
}}


> Spark-submit deploy-mode incorrectly affecting submission when master = 
> local[4] 
> -
>
> Key: SPARK-5966
> URL: https://issues.apache.org/jira/browse/SPARK-5966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>
> {code}
> [tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
> test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
> App arguments:
> Usage: MemoryTest  
> [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster 
>  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> java.lang.ClassNotFoundException:
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}






[jira] [Updated] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]

2015-02-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5966:
-
Description: 
{{

[tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
test.MemoryTest --driver-memory 1G 
/Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
App arguments:
Usage: MemoryTest  

[tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster  
--class test.MemoryTest --driver-memory 1G 
/Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
java.lang.ClassNotFoundException:
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
}}

  was:
{{code}}
[tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
test.MemoryTest --driver-memory 1G 
/Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
App arguments:
Usage: MemoryTest  

[tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster  
--class test.MemoryTest --driver-memory 1G 
/Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
java.lang.ClassNotFoundException:
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{{code}}


> Spark-submit deploy-mode incorrectly affecting submission when master = 
> local[4] 
> -
>
> Key: SPARK-5966
> URL: https://issues.apache.org/jira/browse/SPARK-5966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>
> {{
> [tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
> test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
> App arguments:
> Usage: MemoryTest  
> [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster 
>  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> java.lang.ClassNotFoundException:
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> }}






[jira] [Created] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]

2015-02-23 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-5966:


 Summary: Spark-submit deploy-mode incorrectly affecting submission 
when master = local[4] 
 Key: SPARK-5966
 URL: https://issues.apache.org/jira/browse/SPARK-5966
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.3.0
Reporter: Tathagata Das
Assignee: Andrew Or
Priority: Critical


{{code}}
[tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
test.MemoryTest --driver-memory 1G 
/Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
App arguments:
Usage: MemoryTest  

[tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster  
--class test.MemoryTest --driver-memory 1G 
/Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
java.lang.ClassNotFoundException:
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{{code}}






[jira] [Updated] (SPARK-5965) Spark UI does not show main class when running app in standalone cluster mode

2015-02-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5965:
-
Attachment: screenshot-1.png

> Spark UI does not show main class when running app in standalone cluster mode
> -
>
> Key: SPARK-5965
> URL: https://issues.apache.org/jira/browse/SPARK-5965
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> I tried this.
> bin/spark-submit --verbose --supervise --master spark://Zion.local:7077 
> --deploy-mode cluster  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
>  100 0.5
> The app launched correctly, but the Spark web UI showed the attached
> screenshot.






[jira] [Created] (SPARK-5965) Spark UI does not show main class when running app in standalone cluster mode

2015-02-23 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-5965:


 Summary: Spark UI does not show main class when running app in 
standalone cluster mode
 Key: SPARK-5965
 URL: https://issues.apache.org/jira/browse/SPARK-5965
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.3.0
Reporter: Tathagata Das
Assignee: Andrew Or
Priority: Blocker


I tried this.

bin/spark-submit --verbose --supervise --master spark://Zion.local:7077 
--deploy-mode cluster  --class test.MemoryTest --driver-memory 1G 
/Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
 100 0.5

The app launched correctly, but the Spark web UI showed the attached
screenshot.






[jira] [Comment Edited] (SPARK-1182) Sort the configuration parameters in configuration.md

2015-02-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334361#comment-14334361
 ] 

Reynold Xin edited comment on SPARK-1182 at 2/24/15 4:02 AM:
-

Actually, when I filed this ticket, I don't think there was extensive high-level
grouping of config options (there were maybe two or three groups). I took a look
at the latest grouping and it looked reasonable.

A few problems with the existing grouping:

1. Networking section actually contains some shuffle settings. Those should be 
moved into shuffle.

2. Application Properties contains some serialization settings. Those should be 
moved into Compression and Serialization.

3. Runtime Environment contains some Mesos specific setting. All the way down 
on the page we link to each resource manager's own page for resource manager 
specific settings.

4. Scheduling section contains a mesos specific setting (spark.mesos.coarse).

5. It is sort of arbitrary that "Dynamic allocation" has its own section (at 
the very least, "a" should be upper case.)

6. "spark.ui.view.acls" is in security, but it also belongs in UI. Where do we 
put stuff like this?


Within each section, config options should probably be sorted alphabetically.




was (Author: rxin):
Actually when I filed this ticket, I don't think there was extensive high level 
grouping of config options (there was maybe two or three). I took a look at the 
latest grouping and they looked reasonable. 

A few problems with the existing grouping:

1. Networking section actually contains some shuffle settings. Those should be 
moved into shuffle.

2. Application Properties contains some serialization settings. Those should be 
moved into Compression and Serialization.

3. Runtime Environment contains some Mesos specific setting. All the way down 
on the page we link to each resource manager's own page for resource manager 
specific settings.

4. Scheduling section contains a mesos specific setting (spark.mesos.coarse).

5. It is sort of arbitrary that "Dynamic allocation" has its own section (at 
the very least, "a" should be upper case.)

6. "spark.ui.view.acls" is in security, but it also belongs in UI. Where do we 
put stuff like this?




> Sort the configuration parameters in configuration.md
> -
>
> Key: SPARK-1182
> URL: https://issues.apache.org/jira/browse/SPARK-1182
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Reynold Xin
>Assignee: prashant
>Priority: Minor
>
> It is a little bit confusing right now since the config options are all over 
> the place in some arbitrarily sorted order.
> https://github.com/apache/spark/blob/master/docs/configuration.md






[jira] [Comment Edited] (SPARK-1182) Sort the configuration parameters in configuration.md

2015-02-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334361#comment-14334361
 ] 

Reynold Xin edited comment on SPARK-1182 at 2/24/15 4:01 AM:
-

Actually when I filed this ticket, I don't think there was any extensive high 
level grouping of config options (there was maybe two or three). I took a look 
at the latest grouping and they looked reasonable. 

A few problems with the existing grouping:

1. Networking section actually contains some shuffle settings. Those should be 
moved into shuffle.

2. Application Properties contains some serialization settings. Those should be 
moved into Compression and Serialization.

3. Runtime Environment contains some Mesos specific setting. All the way down 
on the page we link to each resource manager's own page for resource manager 
specific settings.

4. Scheduling section contains a mesos specific setting (spark.mesos.coarse).

5. It is sort of arbitrary that "Dynamic allocation" has its own section (at 
the very least, "a" should be upper case.)

6. "spark.ui.view.acls" is in security, but it also belongs in UI. Where do we 
put stuff like this?





was (Author: rxin):
Actually when I filed this ticket, I don't think there was any extensive high 
level grouping of config options (there was maybe two or three). I took a look 
at the latest grouping and they looked reasonable. 

A few problems with the existing grouping:

1. Networking section actually contains some shuffle settings. Those should be 
moved into shuffle.

2. Application Properties contains some serialization settings. Those should be 
moved into Compression and Serialization.

3. Runtime Environment contains some Mesos specific setting. All the way down 
on the page we link to each resource manager's own page for resource manager 
specific settings.

4. Scheduling section contains a mesos specific setting (spark.mesos.coarse).

5. It is sort of arbitrary that "Dynamic allocation" has its own section (at 
the very least, "a" should be upper case.)

6. "spark.ui.view.acls" is in security, but it should also belong in UI. Where 
do we put stuff like this?




> Sort the configuration parameters in configuration.md
> -
>
> Key: SPARK-1182
> URL: https://issues.apache.org/jira/browse/SPARK-1182
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Reynold Xin
>Assignee: prashant
>Priority: Minor
>
> It is a little bit confusing right now since the config options are all over 
> the place in some arbitrarily sorted order.
> https://github.com/apache/spark/blob/master/docs/configuration.md






[jira] [Comment Edited] (SPARK-1182) Sort the configuration parameters in configuration.md

2015-02-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334361#comment-14334361
 ] 

Reynold Xin edited comment on SPARK-1182 at 2/24/15 4:01 AM:
-

Actually when I filed this ticket, I don't think there was extensive high level 
grouping of config options (there was maybe two or three). I took a look at the 
latest grouping and they looked reasonable. 

A few problems with the existing grouping:

1. Networking section actually contains some shuffle settings. Those should be 
moved into shuffle.

2. Application Properties contains some serialization settings. Those should be 
moved into Compression and Serialization.

3. Runtime Environment contains some Mesos specific setting. All the way down 
on the page we link to each resource manager's own page for resource manager 
specific settings.

4. Scheduling section contains a mesos specific setting (spark.mesos.coarse).

5. It is sort of arbitrary that "Dynamic allocation" has its own section (at 
the very least, "a" should be upper case.)

6. "spark.ui.view.acls" is in security, but it also belongs in UI. Where do we 
put stuff like this?





was (Author: rxin):
Actually when I filed this ticket, I don't think there was any extensive high 
level grouping of config options (there was maybe two or three). I took a look 
at the latest grouping and they looked reasonable. 

A few problems with the existing grouping:

1. Networking section actually contains some shuffle settings. Those should be 
moved into shuffle.

2. Application Properties contains some serialization settings. Those should be 
moved into Compression and Serialization.

3. Runtime Environment contains some Mesos specific setting. All the way down 
on the page we link to each resource manager's own page for resource manager 
specific settings.

4. Scheduling section contains a mesos specific setting (spark.mesos.coarse).

5. It is sort of arbitrary that "Dynamic allocation" has its own section (at 
the very least, "a" should be upper case.)

6. "spark.ui.view.acls" is in security, but it also belongs in UI. Where do we 
put stuff like this?




> Sort the configuration parameters in configuration.md
> -
>
> Key: SPARK-1182
> URL: https://issues.apache.org/jira/browse/SPARK-1182
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Reynold Xin
>Assignee: prashant
>Priority: Minor
>
> It is a little bit confusing right now since the config options are all over 
> the place in some arbitrarily sorted order.
> https://github.com/apache/spark/blob/master/docs/configuration.md






[jira] [Commented] (SPARK-1182) Sort the configuration parameters in configuration.md

2015-02-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334361#comment-14334361
 ] 

Reynold Xin commented on SPARK-1182:


Actually when I filed this ticket, I don't think there was any extensive high 
level grouping of config options (there was maybe two or three). I took a look 
at the latest grouping and they looked reasonable. 

A few problems with the existing grouping:

1. Networking section actually contains some shuffle settings. Those should be 
moved into shuffle.

2. Application Properties contains some serialization settings. Those should be 
moved into Compression and Serialization.

3. Runtime Environment contains some Mesos specific setting. All the way down 
on the page we link to each resource manager's own page for resource manager 
specific settings.

4. Scheduling section contains a mesos specific setting (spark.mesos.coarse).

5. It is sort of arbitrary that "Dynamic allocation" has its own section (at 
the very least, "a" should be upper case.)

6. "spark.ui.view.acls" is in security, but it should also belong in UI. Where 
do we put stuff like this?




> Sort the configuration parameters in configuration.md
> -
>
> Key: SPARK-1182
> URL: https://issues.apache.org/jira/browse/SPARK-1182
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Reynold Xin
>Assignee: prashant
>Priority: Minor
>
> It is a little bit confusing right now since the config options are all over 
> the place in some arbitrarily sorted order.
> https://github.com/apache/spark/blob/master/docs/configuration.md






[jira] [Commented] (SPARK-5964) Allow spark-daemon.sh to support foreground operation

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334359#comment-14334359
 ] 

Apache Spark commented on SPARK-5964:
-

User 'hellertime' has created a pull request for this issue:
https://github.com/apache/spark/pull/3881

> Allow spark-daemon.sh to support foreground operation
> -
>
> Key: SPARK-5964
> URL: https://issues.apache.org/jira/browse/SPARK-5964
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Chris Heller
>Priority: Minor
>
> Add a --foreground option to spark-daemon.sh to prevent the process from
> daemonizing itself. This is useful when running under a watchdog that waits
> on its child process.
> https://github.com/apache/spark/pull/3881






[jira] [Created] (SPARK-5964) Allow spark-daemon.sh to support foreground operation

2015-02-23 Thread Chris Heller (JIRA)
Chris Heller created SPARK-5964:
---

 Summary: Allow spark-daemon.sh to support foreground operation
 Key: SPARK-5964
 URL: https://issues.apache.org/jira/browse/SPARK-5964
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Chris Heller
Priority: Minor


Add a --foreground option to spark-daemon.sh to prevent the process from
daemonizing itself. This is useful when running under a watchdog that waits on
its child process.

https://github.com/apache/spark/pull/3881






[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2015-02-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334355#comment-14334355
 ] 

Nicholas Chammas commented on SPARK-5312:
-

Yeah, this is not a priority really. I looked into sbt and agree it's probably 
not suited to the task. I found something else that looks interesting: 
http://software.clapper.org/classutil/

But I don't have time to look into it.

> Use sbt to detect new or changed public classes in PRs
> --
>
> Key: SPARK-5312
> URL: https://issues.apache.org/jira/browse/SPARK-5312
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We currently use an [unwieldy grep/sed 
> contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
>  to detect new public classes in PRs.
> Apparently, sbt lets you get a list of public classes [much more 
> directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
> {{show compile:discoveredMainClasses}}. We should use that instead.






[jira] [Commented] (SPARK-4123) Show new dependencies added in pull requests

2015-02-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334352#comment-14334352
 ] 

Nicholas Chammas commented on SPARK-4123:
-

Go ahead! I haven't done anything for this yet.

> Show new dependencies added in pull requests
> 
>
> Key: SPARK-4123
> URL: https://issues.apache.org/jira/browse/SPARK-4123
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Priority: Critical
>
> We should inspect the classpath of Spark's assembly jar for every pull 
> request. This only takes a few seconds in Maven and it will help weed out 
> dependency changes from the master branch. Ideally we'd post any dependency 
> changes in the pull request message.
> {code}
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
> $ git checkout apache/master
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
> $ diff my-classpath master-classpath
> < chill-java-0.3.6.jar
> < chill_2.10-0.3.6.jar
> ---
> > chill-java-0.5.0.jar
> > chill_2.10-0.5.0.jar
> {code}






[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces

2015-02-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334351#comment-14334351
 ] 

Nicholas Chammas commented on SPARK-3850:
-

{quote}
enabled="false"
{quote}

Per the parent issue SPARK-3849, I believe this issue is about enabling this
rule in a non-intrusive way. So I think we still need this issue.

> Scala style: disallow trailing spaces
> -
>
> Key: SPARK-3850
> URL: https://issues.apache.org/jira/browse/SPARK-3850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Nicholas Chammas
>
> [Ted Yu on the dev 
> list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
>  suggested using {{WhitespaceEndOfLineChecker}} here: 
> http://www.scalastyle.org/rules-0.1.0.html






[jira] [Commented] (SPARK-1182) Sort the configuration parameters in configuration.md

2015-02-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334338#comment-14334338
 ] 

Reynold Xin commented on SPARK-1182:


cc [~pwendell] what do you think? I think it is probably still worth doing,
since the config page is sort of a mess. The overhead of backporting
documentation fixes should be small; I don't think we backport documentation
fixes that often.




> Sort the configuration parameters in configuration.md
> -
>
> Key: SPARK-1182
> URL: https://issues.apache.org/jira/browse/SPARK-1182
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Reynold Xin
>Assignee: prashant
>Priority: Minor
>
> It is a little bit confusing right now since the config options are all over 
> the place in some arbitrarily sorted order.
> https://github.com/apache/spark/blob/master/docs/configuration.md






[jira] [Updated] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()

2015-02-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5960:
-
Assignee: Chris Fregly

> Allow AWS credentials to be passed to KinesisUtils.createStream()
> -
>
> Key: SPARK-5960
> URL: https://issues.apache.org/jira/browse/SPARK-5960
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Chris Fregly
>Assignee: Chris Fregly
>
> While IAM roles are preferable, we're seeing a lot of cases where we need to 
> pass AWS credentials when creating the KinesisReceiver.
> Notes:
> * Make sure we don't log the credentials anywhere
> * Maintain compatibility with existing KinesisReceiver-based code.
>  






[jira] [Updated] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver

2015-02-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5959:
-
Assignee: Chris Fregly

> Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
> -
>
> Key: SPARK-5959
> URL: https://issues.apache.org/jira/browse/SPARK-5959
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Chris Fregly
>Assignee: Chris Fregly
>
> After each block is stored reliably in the WAL (i.e., after the store() call
> returns), ACK back to Kinesis.
> There is still the possibility of the ReliableKinesisReceiver dying before the
> ACK reaches Kinesis; however, no data will be lost. Duplicate data is still
> possible.
> Notes:
> * Make sure we're not overloading the checkpoint control plane which uses 
> DynamoDB.
> * May need to disable auto-checkpointing and remove the checkpoint interval.
> * Maintain compatibility with existing KinesisReceiver-based code.
>  
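
A minimal sketch of the store-then-ACK ordering described above, using a
hypothetical Checkpointer trait as a stand-in for the Kinesis client's
checkpoint call (nothing below is the actual KinesisReceiver code):

{code}
// Hypothetical stand-in for the Kinesis client library's checkpoint interface.
trait Checkpointer {
  def checkpoint(sequenceNumber: String): Unit
}

// Store the batch durably first (e.g. via the receiver's WAL-backed store()),
// and only then ACK back to Kinesis. If the receiver dies between the two
// calls, the un-ACKed records are redelivered: no data loss, but duplicates
// are possible, matching the semantics described in this issue.
def storeThenAck(records: Seq[Array[Byte]],
                 lastSequenceNumber: String,
                 store: Seq[Array[Byte]] => Unit,
                 checkpointer: Checkpointer): Unit = {
  store(records)                              // returns after the reliable write succeeds
  checkpointer.checkpoint(lastSequenceNumber) // ACK back to Kinesis
}
{code}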






[jira] [Commented] (SPARK-5790) VertexRDD's won't zip properly for `diff` capability

2015-02-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334292#comment-14334292
 ] 

Takeshi Yamamuro commented on SPARK-5790:
-

Thanks for your work :)

> VertexRDD's won't zip properly for `diff` capability
> 
>
> Key: SPARK-5790
> URL: https://issues.apache.org/jira/browse/SPARK-5790
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Brennon York
>Assignee: Brennon York
>
> For VertexRDDs with differing numbers of partitions, one cannot run operations
> like `diff`, as it will throw an IllegalArgumentException. The code below
> provides an example:
> {code}
> import org.apache.spark.graphx._
> import org.apache.spark.rdd._
> val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 3L).map(id => 
> (id, id.toInt+1)))
> setA.collect.foreach(println(_))
> val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => 
> (id, id.toInt+2)))
> setB.collect.foreach(println(_))
> val diff = setA.diff(setB)
> diff.collect.foreach(println(_))
> val setC: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => 
> (id, id.toInt+2)) ++ sc.parallelize(6L until 8L).map(id => (id, id.toInt+2)))
> setA.diff(setC).collect
> // java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
> partitions
> {code}






[jira] [Commented] (SPARK-5517) Add input types for Java UDFs

2015-02-23 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334279#comment-14334279
 ] 

Michael Armbrust commented on SPARK-5517:
-

Given the verbosity of expressing this in Java, I'm thinking we will always 
want it to be optional.  In that case I propose we just add a parallel set of 
APIs once this information is actually used.  Thus, we can bump this to 1.4.

> Add input types for Java UDFs
> -
>
> Key: SPARK-5517
> URL: https://issues.apache.org/jira/browse/SPARK-5517
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>







[jira] [Updated] (SPARK-5517) Add input types for Java UDFs

2015-02-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5517:

Target Version/s: 1.4.0  (was: 1.3.0)

> Add input types for Java UDFs
> -
>
> Key: SPARK-5517
> URL: https://issues.apache.org/jira/browse/SPARK-5517
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>







[jira] [Assigned] (SPARK-5517) Add input types for Java UDFs

2015-02-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-5517:
---

Assignee: Michael Armbrust  (was: Reynold Xin)

> Add input types for Java UDFs
> -
>
> Key: SPARK-5517
> URL: https://issues.apache.org/jira/browse/SPARK-5517
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>







[jira] [Updated] (SPARK-5963) [MLLIB] Python support for Power Iteration Clustering

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5963:
-
Assignee: Stephen Boesch

> [MLLIB] Python support for Power Iteration Clustering
> -
>
> Key: SPARK-5963
> URL: https://issues.apache.org/jira/browse/SPARK-5963
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Stephen Boesch
>Assignee: Stephen Boesch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Add python support for the Power Iteration Clustering feature.  Here is a 
> fragment of the python API as we plan to implement it:
>   /**
>* Java stub for Python mllib PowerIterationClustering.run()
>*/
>   def trainPowerIterationClusteringModel(
>   data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)],
>   k: Int,
>   maxIterations: Int,
>   runs: Int,
>   initializationMode: String,
>   seed: java.lang.Long): PowerIterationClusteringModel = {
> val picAlg = new PowerIterationClustering()
>   .setK(k)
>   .setMaxIterations(maxIterations)
> try {
>   picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK))
> } finally {
>   data.rdd.unpersist(blocking = false)
> }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334265#comment-14334265
 ] 

Apache Spark commented on SPARK-5532:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/4738

> Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
> 
>
> Key: SPARK-5532
> URL: https://issues.apache.org/jira/browse/SPARK-5532
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Michael Armbrust
>Priority: Critical
>
> Deterministic failure:
> {code}
> import org.apache.spark.mllib.linalg._
> import org.apache.spark.sql.SQLContext
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val data = sc.parallelize(Seq((1.0, 
> Vectors.dense(1,2,3)))).toDataFrame("label", "features")
> data.repartition(1).saveAsParquetFile("blah")
> {code}
> If you remove the repartition, then this succeeds.
> Here's the stack trace:
> {code}
> 15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, 
> 192.168.1.230): java.lang.ClassCastException: 
> org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
> org.apache.spark.sql.Row
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:64)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 7, 192.168.1.230): java.lang.ClassCastException: 
> org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
> org.apache.spark.sql.Row
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:64)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>

[jira] [Created] (SPARK-5962) [MLLIB] Python support for Power Iteration Clustering

2015-02-23 Thread Stephen Boesch (JIRA)
Stephen Boesch created SPARK-5962:
-

 Summary: [MLLIB] Python support for Power Iteration Clustering
 Key: SPARK-5962
 URL: https://issues.apache.org/jira/browse/SPARK-5962
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Stephen Boesch


Add Python support for the Power Iteration Clustering feature.  Here is a 
fragment of the Java-side stub backing the Python API, as we plan to implement it:

  /**
   * Java stub for Python mllib PowerIterationClustering.run()
   */
  def trainPowerIterationClusteringModel(
  data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)],
  k: Int,
  maxIterations: Int,
  runs: Int,
  initializationMode: String,
  seed: java.lang.Long): PowerIterationClusteringModel = {
val picAlg = new PowerIterationClustering()
  .setK(k)
  .setMaxIterations(maxIterations)

try {
  picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK))
} finally {
  data.rdd.unpersist(blocking = false)
}
  }
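
For reference, a minimal Scala usage sketch of the underlying PowerIterationClustering API that the stub above wraps (assumes an existing SparkContext {{sc}}; the edge values are invented):

{code}
import org.apache.spark.mllib.clustering.PowerIterationClustering

// (srcId, dstId, similarity) triplets describing the affinity graph
val similarities = sc.parallelize(Seq(
  (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.1), (3L, 4L, 0.9)))

val model = new PowerIterationClustering()
  .setK(2)
  .setMaxIterations(20)
  .run(similarities)

// each assignment maps a vertex id to a cluster id
model.assignments.collect().foreach(println)
{code}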




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5963) [MLLIB] Python support for Power Iteration Clustering

2015-02-23 Thread Stephen Boesch (JIRA)
Stephen Boesch created SPARK-5963:
-

 Summary: [MLLIB] Python support for Power Iteration Clustering
 Key: SPARK-5963
 URL: https://issues.apache.org/jira/browse/SPARK-5963
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Stephen Boesch


Add Python support for the Power Iteration Clustering feature.  Here is a 
fragment of the Java-side stub backing the Python API, as we plan to implement it:

  /**
   * Java stub for Python mllib PowerIterationClustering.run()
   */
  def trainPowerIterationClusteringModel(
  data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)],
  k: Int,
  maxIterations: Int,
  runs: Int,
  initializationMode: String,
  seed: java.lang.Long): PowerIterationClusteringModel = {
val picAlg = new PowerIterationClustering()
  .setK(k)
  .setMaxIterations(maxIterations)

try {
  picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK))
} finally {
  data.rdd.unpersist(blocking = false)
}
  }




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver

2015-02-23 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334256#comment-14334256
 ] 

Chris Fregly commented on SPARK-5959:
-

Checkpointing at a specific sequence number is supported by the 
IRecordProcessorCheckpointer interface.
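
A minimal sketch of the store-then-ACK ordering, assuming the KCL's IRecordProcessorCheckpointer (which, as noted, supports checkpointing at a sequence number); {{storeReliably}} is a hypothetical stand-in for the receiver's blocking store() call:

{code}
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer

// Hypothetical stand-in: returns only once the batch is durably in the WAL.
def storeReliably(batch: Seq[Array[Byte]]): Unit = ???

def storeAndAck(
    batch: Seq[Array[Byte]],
    lastSequenceNumber: String,
    checkpointer: IRecordProcessorCheckpointer): Unit = {
  storeReliably(batch)                          // 1) write to the WAL first
  checkpointer.checkpoint(lastSequenceNumber)   // 2) only then ACK back to Kinesis
}
{code}

If the receiver dies between steps 1 and 2, the records are re-delivered and stored again, which matches the duplicate-but-no-loss semantics described above.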

> Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
> -
>
> Key: SPARK-5959
> URL: https://issues.apache.org/jira/browse/SPARK-5959
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Chris Fregly
>
> After each block is stored reliably in the WAL (after the store() call 
> returns),  ACK back to Kinesis.
> There is still the issue of the ReliableKinesisReceiver dying before the ACK 
> back to Kinesis, however no data will be lost.  Duplicate data is still 
> possible.
> Notes:
> * Make sure we're not overloading the checkpoint control plane which uses 
> DynamoDB.
> * May need to disable auto-checkpointing and remove the checkpoint interval.
> * Maintain compatibility with existing KinesisReceiver-based code.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5961) Allow specific nodes in a Spark Streaming cluster to be dedicated/preferred as Receiver Worker Nodes versus regular Spark Worker Nodes

2015-02-23 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-5961:
---

 Summary: Allow specific nodes in a Spark Streaming cluster to be 
dedicated/preferred as Receiver Worker Nodes versus regular Spark Worker Nodes
 Key: SPARK-5961
 URL: https://issues.apache.org/jira/browse/SPARK-5961
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly


This type of configuration has come up a lot where certain nodes should be 
dedicated as Spark Streaming Receivers and others as regular Spark Workers. 

The reasons include the following:
1) Different instance types/sizes for Receivers vs regular Workers
2) Different OS tuning params for Receivers vs regular Workers
...
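
One existing, partial hook worth noting: a custom receiver can already express a preferred host via {{preferredLocation}}. A minimal sketch (the host name and the empty start/stop bodies are placeholders):

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class PinnedReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  // Hint to the scheduler that this receiver should run on a dedicated node.
  override def preferredLocation: Option[String] = Some("receiver-node-1")

  override def onStart(): Unit = { /* start fetching and call store(...) */ }
  override def onStop(): Unit = { }
}
{code}

This only covers placement of the receiver itself, not the broader instance-type or OS-tuning separation described above.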

  
 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5535) Add parameter for storage levels.

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5535:
-
Assignee: (was: Xiangrui Meng)

> Add parameter for storage levels.
> -
>
> Key: SPARK-5535
> URL: https://issues.apache.org/jira/browse/SPARK-5535
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>
> Add a special parameter type for storage levels that takes both StorageLevels 
> and their string representation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2138) The KMeans algorithm in the MLlib can lead to the Serialized Task size become bigger and bigger

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-2138.

  Resolution: Not a Problem
Target Version/s:   (was: 1.4.0)

I'm closing this issue with setting a larger akka.frameSize as the workaround. 
SPARK-3424 might be relevant.
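
For anyone hitting this, a minimal sketch of that workaround (128 is an arbitrary example; the value is in MB):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kmeans-large-tasks")
  .set("spark.akka.frameSize", "128") // raise the Akka frame size limit
val sc = new SparkContext(conf)
{code}

The same setting can also be passed at submit time via {{--conf spark.akka.frameSize=128}}.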

> The KMeans algorithm in the MLlib can lead to the Serialized Task size become 
> bigger and bigger
> ---
>
> Key: SPARK-2138
> URL: https://issues.apache.org/jira/browse/SPARK-2138
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 0.9.0, 0.9.1
>Reporter: DjvuLee
>Assignee: Xiangrui Meng
>  Labels: clustering
>
> When the algorithm reaches a certain stage and runs the reduceByKey() 
> function, it can lead to executor loss and task loss; after several such 
> failures the application exits.
> When this error occurs, the size of the serialized task is bigger than 10MB, 
> and it becomes larger as the iterations increase.
> the data generation file: https://gist.github.com/djvulee/7e3b2c9eb33ff0037622
> the running code: https://gist.github.com/djvulee/6bf00e60885215e3bfd5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()

2015-02-23 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-5960:
---

 Summary: Allow AWS credentials to be passed to 
KinesisUtils.createStream()
 Key: SPARK-5960
 URL: https://issues.apache.org/jira/browse/SPARK-5960
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly


While IAM roles are preferable, we're seeing a lot of cases where we need to 
pass AWS credentials when creating the KinesisReceiver.

Notes:
* Make sure we don't log the credentials anywhere
* Maintain compatibility with existing KinesisReceiver-based code.
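
A hypothetical sketch of how explicit credentials could be threaded through ({{createStreamWithCredentials}} and its parameters are invented for illustration; {{BasicAWSCredentials}} is the real AWS SDK class):

{code}
import com.amazonaws.auth.BasicAWSCredentials

// Hypothetical proposal sketch, not an existing KinesisUtils method.
def createStreamWithCredentials(
    streamName: String,
    endpointUrl: String,
    awsAccessKeyId: String,
    awsSecretKey: String): Unit = {
  val credentials = new BasicAWSCredentials(awsAccessKeyId, awsSecretKey)
  // Hand `credentials` to the KinesisReceiver instead of the default
  // IAM-role provider chain; never log the key values themselves.
  ???
}
{code}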
 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver

2015-02-23 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-5959:

  Component/s: Streaming
  Description: 
After each block is stored reliably in the WAL (after the store() call 
returns),  ACK back to Kinesis.

There is still the issue of the ReliableKinesisReceiver dying before the ACK 
back to Kinesis, however no data will be lost.  Duplicate data is still 
possible.

Notes:
* Make sure we're not overloading the checkpoint control plane which uses 
DynamoDB.
* May need to disable auto-checkpointing and remove the checkpoint interval.
* Maintain compatibility with existing KinesisReceiver-based code.
 


  was:
After each block is stored reliably in the WAL (after the store() call 
returns),  ACK back to Kinesis.

There is still the issue of the ReliableKinesisReceiver dying before the ACK 
back to Kinesis, however no data will be lost.  Duplicate data is still 
possible.

Notes
* Make sure we're not overloading the checkpoint control plane which uses 
DynamoDB.
* May need to disable auto-checkpointing and remove the checkpoint interval.

 


 Target Version/s: 1.4.0
Affects Version/s: 1.1.0
Fix Version/s: (was: 1.4.0)

> Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
> -
>
> Key: SPARK-5959
> URL: https://issues.apache.org/jira/browse/SPARK-5959
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Chris Fregly
>
> After each block is stored reliably in the WAL (after the store() call 
> returns),  ACK back to Kinesis.
> There is still the issue of the ReliableKinesisReceiver dying before the ACK 
> back to Kinesis, however no data will be lost.  Duplicate data is still 
> possible.
> Notes:
> * Make sure we're not overloading the checkpoint control plane which uses 
> DynamoDB.
> * May need to disable auto-checkpointing and remove the checkpoint interval.
> * Maintain compatibility with existing KinesisReceiver-based code.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver

2015-02-23 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-5959:
---

 Summary: Create a ReliableKinesisReceiver similar to the 
ReliableKafkaReceiver
 Key: SPARK-5959
 URL: https://issues.apache.org/jira/browse/SPARK-5959
 Project: Spark
  Issue Type: Improvement
Reporter: Chris Fregly
 Fix For: 1.4.0


After each block is stored reliably in the WAL (after the store() call 
returns),  ACK back to Kinesis.

There is still the issue of the ReliableKinesisReceiver dying before the ACK 
back to Kinesis, however no data will be lost.  Duplicate data is still 
possible.

Notes
* Make sure we're not overloading the checkpoint control plane which uses 
DynamoDB.
* May need to disable auto-checkpointing and remove the checkpoint interval.

 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5958) Update block matrix user guide

2015-02-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5958:


 Summary: Update block matrix user guide
 Key: SPARK-5958
 URL: https://issues.apache.org/jira/browse/SPARK-5958
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor


There are some minor issues (see the PR) with the current block matrix user 
guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5958) Update block matrix user guide

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334231#comment-14334231
 ] 

Apache Spark commented on SPARK-5958:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4737

> Update block matrix user guide
> --
>
> Key: SPARK-5958
> URL: https://issues.apache.org/jira/browse/SPARK-5958
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> There are some minor issues (see the PR) with the current block matrix user 
> guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5910) DataFrame.selectExpr("col as newName") does not work

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334230#comment-14334230
 ] 

Apache Spark commented on SPARK-5910:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/4736

> DataFrame.selectExpr("col as newName") does not work
> 
>
> Key: SPARK-5910
> URL: https://issues.apache.org/jira/browse/SPARK-5910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> {code}
> val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
> sqlContext.jsonRDD(rdd).selectExpr("a as newName")
> {code}
> {code}
> java.lang.RuntimeException: [1.3] failure: ``or'' expected but `as' found
> a as newName
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
> {code}
> For selectExpr, we need to use projection parser instead of expression parser 
> (which cannot parse AS).
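
Until the parser is switched, the Column-based API should provide an equivalent aliasing path (sketch, reusing the DataFrame from the snippet above):

{code}
val df = sqlContext.jsonRDD(rdd)
df.select(df("a").as("newName"))
{code}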



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT

2015-02-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-5532:
---

Assignee: Michael Armbrust  (was: Cheng Lian)

> Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
> 
>
> Key: SPARK-5532
> URL: https://issues.apache.org/jira/browse/SPARK-5532
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Michael Armbrust
>Priority: Critical
>
> Deterministic failure:
> {code}
> import org.apache.spark.mllib.linalg._
> import org.apache.spark.sql.SQLContext
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val data = sc.parallelize(Seq((1.0, 
> Vectors.dense(1,2,3)))).toDataFrame("label", "features")
> data.repartition(1).saveAsParquetFile("blah")
> {code}
> If you remove the repartition, then this succeeds.
> Here's the stack trace:
> {code}
> 15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, 
> 192.168.1.230): java.lang.ClassCastException: 
> org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
> org.apache.spark.sql.Row
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:64)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 7, 192.168.1.230): java.lang.ClassCastException: 
> org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
> org.apache.spark.sql.Row
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:64)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$

[jira] [Resolved] (SPARK-5873) Can't see partially analyzed plans

2015-02-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5873.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

> Can't see partially analyzed plans
> --
>
> Key: SPARK-5873
> URL: https://issues.apache.org/jira/browse/SPARK-5873
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Our analysis checks are great for users who make mistakes but make it 
> impossible to see what is going wrong when there is a bug in the analyzer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark

2015-02-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5722.
-
   Resolution: Fixed
Fix Version/s: 1.2.2

Issue resolved by pull request 4681
[https://github.com/apache/spark/pull/4681]

> Infer_schema_type incorrect for Integers in pyspark
> ---
>
> Key: SPARK-5722
> URL: https://issues.apache.org/jira/browse/SPARK-5722
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
>Reporter: Don Drake
> Fix For: 1.2.2
>
>
> The Integers datatype in Python does not match what a Scala/Java integer is 
> defined as.   This causes inference of data types and schemas to fail when 
> data is larger than 2^32 and it is inferred incorrectly as an Integer.
> Since the range of valid Python integers is wider than Java Integers, this 
> causes problems when inferring Integer vs. Long datatypes.  This will cause 
> problems when attempting to save SchemaRDD as Parquet or JSON.
> Here's an example:
> {code}
> >>> sqlCtx = SQLContext(sc)
> >>> from pyspark.sql import Row
> >>> rdd = sc.parallelize([Row(f1='a', f2=100)])
> >>> srdd = sqlCtx.inferSchema(rdd)
> >>> srdd.schema()
> StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
> {code}
> That number is a LongType in Java, but an Integer in python.  We need to 
> check the value to see if it should really by a LongType when a IntegerType 
> is initially inferred.
> More tests:
> {code}
> >>> from pyspark.sql import _infer_type
> # OK
> >>> print _infer_type(1)
> IntegerType
> # OK
> >>> print _infer_type(2**31-1)
> IntegerType
> #WRONG
> >>> print _infer_type(2**31)
> IntegerType
> #WRONG
> >>> print _infer_type(2**61 )
> IntegerType
> #OK
> >>> print _infer_type(2**71 )
> LongType
> {code}
> Java Primitive Types defined:
> http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
> Python Built-in Types:
> https://docs.python.org/2/library/stdtypes.html#typesnumeric
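
The boundary in question, as a standalone Scala sketch (illustrative only, not the pyspark implementation):

{code}
// Widen to a long type whenever the value falls outside the 32-bit
// Java Integer range.
def inferIntegralTypeName(v: BigInt): String =
  if (v >= Int.MinValue && v <= Int.MaxValue) "IntegerType" else "LongType"

// inferIntegralTypeName(BigInt(2).pow(31) - 1)  == "IntegerType"
// inferIntegralTypeName(BigInt(2).pow(31))      == "LongType"
{code}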



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5935) Accept MapType in the schema provided to a JSON dataset.

2015-02-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5935.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

> Accept MapType in the schema provided to a JSON dataset.
> 
>
> Key: SPARK-5935
> URL: https://issues.apache.org/jira/browse/SPARK-5935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5124) Standardize internal RPC interface

2015-02-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334166#comment-14334166
 ] 

Marcelo Vanzin edited comment on SPARK-5124 at 2/24/15 1:08 AM:


Hi [~zsxwing], I briefly looked over the design and parts of the 
implementation, and I have some comments. Overall this looks OK, even if it 
kinda does look a lot like akka's actor system, but I don't think there would 
be a problem implementing it over a different RPC library.

One thing that I found very confusing is {{RpcEndpoint}} vs. 
{{NetworkRpcEndpoint}}. The "R" in RPC means "remote", which sort of implies 
networking. I think Reynold touched on this when he talked about using an event 
loop for "local actors". Perhaps a more flexible approach would be something 
like:

- A generic {{Endpoint}} that defines the {{receive()}} method and other 
generic interfaces
- {{RpcEnv}} takes an {{Endpoint}} and a name to register endpoints for remote 
connections. Things like "onConnected" and "onDisassociated" become messages 
passed to {{receive()}}, kinda like in akka today. This means that there's no 
need for a specialized {{RpcEndpoint}} interface.
- "local" Endpoints could be exposed directly, without the need for an 
{{RpcEnv}}. For example, as a getter in {{SparkEnv}}. You could have some 
wrapper to expose {{send()}} and {{ask()}} so that the client interface looks 
similar for remote and local endpoints.
- The default {{Endpoint}} has no thread-safety guarantees. You can wrap an 
{{Endpoint}} in an {{EventLoop}} if you want messages to be handled using a 
queue, or synchronize your {{receive()}} method (although that can block the 
dispatcher thread, which could be bad). But this would easily allow actors to 
process multiple messages concurrently if desired.

What do you think?




was (Author: vanzin):
Hi [~zsxwing], I briefly looked over the design and parts of the 
implementation, and I have some comments. Overall this looks OK, even if it 
kinda does look a lot like akka's actor system, but I don't think there would 
be a problem implementing it over a different RPC library.

One thing that I found very confusing is {{RpcEndpoint}} vs. 
{{NetworkRpcEndpoint}}. The "R" in RPC means "remote", which sort of implies 
networking. I think Reynold touched on this when he talked about using an event 
loop for "local actors". Perhaps a more flexible approach would be something 
like:

- A generic {{Endpoint}} that defines the {{receive()}} method and other 
generic interfaces
- {{RpcEnv}} takes an {{Endpoint}} and a name to register endpoints for remote 
connections. Things like "onConnected" and "onDisassociated" become messages 
passed to {{receive()}}, kinda like in akka today. This means that there's no 
need for a specialized {{RpcEndpoint}} interface.
- "local" Endpoints could be exposed directly, without the need for an 
{{RpcEnv}}. For example, as a getter in {{SparkEnv}}. You could have some 
wrapper to expose {{send()}} and {{ask()}} so that the client interface looks 
similar for remote and local endpoints.
- The default {{Endpoint}} has no thread-safety guarantees. You can wrap an 
{{Endpoint}} in an {{EventLoop}} if you want messages to be handled using a 
queue, or synchronize your {{receive()}} method (although that can block the 
dispatcher thread, which could be bad). But this would easily allow actors to 
process multiple messages concurrently is desired.

What do you think?



> Standardize internal RPC interface
> --
>
> Key: SPARK-5124
> URL: https://issues.apache.org/jira/browse/SPARK-5124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
> Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf
>
>
> In Spark we use Akka as the RPC layer. It would be great if we can 
> standardize the internal RPC interface to facilitate testing. This will also 
> provide the foundation to try other RPC implementations in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5293) Enable Spark user applications to use different versions of Akka

2015-02-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-5293:
--
Issue Type: Umbrella  (was: Improvement)

> Enable Spark user applications to use different versions of Akka
> 
>
> Key: SPARK-5293
> URL: https://issues.apache.org/jira/browse/SPARK-5293
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Reynold Xin
>
> A lot of Spark user applications are using (or want to use) Akka. Akka as a 
> whole can contribute great architectural simplicity and uniformity. However, 
> because Spark depends on Akka, it is not possible for users to rely on 
> different versions, and we have received many requests in the past asking for 
> help about this specific issue. For example, Spark Streaming might be used as 
> the receiver of Akka messages - but our dependency on Akka requires the 
> upstream Akka actors to also use the identical version of Akka.
> Since our usage of Akka is limited (mainly for RPC and single-threaded event 
> loop), we can replace it with alternative RPC implementations and a common 
> event loop in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5214) Add EventLoop and change DAGScheduler to an EventLoop

2015-02-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-5214:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-5293

> Add EventLoop and change DAGScheduler to an EventLoop
> -
>
> Key: SPARK-5214
> URL: https://issues.apache.org/jira/browse/SPARK-5214
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.3.0
>
>
> As per discussion in SPARK-5124, DAGScheduler can simply use a queue & event 
> loop to process events. It would be great when we want to decouple Akka in 
> the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5293) Enable Spark user applications to use different versions of Akka

2015-02-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334195#comment-14334195
 ] 

Marcelo Vanzin commented on SPARK-5293:
---

I converted this bug to an umbrella so that it's easier to track sub-tasks. 
Hope that's ok for everybody.

> Enable Spark user applications to use different versions of Akka
> 
>
> Key: SPARK-5293
> URL: https://issues.apache.org/jira/browse/SPARK-5293
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Reynold Xin
>
> A lot of Spark user applications are using (or want to use) Akka. Akka as a 
> whole can contribute great architectural simplicity and uniformity. However, 
> because Spark depends on Akka, it is not possible for users to rely on 
> different versions, and we have received many requests in the past asking for 
> help about this specific issue. For example, Spark Streaming might be used as 
> the receiver of Akka messages - but our dependency on Akka requires the 
> upstream Akka actors to also use the identical version of Akka.
> Since our usage of Akka is limited (mainly for RPC and single-threaded event 
> loop), we can replace it with alternative RPC implementations and a common 
> event loop in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5124) Standardize internal RPC interface

2015-02-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-5124:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-5293

> Standardize internal RPC interface
> --
>
> Key: SPARK-5124
> URL: https://issues.apache.org/jira/browse/SPARK-5124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
> Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf
>
>
> In Spark we use Akka as the RPC layer. It would be great if we can 
> standardize the internal RPC interface to facilitate testing. This will also 
> provide the foundation to try other RPC implementations in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2015-02-23 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334168#comment-14334168
 ] 

Brennon York commented on SPARK-5312:
-

I'm not so sure the sbt {{show compile:discoveredMainClasses}} call will give 
the output we want in this case. For reference, the documentation of that 
call states that "sbt detects the classes with public, static main methods for 
use by the run method", so to replicate the grep/sed code we'd need an sbt 
call that gathers *all* {{public}} classes, not just those with a {{public static main}} 
method. If the goal is merely to rework the grep/sed code I could see what I 
can do, but I'm not sure it's going to get much better. I checked the sbt docs 
for something resembling what we're looking for, and it doesn't seem they have 
that feature. Thoughts?

> Use sbt to detect new or changed public classes in PRs
> --
>
> Key: SPARK-5312
> URL: https://issues.apache.org/jira/browse/SPARK-5312
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We currently use an [unwieldy grep/sed 
> contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
>  to detect new public classes in PRs.
> Apparently, sbt lets you get a list of public classes [much more 
> directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
> {{show compile:discoveredMainClasses}}. We should use that instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5124) Standardize internal RPC interface

2015-02-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334166#comment-14334166
 ] 

Marcelo Vanzin commented on SPARK-5124:
---

Hi [~zsxwing], I briefly looked over the design and parts of the 
implementation, and I have some comments. Overall this looks OK, even if it 
kinda does look a lot like akka's actor system, but I don't think there would 
be a problem implementing it over a different RPC library.

One thing that I found very confusing is {{RpcEndpoint}} vs. 
{{NetworkRpcEndpoint}}. The "R" in RPC means "remote", which sort of implies 
networking. I think Reynold touched on this when he talked about using an event 
loop for "local actors". Perhaps a more flexible approach would be something 
like:

- A generic {{Endpoint}} that defines the {{receive()}} method and other 
generic interfaces
- {{RpcEnv}} takes an {{Endpoint}} and a name to register endpoints for remote 
connections. Things like "onConnected" and "onDisassociated" become messages 
passed to {{receive()}}, kinda like in akka today. This means that there's no 
need for a specialized {{RpcEndpoint}} interface.
- "local" Endpoints could be exposed directly, without the need for an 
{{RpcEnv}}. For example, as a getter in {{SparkEnv}}. You could have some 
wrapper to expose {{send()}} and {{ask()}} so that the client interface looks 
similar for remote and local endpoints.
- The default {{Endpoint}} has no thread-safety guarantees. You can wrap an 
{{Endpoint}} in an {{EventLoop}} if you want messages to be handled using a 
queue, or synchronize your {{receive()}} method (although that can block the 
dispatcher thread, which could be bad). But this would easily allow actors to 
process multiple messages concurrently if desired.

What do you think?
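
To make the layering above concrete, a minimal Scala sketch (all names here, Endpoint, RpcEnv, EventLoopEndpoint, are hypothetical, not Spark APIs):

{code}
// A generic endpoint: no thread-safety guarantees by itself.
trait Endpoint {
  def receive(message: Any): Unit
}

// Registering an Endpoint under a name is what makes it remotely reachable.
trait RpcEnv {
  def register(name: String, endpoint: Endpoint): Unit
}

// Optional wrapper: serialize message handling through a queue instead of
// synchronizing receive() (and potentially blocking the dispatcher thread).
class EventLoopEndpoint(underlying: Endpoint) extends Endpoint {
  private val queue = new java.util.concurrent.LinkedBlockingQueue[Any]()
  private val worker = new Thread(new Runnable {
    override def run(): Unit = while (true) underlying.receive(queue.take())
  })
  worker.setDaemon(true)
  worker.start()

  override def receive(message: Any): Unit = queue.put(message)
}
{code}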



> Standardize internal RPC interface
> --
>
> Key: SPARK-5124
> URL: https://issues.apache.org/jira/browse/SPARK-5124
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
> Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf
>
>
> In Spark we use Akka as the RPC layer. It would be great if we can 
> standardize the internal RPC interface to facilitate testing. This will also 
> provide the foundation to try other RPC implementations in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3436) Streaming SVM

2015-02-23 Thread Anand Iyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334161#comment-14334161
 ] 

Anand Iyer commented on SPARK-3436:
---

What is the original Jira (that this duplicates)?

> Streaming SVM 
> --
>
> Key: SPARK-3436
> URL: https://issues.apache.org/jira/browse/SPARK-3436
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Liquan Pei
>
> Implement online learning with kernels according to 
> http://users.cecs.anu.edu.au/~williams/papers/P172.pdf
> The algorithms proposed in the above paper are implemented in R 
> (http://users.cecs.anu.edu.au/~williams/papers/P172.pdf) and MADlib 
> (http://doc.madlib.net/latest/group__grp__kernmach.html)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5957) Better handling of default parameter values.

2015-02-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5957:


 Summary: Better handling of default parameter values.
 Key: SPARK-5957
 URL: https://issues.apache.org/jira/browse/SPARK-5957
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Xiangrui Meng


We store the default value of a parameter in the Param instance. In many cases, 
the default value depends on the algorithm and other parameters defined in the 
same algorithm. We need to think of a better approach to handling default parameter 
values.
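
A hypothetical sketch of one direction (names invented; not the ml.param API): compute the default lazily from the owning algorithm's other parameters instead of freezing it in the Param instance.

{code}
// Hypothetical sketch only.
class ParamWithDefault[T](val name: String, defaultFn: () => T) {
  def default: T = defaultFn()
}

class SolverParamsSketch {
  val maxIter = new ParamWithDefault[Int]("maxIter", () => 100)
  // This default depends on another parameter of the same algorithm.
  val tol = new ParamWithDefault[Double](
    "tol", () => if (maxIter.default > 50) 1e-6 else 1e-4)
}
{code}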



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5956) Transformer/Estimator should be copyable.

2015-02-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5956:


 Summary: Transformer/Estimator should be copyable.
 Key: SPARK-5956
 URL: https://issues.apache.org/jira/browse/SPARK-5956
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Xiangrui Meng


In a pipeline, we don't save additional params specified in `fit()` to 
transformers, because we should not modify them. The current solution is to 
store training parameters in the pipeline model and apply those parameters at 
`transform()`. A better solution would be making transformers copyable. Calling 
`.copy` on a transformer produces a new transformer with a different UID but 
same parameters. Then we can use the copied transformers in the pipeline model, 
with additional params stored.

`copy` may not be a good name because it is not an exact copy.
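
A hypothetical sketch of the copy semantics described above (not the ml API): copying yields a fresh UID but the same parameter map, so extra fit-time params can be attached without touching the original.

{code}
import java.util.UUID

// Hypothetical sketch only.
case class TransformerSketch(uid: String, params: Map[String, Any]) {
  def copyWithExtra(extra: Map[String, Any]): TransformerSketch =
    TransformerSketch(UUID.randomUUID().toString, params ++ extra)
}
{code}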



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5886) Add LabelIndexer

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334137#comment-14334137
 ] 

Apache Spark commented on SPARK-5886:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4735

> Add LabelIndexer
> 
>
> Key: SPARK-5886
> URL: https://issues.apache.org/jira/browse/SPARK-5886
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> `LabelIndexer` takes a column of labels (raw categories) and outputs an 
> integer column with labels indexed by their frequency.
> {code}
> val li = new LabelIndexer()
>   .setInputCol("country")
>   .setOutputCol("countryIndex")
> {code}
> In the output column, we should store the label to index map as an ML 
> attribute. The index should be ordered by frequency, where the most frequent 
> label gets index 0, to enhance sparsity.
> We can discuss whether this should index multiple columns at the same time.
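
For reference, a minimal sketch of the frequency-ordered indexing described above (illustrative only, not the LabelIndexer implementation):

{code}
// Most frequent label gets index 0; ties broken alphabetically here.
def buildLabelIndex(labels: Seq[String]): Map[String, Int] =
  labels
    .groupBy(identity)
    .mapValues(_.size)
    .toSeq
    .sortBy { case (label, count) => (-count, label) }
    .map(_._1)
    .zipWithIndex
    .toMap
{code}

e.g. buildLabelIndex(Seq("us", "us", "fr", "de", "us", "fr")) yields Map(us -> 0, fr -> 1, de -> 2).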



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5886) Add LabelIndexer

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5886:
-
Assignee: Xiangrui Meng

> Add LabelIndexer
> 
>
> Key: SPARK-5886
> URL: https://issues.apache.org/jira/browse/SPARK-5886
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> `LabelIndexer` takes a column of labels (raw categories) and outputs an 
> integer column with labels indexed by their frequency.
> {code}
> val li = new LabelIndexer()
>   .setInputCol("country")
>   .setOutputCol("countryIndex")
> {code}
> In the output column, we should store the label to index map as an ML 
> attribute. The index should be ordered by frequency, where the most frequent 
> label gets index 0, to enhance sparsity.
> We can discuss whether this should index multiple columns at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-02-23 Thread ding (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334107#comment-14334107
 ] 

ding commented on SPARK-4105:
-

I am taking annual leave from 2/14 - 2/26. Sorry for the inconvenience.

Best regards
DingDing


> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.3.0, 1.2.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here's another occurrence of a similar error:
> {code}
> java.io.IOException: fai

[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-02-23 Thread Yashwanth Rao Dannamaneni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334059#comment-14334059
 ] 

Yashwanth Rao Dannamaneni commented on SPARK-4105:
--

I have a similar issue. I have a job which occasionally fails with this error. 
I am using Spark 1.2.0 with HDP 2.1.1. 

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.3.0, 1.2.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here's another occ

[jira] [Commented] (SPARK-4123) Show new dependencies added in pull requests

2015-02-23 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334049#comment-14334049
 ] 

Brennon York commented on SPARK-4123:
-

[~nchammas] have you started this? If not I can take this off your hands, just 
let me know!

> Show new dependencies added in pull requests
> 
>
> Key: SPARK-4123
> URL: https://issues.apache.org/jira/browse/SPARK-4123
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Priority: Critical
>
> We should inspect the classpath of Spark's assembly jar for every pull 
> request. This only takes a few seconds in Maven and it will help weed out 
> dependency changes from the master branch. Ideally we'd post any dependency 
> changes in the pull request message.
> {code}
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
> $ git checkout apache/master
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
> $ diff my-classpath master-classpath
> < chill-java-0.3.6.jar
> < chill_2.10-0.3.6.jar
> ---
> > chill-java-0.5.0.jar
> > chill_2.10-0.5.0.jar
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun

2015-02-23 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334039#comment-14334039
 ] 

Sandy Ryza commented on SPARK-5490:
---

The relevant JIRA is SPARK-732, but it's marked as "Won't Fix", so it's unclear 
whether that's the solution.

> KMeans costs can be incorrect if tasks need to be rerun
> ---
>
> Key: SPARK-5490
> URL: https://issues.apache.org/jira/browse/SPARK-5490
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>  Labels: clustering
>
> KMeans uses accumulators to compute the cost of a clustering at each 
> iteration.
> Each time a ShuffleMapTask completes, it increments the accumulators at the 
> driver.  If a task runs twice because of failures, the accumulators get 
> incremented twice.
> KMeans uses accumulators in ShuffleMapTasks.  This means that a task's cost 
> can end up being double-counted.
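
A minimal sketch of the failure mode described above, assuming a SparkContext `sc`
is available; the data, centers, and structure are illustrative only, not the MLlib
source:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Sketch: a cost accumulator updated inside a map stage is incremented again
// whenever a map task is re-executed, e.g. after a shuffle fetch failure.
object AccumulatorDoubleCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("accumulator-double-count"))
    val points  = sc.parallelize(Seq(Array(0.0, 0.0), Array(1.0, 1.0), Array(9.0, 9.0)))
    val centers = Array(Array(0.0, 0.0), Array(10.0, 10.0))   // illustrative fixed centers

    val costAccum = sc.accumulator(0.0)
    val assignments = points.map { p =>
      val dists = centers.map(c => c.zip(p).map { case (a, b) => (a - b) * (a - b) }.sum)
      val best  = dists.indexOf(dists.min)
      costAccum += dists(best)       // this side effect runs again if the task is rerun
      (best, p.toSeq)
    }
    assignments.groupByKey().count() // shuffle: a fetch failure can rerun the map tasks
    println(costAccum.value)         // may overestimate the true cost after reruns
    sc.stop()
  }
}
{code}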



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2336) Approximate k-NN Models for MLLib

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2336:
-
Labels: clustering features  (was: clustering features newbie)

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.
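
A rough sketch of the partition-local pattern alluded to above: each partition finds
its own k nearest candidates to a query (here with a plain linear scan; the kd-tree
variant mentioned in the description would replace that scan) and the driver merges
the partial results. All names are illustrative:

{code}
import org.apache.spark.rdd.RDD

object PartitionedKnn {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def knn(data: RDD[Point], query: Point, k: Int): Array[(Double, Point)] = {
    val localTopK = data.mapPartitions { iter =>
      // A per-partition kd-tree (or similar index) would be queried here instead.
      iter.map(p => (sqDist(query, p), p)).toSeq.sortBy(_._1).take(k).iterator
    }
    localTopK.collect().sortBy(_._1).take(k)  // merge the per-partition candidates
  }
}
{code}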



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2308:
-
Labels: clustering  (was: )

> Add KMeans MiniBatch clustering algorithm to MLlib
> --
>
> Key: SPARK-2308
> URL: https://issues.apache.org/jira/browse/SPARK-2308
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>Priority: Minor
>  Labels: clustering
> Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
> data points in each iteration instead of the full set of data points, 
> improving performance (and in some cases, accuracy).  The mini-batch version 
> is compatible with the KMeans|| initialization algorithm currently 
> implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.
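
A rough sketch of one mini-batch update step, assuming squared Euclidean distance and
a per-center learning rate; the names and structure are illustrative, not the MLlib
implementation or a proposed API:

{code}
import org.apache.spark.rdd.RDD

object MiniBatchKMeansSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def closest(centers: Array[Point], p: Point): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // One iteration: work on a random sample of the data instead of the full RDD.
  def step(data: RDD[Point], centers: Array[Point], fraction: Double, seed: Long): Array[Point] = {
    val batch   = data.sample(withReplacement = false, fraction, seed).collect()
    val updated = centers.map(_.clone())
    val counts  = Array.fill(centers.length)(0L)
    for (p <- batch) {
      val c = closest(updated, p)
      counts(c) += 1
      val eta = 1.0 / counts(c)                 // per-center learning rate
      for (d <- p.indices) updated(c)(d) = (1 - eta) * updated(c)(d) + eta * p(d)
    }
    updated
  }
}
{code}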



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2344:
-
Labels: clustering  (was: )

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Priority: Minor
>  Labels: clustering
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented; they differ only 
> in the degree of relationship each point has with each cluster (in FCM the 
> relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its class)
> I'd like this to be assigned to me.
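
The standard fuzzy membership update, shown as a small local sketch with fuzzifier
m > 1 (the memberships lie in [0, 1] and sum to 1 over the clusters, in contrast to
K-Means' hard 0/1 assignment); this is illustrative, not a proposed MLlib API:

{code}
object FuzzyMembership {
  // u_j = 1 / sum_k (d_j / d_k)^(2/(m-1)), where d_j is the distance to center j.
  def memberships(distances: Array[Double], m: Double = 2.0): Array[Double] = {
    require(m > 1.0, "fuzzifier must be > 1")
    val exp = 2.0 / (m - 1.0)
    distances.map { dj =>
      if (dj == 0.0) 1.0  // point coincides with a center: full membership there
      else 1.0 / distances.map(dk => math.pow(dj / dk, exp)).sum
    }
  }
}
{code}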



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3261:
-
Labels: clustering  (was: )

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Assignee: Derrick Burns
>  Labels: clustering
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2335:
-
Labels: clustering features  (was: features)

> k-Nearest Neighbor classification and regression for MLLib
> --
>
> Key: SPARK-2335
> URL: https://issues.apache.org/jira/browse/SPARK-2335
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> The k-Nearest Neighbor model for classification and regression problems is a 
> simple and intuitive approach, offering a straightforward path to creating 
> non-linear decision/estimation contours. Its downsides -- high variance 
> (sensitivity to the known training data set) and computational intensity for 
> estimating new point labels -- both play to Spark's big data strengths: lots 
> of data mitigates data concerns; lots of workers mitigate computational 
> latency. 
> We should include kNN models as options in MLLib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2336) Approximate k-NN Models for MLLib

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2336:
-
Labels: clustering features newbie  (was: features newbie)

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2138) The KMeans algorithm in the MLlib can lead to the Serialized Task size become bigger and bigger

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2138:
-
Labels: clustering  (was: )

> The KMeans algorithm in the MLlib can lead to the Serialized Task size become 
> bigger and bigger
> ---
>
> Key: SPARK-2138
> URL: https://issues.apache.org/jira/browse/SPARK-2138
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 0.9.0, 0.9.1
>Reporter: DjvuLee
>Assignee: Xiangrui Meng
>  Labels: clustering
>
> When the algorithm reaches a certain stage and runs the reduceByKey() 
> function, it can lead to executor loss and task loss; after several such 
> failures, the application exits.
> When this error occurred, the size of serialized task is bigger than 10MB, 
> and the size become larger as the iteration increase.
> the data generation file: https://gist.github.com/djvulee/7e3b2c9eb33ff0037622
> the running code: https://gist.github.com/djvulee/6bf00e60885215e3bfd5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3504) KMeans optimization: track distances and unmoved cluster centers across iterations

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3504:
-
Labels: clustering  (was: )

> KMeans optimization: track distances and unmoved cluster centers across 
> iterations
> --
>
> Key: SPARK-3504
> URL: https://issues.apache.org/jira/browse/SPARK-3504
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>  Labels: clustering
>
> The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because it 
> recomputes all distances to all cluster centers on each iteration.  In later 
> iterations of Lloyd's algorithm, points don't change clusters and clusters 
> don't move.  
> By 1) tracking which clusters move and 2) tracking for each point which 
> cluster it belongs to and the distance to that cluster, one can avoid 
> recomputing distances in many cases with very little increase in memory 
> requirements. 
> I implemented this new algorithm and the results were fantastic. Using 16 
> c3.8xlarge machines on EC2,  the clusterer converged in 13 iterations on 
> 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here 
> are the running times for the first 7 rounds:
> 6 minutes and 42 seconds
> 7 minutes and 7 seconds
> 7 minutes 13 seconds
> 1 minute 18 seconds
> 30 seconds
> 18 seconds
> 12 seconds
> Without this improvement, all rounds would have taken roughly 7 minutes, 
> resulting in Lloyd's iterations taking  7 * 13 = 91 minutes. In other words, 
> this improvement resulted in a reduction of roughly 75% in running time with 
> no loss of accuracy.
> My implementation is a rewrite of the existing 1.0.2 implementation.  It is 
> not a simple modification of the existing implementation.  Please let me know 
> if you are interested in this new implementation.  
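
A local (non-distributed) sketch of the bookkeeping described above, under the
assumption of squared Euclidean distance: cache each point's assigned center and its
distance to it, then on the next iteration only recompute distances against centers
that moved (falling back to a full scan if the point's own center moved). Names and
structure are illustrative, not the rewritten clusterer itself:

{code}
object TrackedAssignment {
  type Point = Array[Double]
  case class Assignment(center: Int, dist: Double)

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def update(p: Point, prev: Assignment, centers: Array[Point], moved: Set[Int]): Assignment = {
    if (moved.contains(prev.center)) {
      // Our own center moved: its cached distance is stale, so do a full scan.
      val dists = centers.map(c => sqDist(p, c))
      val best = dists.indexOf(dists.min)
      Assignment(best, dists(best))
    } else {
      // Our center is unchanged, so its distance is still valid; only centers
      // that moved could possibly have become closer than it.
      var best = prev
      for (j <- moved) {
        val d = sqDist(p, centers(j))
        if (d < best.dist) best = Assignment(j, d)
      }
      best
    }
  }
}
{code}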



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces

2015-02-23 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334038#comment-14334038
 ] 

Brennon York commented on SPARK-3850:
-

This made it into the [master 
branch|https://github.com/apache/spark/blob/master/scalastyle-config.xml#L54] 
at some point already. We can close this issue.

/cc [~srowen] [~pwendell]

> Scala style: disallow trailing spaces
> -
>
> Key: SPARK-3850
> URL: https://issues.apache.org/jira/browse/SPARK-3850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Nicholas Chammas
>
> [Ted Yu on the dev 
> list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
>  suggested using {{WhitespaceEndOfLineChecker}} here: 
> http://www.scalastyle.org/rules-0.1.0.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3219:
-
Labels: clustering  (was: )

> K-Means clusterer should support Bregman distance functions
> ---
>
> Key: SPARK-3219
> URL: https://issues.apache.org/jira/browse/SPARK-3219
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Derrick Burns
>Assignee: Derrick Burns
>  Labels: clustering
>
> The K-Means clusterer supports the Euclidean distance metric.  However, it is 
> rather straightforward to support Bregman 
> (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
> distance functions which would increase the utility of the clusterer 
> tremendously.
> I have modified the clusterer to support pluggable distance functions.  
> However, I notice that there are hundreds of outstanding pull requests.  If 
> someone is willing to work with me to sponsor the work through the process, I 
> will create a pull request.  Otherwise, I will just keep my own fork.
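
A sketch of what a pluggable divergence could look like, with squared Euclidean and
generalized Kullback-Leibler (both Bregman divergences) as two instances; this is
illustrative and not the interface in the fork mentioned above:

{code}
trait Divergence extends Serializable {
  def apply(x: Array[Double], y: Array[Double]): Double
}

object SquaredEuclidean extends Divergence {
  def apply(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
}

object GeneralizedKL extends Divergence {
  // Assumes strictly positive coordinates.
  def apply(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * math.log(a / b) - a + b }.sum
}

// A clusterer parameterized this way would accept a Divergence instead of
// hard-coding Euclidean distance, e.g. run(data, k, SquaredEuclidean).
{code}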



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2429) Hierarchical Implementation of KMeans

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2429:
-
Labels: clustering  (was: )

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: clustering
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, 
> such as negative dot product or cosine, is necessary.
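
A rough sketch of the first option listed above (top-down, recursive application of
KMeans), reusing MLlib's existing KMeans.train to bisect the largest leaf until a
target number of leaves is reached; caching, weighting, and stopping criteria are
omitted, and nothing here is a proposed API:

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object TopDownKMeansSketch {
  def bisect(data: RDD[Vector], targetLeaves: Int, maxIterations: Int = 20): Seq[RDD[Vector]] = {
    var leaves = Seq(data)
    while (leaves.size < targetLeaves) {
      // Split the largest leaf into two child clusters with k = 2.
      val largest = leaves.maxBy(_.count())
      val model = KMeans.train(largest, 2, maxIterations)
      val children = (0 until 2).map(i => largest.filter(v => model.predict(v) == i))
      leaves = leaves.filterNot(_ eq largest) ++ children
    }
    leaves
  }
}
{code}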



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3439) Add Canopy Clustering Algorithm

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3439:
-
Labels: clustering  (was: )

> Add Canopy Clustering Algorithm
> ---
>
> Key: SPARK-3439
> URL: https://issues.apache.org/jira/browse/SPARK-3439
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yu Ishikawa
>Assignee: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: clustering
>
> The canopy clustering algorithm is an unsupervised pre-clustering algorithm. 
> It is often used as a preprocessing step for the K-means algorithm or the 
> Hierarchical clustering algorithm. It is intended to speed up clustering 
> operations on large data sets, where using another algorithm directly may be 
> impractical due to the size of the data set.
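
A small local sketch of the classic canopy pass with loose/tight thresholds T1 > T2:
pick a point as a canopy center, add every point within T1 to that canopy, and drop
points within T2 from further consideration. A distributed version would typically run
this per partition and merge; the code is illustrative only, not a proposed MLlib API:

{code}
import scala.collection.mutable.ArrayBuffer

object CanopySketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def canopies(points: Seq[Point], t1: Double, t2: Double): Seq[(Point, Seq[Point])] = {
    require(t1 > t2, "T1 must be larger than T2")
    var remaining = points
    val result = ArrayBuffer.empty[(Point, Seq[Point])]
    while (remaining.nonEmpty) {
      val center  = remaining.head
      val members = remaining.filter(p => dist(center, p) < t1)   // loose threshold
      result += ((center, members))
      remaining = remaining.filterNot(p => dist(center, p) < t2)  // tight threshold
    }
    result
  }
}
{code}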



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3218) K-Means clusterer can fail on degenerate data

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3218:
-
Labels: clustering  (was: )

> K-Means clusterer can fail on degenerate data
> -
>
> Key: SPARK-3218
> URL: https://issues.apache.org/jira/browse/SPARK-3218
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Assignee: Derrick Burns
>  Labels: clustering
>
> The KMeans parallel implementation selects points to be cluster centers with 
> probability weighted by their distance to cluster centers.  However, if there 
> are fewer than k DISTINCT points in the data set, this approach will fail.  
> Further, the recent checkin to work around this problem results in selection 
> of the same point repeatedly as a cluster center. 
> The fix is to allow fewer than k cluster centers to be selected.  This 
> requires several changes to the code, as the number of cluster centers is 
> woven into the implementation.
> I have a version of the code that addresses this problem, AND generalizes the 
> distance metric.  However, I see that there are literally hundreds of 
> outstanding pull requests.  If someone will commit to working with me to 
> sponsor the pull request, I will create it.
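
A minimal sketch of the guard described above: never ask for more initial centers than
there are distinct points, so initialization cannot pick the same point repeatedly.
This only illustrates the idea; the real fix has to thread the reduced k through the
rest of the implementation. Points are modeled as Seq[Double] here purely so that
distinct() uses structural equality:

{code}
import org.apache.spark.rdd.RDD

object InitialCenters {
  type Point = Seq[Double]

  def choose(data: RDD[Point], k: Int, seed: Long): Array[Point] = {
    val distinct   = data.distinct()
    val effectiveK = math.min(k.toLong, distinct.count()).toInt  // at most k, may be fewer
    distinct.takeSample(withReplacement = false, effectiveK, seed)
  }
}
{code}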



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3220) K-Means clusterer should perform K-Means initialization in parallel

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3220:
-
Labels: clustering  (was: )

> K-Means clusterer should perform K-Means initialization in parallel
> ---
>
> Key: SPARK-3220
> URL: https://issues.apache.org/jira/browse/SPARK-3220
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Derrick Burns
>  Labels: clustering
>
> The LocalKMeans method should be replaced with a parallel implementation.  As 
> it stands now, it becomes a bottleneck for large data sets.
> I have implemented this functionality in my version of the clusterer.  
> However, I see that there are hundreds of outstanding pull requests.  If 
> someone on the team wants to sponsor the pull request, I will create one.  
> Otherwise, I will just maintain my own private fork of the clusterer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4039) KMeans support sparse cluster centers

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4039:
-
Labels: clustering  (was: )

> KMeans support sparse cluster centers
> -
>
> Key: SPARK-4039
> URL: https://issues.apache.org/jira/browse/SPARK-4039
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Antoine Amend
>  Labels: clustering
>
> When the number of features is not known, it might be quite helpful to create 
> sparse vectors using HashingTF.transform. KMeans transforms center vectors 
> to dense vectors 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
>  therefore leading to OutOfMemory (even with small k).
> Any way to keep vectors sparse ?
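
A sketch of the scenario described above: HashingTF produces sparse vectors in a very
large feature space, which are then handed to KMeans. The reported OutOfMemory comes
from KMeans densifying its centers internally; the input path itself stays sparse. The
documents and numbers here are illustrative:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF

object SparseKMeansInput {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sparse-kmeans-input"))
    val docs = sc.parallelize(Seq(
      "spark mllib kmeans".split(" ").toSeq,
      "sparse vectors hashing".split(" ").toSeq))
    val tf = new HashingTF(numFeatures = 1 << 20)   // 2^20-dimensional sparse vectors
    val features = tf.transform(docs).cache()
    val model = KMeans.train(features, 2, 10)       // centers become dense internally
    println(model.clusterCenters.length)
    sc.stop()
  }
}
{code}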



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5272:
-
Labels: clustering  (was: )

> Refactor NaiveBayes to support discrete and continuous labels,features
> --
>
> Key: SPARK-5272
> URL: https://issues.apache.org/jira/browse/SPARK-5272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>  Labels: clustering
>
> This JIRA is to discuss refactoring NaiveBayes in order to support both 
> discrete and continuous labels and features.
> Currently, NaiveBayes supports only discrete labels and features.
> Proposal: Generalize it to support continuous values as well.
> Some items to discuss are:
> * How commonly are continuous labels/features used in practice?  (Is this 
> necessary?)
> * What should the API look like?
> ** E.g., should NB have multiple classes for each type of label/feature, or 
> should it take a general Factor type parameter?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5490:
-
Labels: clustering  (was: )

> KMeans costs can be incorrect if tasks need to be rerun
> ---
>
> Key: SPARK-5490
> URL: https://issues.apache.org/jira/browse/SPARK-5490
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>  Labels: clustering
>
> KMeans uses accumulators to compute the cost of a clustering at each 
> iteration.
> Each time a ShuffleMapTask completes, it increments the accumulators at the 
> driver.  If a task runs twice because of failures, the accumulators get 
> incremented twice.
> KMeans uses accumulators in ShuffleMapTasks.  This means that a task's cost 
> can end up being double-counted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5405) Spark clusterer should support high dimensional data

2015-02-23 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334037#comment-14334037
 ] 

Derrick Burns commented on SPARK-5405:
--

Agreed.

On Mon, Feb 23, 2015 at 2:48 PM, Xiangrui Meng (JIRA) 



> Spark clusterer should support high dimensional data
> 
>
> Key: SPARK-5405
> URL: https://issues.apache.org/jira/browse/SPARK-5405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Derrick Burns
>  Labels: clustering
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> The MLLIB clusterer works well for low  (<200) dimensional data.  However, 
> performance is linear with the number of dimensions.  So, for practical 
> purposes, it is not very useful for high dimensional data.  
> Depending on the data type, one can embed the high dimensional data into 
> lower dimensional spaces in a distance-preserving way.  The Spark clusterer 
> should support such embedding.
> An example implementation that supports high dimensional data is here:
> https://github.com/derrickburns/generalized-kmeans-clustering
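
One distance-preserving embedding of the kind mentioned above is Gaussian random
projection (Johnson-Lindenstrauss): project d-dimensional points onto k random
directions, which approximately preserves pairwise Euclidean distances. The sketch
below shows it as a separate preprocessing step over an RDD; it is one possible
approach, not part of MLlib's clusterer:

{code}
import scala.util.Random
import org.apache.spark.rdd.RDD

object RandomProjection {
  def project(data: RDD[Array[Double]], d: Int, k: Int, seed: Long): RDD[Array[Double]] = {
    val rng = new Random(seed)
    // k x d projection matrix with N(0, 1/k) entries, broadcast to the executors.
    val matrix = Array.fill(k, d)(rng.nextGaussian() / math.sqrt(k))
    val bcMatrix = data.context.broadcast(matrix)
    data.map { p =>
      bcMatrix.value.map(row => row.zip(p).map { case (r, x) => r * x }.sum)
    }
  }
}
{code}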



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5016:
-
Labels: clustering  (was: )

> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---
>
> Key: SPARK-5016
> URL: https://issues.apache.org/jira/browse/SPARK-5016
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>  Labels: clustering
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse 
> computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5927) Modify FPGrowth's partition strategy to reduce transactions in partitions

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-5927.

Resolution: Won't Fix

I'm closing this JIRA per discussion on the PR page.

> Modify FPGrowth's partition strategy to reduce transactions in partitions
> -
>
> Key: SPARK-5927
> URL: https://issues.apache.org/jira/browse/SPARK-5927
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5832:
-
Labels: clustering  (was: )

> Add Affinity Propagation clustering algorithm
> -
>
> Key: SPARK-5832
> URL: https://issues.apache.org/jira/browse/SPARK-5832
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>  Labels: clustering
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5490:
-
Target Version/s: 1.4.0

> KMeans costs can be incorrect if tasks need to be rerun
> ---
>
> Key: SPARK-5490
> URL: https://issues.apache.org/jira/browse/SPARK-5490
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> KMeans uses accumulators to compute the cost of a clustering at each 
> iteration.
> Each time a ShuffleMapTask completes, it increments the accumulators at the 
> driver.  If a task runs twice because of failures, the accumulators get 
> incremented twice.
> KMeans uses accumulators in ShuffleMapTasks.  This means that a task's cost 
> can end up being double-counted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun

2015-02-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334030#comment-14334030
 ] 

Xiangrui Meng commented on SPARK-5490:
--

[~sandyr] This is a bug in core. Could you link the JIRA? 

> KMeans costs can be incorrect if tasks need to be rerun
> ---
>
> Key: SPARK-5490
> URL: https://issues.apache.org/jira/browse/SPARK-5490
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> KMeans uses accumulators to compute the cost of a clustering at each 
> iteration.
> Each time a ShuffleMapTask completes, it increments the accumulators at the 
> driver.  If a task runs twice because of failures, the accumulators get 
> incremented twice.
> KMeans uses accumulators in ShuffleMapTasks.  This means that a task's cost 
> can end up being double-counted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5405) Spark clusterer should support high dimensional data

2015-02-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334027#comment-14334027
 ] 

Xiangrui Meng commented on SPARK-5405:
--

Dimension reduction should be separated from the k-means implementation. We can 
add distance-preserving methods as feature transformers. [~derrickburns] Could 
you update the JIRA title?

> Spark clusterer should support high dimensional data
> 
>
> Key: SPARK-5405
> URL: https://issues.apache.org/jira/browse/SPARK-5405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Derrick Burns
>  Labels: clustering
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> The MLLIB clusterer works well for low  (<200) dimensional data.  However, 
> performance is linear with the number of dimensions.  So, for practical 
> purposes, it is not very useful for high dimensional data.  
> Depending on the data type, one can embed the high dimensional data into 
> lower dimensional spaces in a distance-preserving way.  The Spark clusterer 
> should support such embedding.
> An example implementation that supports high dimensional data is here:
> https://github.com/derrickburns/generalized-kmeans-clustering



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1182) Sort the configuration parameters in configuration.md

2015-02-23 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334025#comment-14334025
 ] 

Brennon York commented on SPARK-1182:
-

Given [~joshrosen]'s comments on the PR making merge-conflict hell, would it be 
better just to scratch this as an issue and close everything out? It's either 
that or deal with all the merge conflicts for any / all backports moving 
forward. Thoughts?

> Sort the configuration parameters in configuration.md
> -
>
> Key: SPARK-1182
> URL: https://issues.apache.org/jira/browse/SPARK-1182
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Reynold Xin
>Assignee: prashant
>Priority: Minor
>
> It is a little bit confusing right now since the config options are all over 
> the place in some arbitrarily sorted order.
> https://github.com/apache/spark/blob/master/docs/configuration.md



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-4039) KMeans support sparse cluster centers

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-4039:
--

> KMeans support sparse cluster centers
> -
>
> Key: SPARK-4039
> URL: https://issues.apache.org/jira/browse/SPARK-4039
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Antoine Amend
>
> When the number of features is not known, it might be quite helpful to create 
> sparse vectors using HashingTF.transform. KMeans transforms center vectors 
> to dense vectors 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
>  therefore leading to OutOfMemory (even with small k).
> Any way to keep vectors sparse ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4039) KMeans support sparse cluster centers

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-4039.

Resolution: Duplicate

> KMeans support sparse cluster centers
> -
>
> Key: SPARK-4039
> URL: https://issues.apache.org/jira/browse/SPARK-4039
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Antoine Amend
>
> When the number of features is not known, it might be quite helpful to create 
> sparse vectors using HashingTF.transform. KMeans transforms center vectors 
> to dense vectors 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
>  therefore leading to OutOfMemory (even with small k).
> Any way to keep vectors sparse ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5405) Spark clusterer should support high dimensional data

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5405:
-
Labels: clustering  (was: )

> Spark clusterer should support high dimensional data
> 
>
> Key: SPARK-5405
> URL: https://issues.apache.org/jira/browse/SPARK-5405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Derrick Burns
>  Labels: clustering
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> The MLLIB clusterer works well for low  (<200) dimensional data.  However, 
> performance is linear with the number of dimensions.  So, for practical 
> purposes, it is not very useful for high dimensional data.  
> Depending on the data type, one can embed the high dimensional data into 
> lower dimensional spaces in a distance-preserving way.  The Spark clusterer 
> should support such embedding.
> An example implementation that supports high dimensional data is here:
> https://github.com/derrickburns/generalized-kmeans-clustering



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-02-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334021#comment-14334021
 ] 

Xiangrui Meng commented on SPARK-5261:
--

Could you try a larger minCount to reduce the vocabulary size?

> In some cases ,The value of word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36)
> {code}
> The average absolute value of the word's vector representation is 60731.8
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(1)
> {code}
> The average  absolute value of the word's vector representation is 0.13889



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-794) Remove sleep() in ClusterScheduler.stop

2015-02-23 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334020#comment-14334020
 ] 

Brennon York commented on SPARK-794:


[~srowen] [~joshrosen] bump on this. Would assume things are stable with the 
removal of the sleep method, but want to double check. Thinking we can close 
this ticket out.

> Remove sleep() in ClusterScheduler.stop
> ---
>
> Key: SPARK-794
> URL: https://issues.apache.org/jira/browse/SPARK-794
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.9.0
>Reporter: Matei Zaharia
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> This temporary change made a while back slows down the unit tests quite a bit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5226:
-
Labels: DBSCAN clustering  (was: DBSCAN)

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. The first candidate, I think, is DBSCAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5940) Graph Loader: refactor + add more formats

2015-02-23 Thread Magellanea (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334013#comment-14334013
 ] 

Magellanea commented on SPARK-5940:
---

[~lukovnikov] Thanks a lot for the reply. Do you know who I can mention/tag in 
this discussion from the Spark/GraphX team?

Thanks

> Graph Loader: refactor + add more formats
> -
>
> Key: SPARK-5940
> URL: https://issues.apache.org/jira/browse/SPARK-5940
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: lukovnikov
>Priority: Minor
>
> Currently, the only graph loader is GraphLoader.edgeListFile. [SPARK-5280] 
> adds an RDF graph loader.
> However, as Takeshi Yamamuro suggested on github [SPARK-5280], 
> https://github.com/apache/spark/pull/4650, it might be interesting to make 
> GraphLoader an interface with several implementations for different formats. 
> And maybe it's good to make a façade graph loader that provides a unified 
> interface to all loaders.
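
A rough sketch of the interface-plus-façade idea: one trait, one implementation per
format, and a front door that dispatches on file extension. The edge-list branch
delegates to the existing GraphLoader.edgeListFile; the RDF branch is a placeholder
for the SPARK-5280 work, and nothing here is an agreed-upon API:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader}

trait FormatGraphLoader {
  def load(sc: SparkContext, path: String): Graph[Int, Int]
}

object EdgeListLoader extends FormatGraphLoader {
  def load(sc: SparkContext, path: String): Graph[Int, Int] =
    GraphLoader.edgeListFile(sc, path)            // existing GraphX loader
}

object RdfLoader extends FormatGraphLoader {
  def load(sc: SparkContext, path: String): Graph[Int, Int] =
    sys.error("RDF loading would be implemented here (see SPARK-5280)")
}

object Graphs {
  // Facade: pick a loader from the file extension.
  def load(sc: SparkContext, path: String): Graph[Int, Int] =
    if (path.endsWith(".nt") || path.endsWith(".rdf")) RdfLoader.load(sc, path)
    else EdgeListLoader.load(sc, path)
}
{code}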



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5010) native openblas library doesn't work: undefined symbol: cblas_dscal

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-5010.

Resolution: Not a Problem

I'm closing this issue because it is an upstream problem with the native BLAS library.

> native openblas library doesn't work: undefined symbol: cblas_dscal
> ---
>
> Key: SPARK-5010
> URL: https://issues.apache.org/jira/browse/SPARK-5010
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: standalone
>Reporter: Tomas Hudik
>Priority: Minor
>  Labels: mllib, openblas
>
> 1. compiled and installed open blas library
> 2, ln -s libopenblas_sandybridgep-r0.2.13.so /usr/lib/libblas.so.3
> 3. compiled and built spark:
> mvn -Pnetlib-lgpl -DskipTests clean compile package
> 4. run: bin/run-example  mllib.LinearRegression 
> data/mllib/sample_libsvm_data.txt
> 14/12/30 18:39:57 INFO BlockManagerMaster: Trying to register BlockManager
> 14/12/30 18:39:57 INFO BlockManagerMasterActor: Registering block manager 
> localhost:34297 with 265.1 MB RAM, BlockManagerId(, localhost, 34297)
> 14/12/30 18:39:57 INFO BlockManagerMaster: Registered BlockManager
> 14/12/30 18:39:58 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/12/30 18:39:58 WARN LoadSnappy: Snappy native library not loaded
> Training: 80, test: 20.
> /usr/local/lib/jdk1.8.0//bin/java: symbol lookup error: 
> /tmp/jniloader1826801168744171087netlib-native_system-linux-x86_64.so: 
> undefined symbol: cblas_dscal
> I followed the guide at https://spark.apache.org/docs/latest/mllib-guide.html 
> (section on dependencies).
> Am I missing something?
> How to force Spark to use openblas library?
> Thanks, Tomas



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


