[jira] [Updated] (SPARK-2227) Support dfs command
[ https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2227: --- Priority: Minor (was: Major) Support dfs command - Key: SPARK-2227 URL: https://issues.apache.org/jira/browse/SPARK-2227 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Priority: Minor Potentially just delegate to Hive. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-704) ConnectionManager sometimes cannot detect loss of sending connections
[ https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039742#comment-14039742 ] Mridul Muralidharan commented on SPARK-704: --- If remote node goes down, SendingConnection would be notified since it is also registered for read events (to handle precisely this case actually). ReceivingConnection would anyway be notified since it is waiting on reads on that socket. This, of course, assumes that the local node detects remote node failure at the TCP layer. Problems come in when ConnectionManager sometimes cannot detect loss of sending connections - Key: SPARK-704 URL: https://issues.apache.org/jira/browse/SPARK-704 Project: Spark Issue Type: Bug Reporter: Charles Reiss Assignee: Henry Saputra ConnectionManager currently does not detect when SendingConnections disconnect except if it is trying to send through them. As a result, a node failure just after a connection is initiated but before any acknowledgement messages can be sent may result in a hang. ConnectionManager has code intended to detect this case by detecting the failure of a corresponding ReceivingConnection, but this code assumes that the remote host:port of the ReceivingConnection is the same as the ConnectionManagerId, which is almost never true. Additionally, there does not appear to be any reason to assume a corresponding ReceivingConnection will exist. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-704) ConnectionManager sometimes cannot detect loss of sending connections
[ https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039742#comment-14039742 ] Mridul Muralidharan edited comment on SPARK-704 at 6/21/14 9:10 AM: If remote node goes down, SendingConnection would be notified since it is also registered for read events (to handle precisely this case actually). ReceivingConnection would be notified since it is waiting on reads on that socket. This, of course, assumes that the local node detects remote node failure at the TCP layer. Problems come in when this is not detected due to no activity on the socket (at app and socket level - keepalive timeout, etc). Usually this is detected via application-level ping/keepalive messages: not sure if we want to introduce that into spark ... was (Author: mridulm80): If remote node goes down, SendingConnection would be notified since it is also registered for read events (to handle precisely this case actually). ReceivingConnection would anyway be notified since it is waiting on reads on that socket. This, of course, assumes that the local node detects remote node failure at the TCP layer. Problems come in when ConnectionManager sometimes cannot detect loss of sending connections - Key: SPARK-704 URL: https://issues.apache.org/jira/browse/SPARK-704 Project: Spark Issue Type: Bug Reporter: Charles Reiss Assignee: Henry Saputra ConnectionManager currently does not detect when SendingConnections disconnect except if it is trying to send through them. As a result, a node failure just after a connection is initiated but before any acknowledgement messages can be sent may result in a hang. ConnectionManager has code intended to detect this case by detecting the failure of a corresponding ReceivingConnection, but this code assumes that the remote host:port of the ReceivingConnection is the same as the ConnectionManagerId, which is almost never true. 
Additionally, there does not appear to be any reason to assume a corresponding ReceivingConnection will exist. -- This message was sent by Atlassian JIRA (v6.2#6252)
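The mechanism described in the comment can be sketched in isolation: a channel registered for read events learns of an orderly remote close when the selector wakes and read() returns -1. The sketch below uses plain java.nio rather than Spark's actual ConnectionManager code, and it only covers the clean-close case; a node that dies without ever sending a FIN produces no read event at all, which is exactly the undetected failure mode discussed above.

```scala
import java.net.InetSocketAddress
import java.nio.ByteBuffer
import java.nio.channels.{SelectionKey, Selector, ServerSocketChannel, SocketChannel}

object ReadEventCloseDetection {
  // Returns the result of read() on a channel whose peer closed cleanly.
  def readAfterPeerClose(): Int = {
    val server = ServerSocketChannel.open()
    server.socket().bind(new InetSocketAddress("127.0.0.1", 0))
    val client = SocketChannel.open(
      new InetSocketAddress("127.0.0.1", server.socket().getLocalPort))
    val accepted = server.accept()

    // Register the "sending" side for read events, as SendingConnection does.
    client.configureBlocking(false)
    val selector = Selector.open()
    client.register(selector, SelectionKey.OP_READ)

    accepted.close()      // orderly close: the peer sends a FIN
    selector.select(5000) // the FIN makes the channel readable
    val n = client.read(ByteBuffer.allocate(16))
    selector.close(); client.close(); server.close()
    n                     // -1 signals end-of-stream
  }
}
```

Note that this is precisely why the TCP-layer caveat matters: the read event fires only if the OS delivers the close; with no traffic and no FIN, only an application-level ping/keepalive would notice.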
[jira] [Commented] (SPARK-1568) Spark 0.9.0 hangs reading s3
[ https://issues.apache.org/jira/browse/SPARK-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039797#comment-14039797 ] Sean Owen commented on SPARK-1568: -- Sam, did the other recent changes to S3 deps resolve this, do you think? Spark 0.9.0 hangs reading s3 Key: SPARK-1568 URL: https://issues.apache.org/jira/browse/SPARK-1568 Project: Spark Issue Type: Bug Reporter: sam I've tried several jobs now and many of the tasks complete, then it gets stuck and just hangs. The exact same jobs function perfectly fine if I distcp to hdfs first and read from hdfs. Many thanks -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039802#comment-14039802 ] Sean Owen commented on SPARK-2223: -- On a latest-generation Macbook Pro here, a full 'mvn clean install' takes 91:50 without zinc. With zinc, it's 51:02. Building and running tests with maven is extremely slow --- Key: SPARK-2223 URL: https://issues.apache.org/jira/browse/SPARK-2223 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Thomas Graves For some reason using maven with Spark is extremely slow. Building and running tests takes way longer than other projects I have used that use maven. We should investigate to see why. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1339) Build error: org.eclipse.paho:mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039804#comment-14039804 ] Sean Owen commented on SPARK-1339: -- (Just cruising some old issues) I can't reproduce this, and this is a general symptom of a repo not being accessible. It actually has nothing to do with mqtt-client per se. Also, we've fixed some repo issues along the way. Build error: org.eclipse.paho:mqtt-client - Key: SPARK-1339 URL: https://issues.apache.org/jira/browse/SPARK-1339 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.0 Reporter: Ken Williams Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. The Maven error is: {code} [ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus {code} My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4. Is there an additional Maven repository I should add or something? If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, but I would really like to get the examples working because I haven't played with Spark before. -- This message was sent by Atlassian JIRA (v6.2#6252)
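Since the reporter asks whether an additional Maven repository is needed: as the comment notes, the failure is a symptom of the configured repository being unreachable rather than anything specific to mqtt-client. For illustration only, an extra repository can be declared in pom.xml like this (the id and url below are placeholders, not Spark's actual repository configuration):

```xml
<!-- Sketch only: declares an additional repository Maven will search
     for artifacts such as org.eclipse.paho:mqtt-client. -->
<repositories>
  <repository>
    <id>extra-repo</id>
    <url>https://repo1.maven.org/maven2</url>
  </repository>
</repositories>
```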
[jira] [Commented] (SPARK-1138) Spark 0.9.0 does not work with Hadoop / HDFS
[ https://issues.apache.org/jira/browse/SPARK-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039805#comment-14039805 ] Sean Owen commented on SPARK-1138: -- This is no longer observed in the unit tests. The comments here say it was a Netty dependency problem, and I know that has since been cleaned up. Suggest this is resolved then? Spark 0.9.0 does not work with Hadoop / HDFS Key: SPARK-1138 URL: https://issues.apache.org/jira/browse/SPARK-1138 Project: Spark Issue Type: Bug Reporter: Sam Abeyratne UPDATE: This problem is certainly related to trying to use Spark 0.9.0 and the latest cloudera Hadoop / HDFS in the same jar. It seems no matter how I fiddle with the deps, they do not play nice together. I'm getting a java.util.concurrent.TimeoutException when trying to create a spark context with 0.9. I cannot, whatever I do, change the timeout. I've tried using System.setProperty, the SparkConf mechanism of creating a SparkContext and the -D flags when executing my jar. I seem to be able to run simple jobs from the spark-shell OK, but my more complicated jobs require external libraries so I need to build jars and execute them. Some code that causes this: println("Creating config") val conf = new SparkConf() .setMaster(clusterMaster) .setAppName("MyApp") .setSparkHome(sparkHome) .set("spark.akka.askTimeout", parsed.getOrElse("timeouts", "100")) .set("spark.akka.timeout", parsed.getOrElse("timeouts", "100")) println("Creating sc") implicit val sc = new SparkContext(conf) The output: Creating config Creating sc log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. 
[ERROR] [02/26/2014 11:05:25.491] [main] [Remoting] Remoting error: [Startup timed out] [ akka.remote.RemoteTransportException: Startup timed out at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129) at akka.remote.Remoting.start(Remoting.scala:191) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126) at org.apache.spark.SparkContext.init(SparkContext.scala:139) at com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40) at com.adbrain.accuracy.EvaluateAdtruthIDs.main(EvaluateAdtruthIDs.scala) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) ... 
11 more ] Exception in thread main java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126) at org.apache.spark.SparkContext.init(SparkContext.scala:139) at com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40) at
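For reference, the TimeoutException in the trace above is thrown by scala.concurrent.Await when a Future does not complete within the given duration; here, Remoting's startup future never completed. The "[1 milliseconds]" in the message also hints that the intended 100-unit timeout may never have taken effect. A minimal sketch of the same failure mode in plain Scala (not Spark code; the never-completed promise stands in for the hung remoting startup):

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object AwaitTimeoutDemo {
  // A promise that is never completed stands in for a remoting
  // subsystem that hangs during startup.
  def timesOut(): Boolean = {
    val never = Promise[Unit]().future
    try {
      Await.result(never, 1.millisecond) // gives up after 1 ms
      false
    } catch {
      case _: TimeoutException => true   // "Futures timed out after ..."
    }
  }
}
```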
[jira] [Commented] (SPARK-1568) Spark 0.9.0 hangs reading s3
[ https://issues.apache.org/jira/browse/SPARK-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039854#comment-14039854 ] sam commented on SPARK-1568: When we upgrade to 1.0.0 I'll test this. This particular problem was from quite a while back, when our cluster was quite different from what it is now. At the moment we get the jets3 thing, which is supposed to go away in 1.0.0. Spark 0.9.0 hangs reading s3 Key: SPARK-1568 URL: https://issues.apache.org/jira/browse/SPARK-1568 Project: Spark Issue Type: Bug Reporter: sam I've tried several jobs now and many of the tasks complete, then it gets stuck and just hangs. The exact same jobs function perfectly fine if I distcp to hdfs first and read from hdfs. Many thanks -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2212) Hash Outer Joins
[ https://issues.apache.org/jira/browse/SPARK-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2212: Summary: Hash Outer Joins (was: HashJoin) Hash Outer Joins Key: SPARK-2212 URL: https://issues.apache.org/jira/browse/SPARK-2212 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2212) Hash Outer Joins
[ https://issues.apache.org/jira/browse/SPARK-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2212: Priority: Major (was: Critical) Hash Outer Joins Key: SPARK-2212 URL: https://issues.apache.org/jira/browse/SPARK-2212 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2214) Broadcast Join (aka map join)
[ https://issues.apache.org/jira/browse/SPARK-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-2214. --- Resolution: Duplicate Broadcast Join (aka map join) - Key: SPARK-2214 URL: https://issues.apache.org/jira/browse/SPARK-2214 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1800) Add broadcast hash join operator
[ https://issues.apache.org/jira/browse/SPARK-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1800: Issue Type: Sub-task (was: Improvement) Parent: SPARK-2211 Add broadcast hash join operator Key: SPARK-1800 URL: https://issues.apache.org/jira/browse/SPARK-1800 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2195) Parquet extraMetadata can contain key information
[ https://issues.apache.org/jira/browse/SPARK-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039915#comment-14039915 ] Michael Armbrust commented on SPARK-2195: - Yeah, thanks for taking care of this so quickly! Parquet extraMetadata can contain key information - Key: SPARK-2195 URL: https://issues.apache.org/jira/browse/SPARK-2195 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Priority: Blocker Fix For: 1.0.1, 1.1.0 {code} 14/06/19 01:52:05 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: file:/Users/pat/Projects/spark-summit-training-2014/usb/data/wiki-parquet/part-r-1.parquet start: 0 length: 24971040 hosts: [localhost] blocks: 1 requestedSchema: same as file fileSchema: message root { optional int32 id; optional binary title; optional int64 modified; optional binary text; optional binary username; } extraMetadata: {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true), StructField(title,StringType,true), StructField(modified,LongType,true), StructField(text,StringType,true), StructField(username,StringType,true))), path= MY AWS KEYS!!! } readSupportMetadata: {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true), StructField(title,StringType,true), StructField(modified,LongType,true), StructField(text,StringType,true), StructField(username,StringType,true))), path= MY AWS KEYS ***}} {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2195) Parquet extraMetadata can contain key information
[ https://issues.apache.org/jira/browse/SPARK-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2195. - Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Parquet extraMetadata can contain key information - Key: SPARK-2195 URL: https://issues.apache.org/jira/browse/SPARK-2195 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Priority: Blocker Fix For: 1.0.1, 1.1.0 {code} 14/06/19 01:52:05 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: file:/Users/pat/Projects/spark-summit-training-2014/usb/data/wiki-parquet/part-r-1.parquet start: 0 length: 24971040 hosts: [localhost] blocks: 1 requestedSchema: same as file fileSchema: message root { optional int32 id; optional binary title; optional int64 modified; optional binary text; optional binary username; } extraMetadata: {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true), StructField(title,StringType,true), StructField(modified,LongType,true), StructField(text,StringType,true), StructField(username,StringType,true))), path= MY AWS KEYS!!! } readSupportMetadata: {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true), StructField(title,StringType,true), StructField(modified,LongType,true), StructField(text,StringType,true), StructField(username,StringType,true))), path= MY AWS KEYS ***}} {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-2227) Support dfs command
[ https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-2227: - Sorry, I'll reopen this since you already have a PR with just this change. Support dfs command - Key: SPARK-2227 URL: https://issues.apache.org/jira/browse/SPARK-2227 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Minor Potentially just delegate to Hive. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2220) Fix remaining Hive Commands
[ https://issues.apache.org/jira/browse/SPARK-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2220: Description: None of the following have an execution plan: {code} private[hive] case class ShellCommand(cmd: String) extends Command private[hive] case class SourceCommand(filePath: String) extends Command private[hive] case class AddFile(filePath: String) extends Command {code} dfs is being fixed in a related PR. was: None of the following have an execution plan: {code} private[hive] case class DfsCommand(cmd: String) extends Command private[hive] case class ShellCommand(cmd: String) extends Command private[hive] case class SourceCommand(filePath: String) extends Command private[hive] case class AddFile(filePath: String) extends Command {code} Fix remaining Hive Commands --- Key: SPARK-2220 URL: https://issues.apache.org/jira/browse/SPARK-2220 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Fix For: 1.1.0 None of the following have an execution plan: {code} private[hive] case class ShellCommand(cmd: String) extends Command private[hive] case class SourceCommand(filePath: String) extends Command private[hive] case class AddFile(filePath: String) extends Command {code} dfs is being fixed in a related PR. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
[ https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039937#comment-14039937 ] Ted Malaska commented on SPARK-1478: OK I have made the changes requested. But I had to do it in a different pull request. Here is the new pull request link https://github.com/apache/spark/pull/1168 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915 --- Key: SPARK-1478 URL: https://issues.apache.org/jira/browse/SPARK-1478 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor Fix For: 1.1.0 Flume-1915 added support for compression over the wire from avro sink to avro source. I would like to add this functionality to the FlumeReceiver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1675) Make clear whether computePrincipalComponents centers data
[ https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039944#comment-14039944 ] Sean Owen commented on SPARK-1675: -- Is this still valid? Looking at the code, PCA is computed as the SVD of the covariance matrix. The means implicitly don't matter: they are not explicitly subtracted, and do not need to be. Or is there still a doc change desired? Make clear whether computePrincipalComponents centers data -- Key: SPARK-1675 URL: https://issues.apache.org/jira/browse/SPARK-1675 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
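The claim that the means "implicitly don't matter" comes down to covariance being shift-invariant: mean subtraction happens inside the covariance computation itself, so translating the input columns leaves the covariance matrix, and hence the principal components, unchanged. A small illustration in plain Scala (not the MLlib implementation):

```scala
object CovarianceShiftInvariance {
  // Sample covariance of two equal-length sequences; the means are
  // subtracted here, which is why callers need not pre-center the data.
  def cov(x: Seq[Double], y: Seq[Double]): Double = {
    val mx = x.sum / x.size
    val my = y.sum / y.size
    x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum / (x.size - 1)
  }
}
```

Shifting every element of a column by a constant shifts its mean by the same constant, so the deviations, and the covariance, are identical.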
[jira] [Commented] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings
[ https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039946#comment-14039946 ] Patrick Wendell commented on SPARK-1392: I mentioned this on the pull request, but I think this was an instance of SPARK-1777. I'm running some tests locally on the pull request there to determine whether that was the case. Local spark-shell Runs Out of Memory With Default Settings -- Key: SPARK-1392 URL: https://issues.apache.org/jira/browse/SPARK-1392 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3 Reporter: Pat McDonough Using the spark-0.9.0 Hadoop2 binary from the project download page, running the spark-shell locally in out of the box configuration, and attempting to cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC overhead limit exceeded You can work around the issue by either decreasing spark.storage.memoryFraction or increasing SPARK_MEM -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1846) RAT checks should exclude logs/ directory
[ https://issues.apache.org/jira/browse/SPARK-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039945#comment-14039945 ] Sean Owen commented on SPARK-1846: -- Just looking over some old JIRAs. This appears to be resolved already. logs is excluded. RAT checks should exclude logs/ directory - Key: SPARK-1846 URL: https://issues.apache.org/jira/browse/SPARK-1846 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Andrew Ash When there are logs in the logs/ directory, the rat check from ./dev/check-license fails. ``` aash@aash-mbp ~/git/spark$ find logs -type f logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5 logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out 
logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2 aash@aash-mbp ~/git/spark$ ./dev/check-license Could not find Apache license headers in the following files: !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2 !? 
/Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2 aash@aash-mbp ~/git/spark$ ``` -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1804) Mark 0.9.1 as released in JIRA
[ https://issues.apache.org/jira/browse/SPARK-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039947#comment-14039947 ] Sean Owen commented on SPARK-1804: -- Looks like this can be closed as resolved. https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:versions-panel Mark 0.9.1 as released in JIRA -- Key: SPARK-1804 URL: https://issues.apache.org/jira/browse/SPARK-1804 Project: Spark Issue Type: Task Components: Documentation, Project Infra Affects Versions: 0.9.1 Reporter: Stevo Slavic Priority: Trivial 0.9.1 has been released but is labeled as unreleased in SPARK JIRA project. Please have it marked as released. Also please document that step in release process. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1803) Rename test resources to be compatible with Windows FS
[ https://issues.apache.org/jira/browse/SPARK-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039948#comment-14039948 ] Sean Owen commented on SPARK-1803: -- PR was committed so this is another that seems to be closeable. Rename test resources to be compatible with Windows FS -- Key: SPARK-1803 URL: https://issues.apache.org/jira/browse/SPARK-1803 Project: Spark Issue Type: Task Components: Windows Affects Versions: 0.9.1 Reporter: Stevo Slavic Priority: Trivial {{git clone}} of master branch and then {{git status}} on Windows reports untracked files: {noformat} # Untracked files: # (use git add <file>... to include in what will be committed) # # sql/hive/src/test/resources/golden/Column pruning # sql/hive/src/test/resources/golden/Partition pruning # sql/hive/src/test/resources/golden/Partiton pruning {noformat} Actual issue is that several files under {{sql/hive/src/test/resources/golden}} directory have a colon in the name, which is an invalid character in file names on Windows. Please have these files renamed to a Windows compatible file name. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1046) Enable to build behind a proxy.
[ https://issues.apache.org/jira/browse/SPARK-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039950#comment-14039950 ] Sean Owen commented on SPARK-1046: -- Is this stale / resolved? I don't see this in the code at this point. Enable to build behind a proxy. --- Key: SPARK-1046 URL: https://issues.apache.org/jira/browse/SPARK-1046 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.8.1 Reporter: Kousuke Saruta Priority: Minor I tried to build spark-0.8.1 behind a proxy and failed although I set http/https.proxyHost, proxyPort, proxyUser, proxyPassword. I found it's caused by accessing github using the git protocol (git://). The URL is hard-coded in SparkPluginBuild.scala as follows. {code} lazy val junitXmlListener = uri("git://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce") {code} After I rewrote the URL as follows, I could build successfully. {code} lazy val junitXmlListener = uri("https://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce") {code} I think we should be able to build whether we are behind a proxy or not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1675) Make clear whether computePrincipalComponents requires centered data
[ https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-1675: -- Summary: Make clear whether computePrincipalComponents requires centered data (was: Make clear whether computePrincipalComponents centers data) Make clear whether computePrincipalComponents requires centered data Key: SPARK-1675 URL: https://issues.apache.org/jira/browse/SPARK-1675 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-721) Fix remaining deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039951#comment-14039951 ] Sean Owen commented on SPARK-721: - This appears to be resolved as I don't think these warnings have been in the build for a while. Fix remaining deprecation warnings -- Key: SPARK-721 URL: https://issues.apache.org/jira/browse/SPARK-721 Project: Spark Issue Type: Improvement Affects Versions: 0.7.1 Reporter: Josh Rosen Assignee: Gary Struthers Priority: Minor Labels: Starter The recent patch to re-enable deprecation warnings fixed many of them, but there's still a few left; it would be nice to fix them. For example, here's one in RDDSuite: {code} [warn] /Users/joshrosen/Documents/spark/spark/core/src/test/scala/spark/RDDSuite.scala:32: method mapPartitionsWithSplit in class RDD is deprecated: use mapPartitionsWithIndex [warn] val partitionSumsWithSplit = nums.mapPartitionsWithSplit { [warn] ^ [warn] one warning found {code} Also, it looks like Scala 2.9 added a second deprecatedSince parameter to @deprecated. We didn't fill this in, which causes some additional warnings: {code} [warn] /Users/joshrosen/Documents/spark/spark/core/src/main/scala/spark/RDD.scala:370: @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated("use mapPartitionsWithIndex") [warn] ^ [warn] one warning found {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
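For reference, the two-argument form that silences the second class of warning supplies both a message and a "since" version; the single-argument form is what produced the warning quoted in the issue. A standalone sketch (the method names mirror the warning above, but this is illustrative code, not Spark's RDD):

```scala
object DeprecatedDemo {
  def mapPartitionsWithIndex(n: Int): Int = n + 1

  // Scala 2.9+ form: @deprecated(message, since). Writing only
  // @deprecated("use mapPartitionsWithIndex") triggers the
  // "@deprecated now takes two arguments" warning.
  @deprecated("use mapPartitionsWithIndex", "0.7.0")
  def mapPartitionsWithSplit(n: Int): Int = mapPartitionsWithIndex(n)
}
```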
[jira] [Commented] (SPARK-1675) Make clear whether computePrincipalComponents requires centered data
[ https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039952#comment-14039952 ] Sandy Ryza commented on SPARK-1675: --- I think it still wouldn't hurt to add a remark that input data doesn't need to be centered. Should have marked this trivial when I filed it - not a big deal either way. Make clear whether computePrincipalComponents requires centered data Key: SPARK-1675 URL: https://issues.apache.org/jira/browse/SPARK-1675 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Trivial -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1675) Make clear whether computePrincipalComponents requires centered data
[ https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-1675: -- Priority: Trivial (was: Major) Make clear whether computePrincipalComponents requires centered data Key: SPARK-1675 URL: https://issues.apache.org/jira/browse/SPARK-1675 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Trivial -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings
[ https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039953#comment-14039953 ] Patrick Wendell commented on SPARK-1392: Okay great, I confirmed this is fixed by SPARK-1777. I tested as follows: {code} SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly/assembly sc.textFile("/tmp/wiki_links").cache.count {code} The wiki_links file was downloaded and extracted from here: This worked with the proposed patch but failed with the default build. Local spark-shell Runs Out of Memory With Default Settings -- Key: SPARK-1392 URL: https://issues.apache.org/jira/browse/SPARK-1392 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3 Reporter: Pat McDonough Using the spark-0.9.0 Hadoop2 binary from the project download page, running the spark-shell locally in the out-of-the-box configuration, and attempting to cache all the attached data, Spark OOMs with: java.lang.OutOfMemoryError: GC overhead limit exceeded You can work around the issue by either decreasing spark.storage.memoryFraction or increasing SPARK_MEM -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings
[ https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039953#comment-14039953 ] Patrick Wendell edited comment on SPARK-1392 at 6/21/14 9:15 PM: - Okay great, I confirmed this is fixed by SPARK-1777. I tested as follows: {code} SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly/assembly sc.textFile("/tmp/wiki_links").cache.count {code} The wiki_links file was downloaded and extracted from here: https://drive.google.com/file/d/0BwrkCxCycBCyTmlWYXp0MmdEakk/edit?usp=sharing This worked with the proposed patch but failed with the default build. was (Author: pwendell): Okay great, I confirmed this is fixed by SPARK-1777. I tested as follows: {code} SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly/assembly sc.textFile("/tmp/wiki_links").cache.count {code} The wiki_links file was downloaded and extracted from here: This worked with the proposed patch but failed with the default build. Local spark-shell Runs Out of Memory With Default Settings -- Key: SPARK-1392 URL: https://issues.apache.org/jira/browse/SPARK-1392 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3 Reporter: Pat McDonough Using the spark-0.9.0 Hadoop2 binary from the project download page, running the spark-shell locally in the out-of-the-box configuration, and attempting to cache all the attached data, Spark OOMs with: java.lang.OutOfMemoryError: GC overhead limit exceeded You can work around the issue by either decreasing spark.storage.memoryFraction or increasing SPARK_MEM -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings
[ https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1392. Resolution: Duplicate Local spark-shell Runs Out of Memory With Default Settings -- Key: SPARK-1392 URL: https://issues.apache.org/jira/browse/SPARK-1392 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3 Reporter: Pat McDonough Using the spark-0.9.0 Hadoop2 binary from the project download page, running the spark-shell locally in out of the box configuration, and attempting to cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC overhead limit exceeded You can work around the issue by either decreasing spark.storage.memoryFraction or increasing SPARK_MEM -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1996) Remove use of special Maven repo for Akka
[ https://issues.apache.org/jira/browse/SPARK-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039965#comment-14039965 ] Sean Owen commented on SPARK-1996: -- PR: https://github.com/apache/spark/pull/1170 Remove use of special Maven repo for Akka - Key: SPARK-1996 URL: https://issues.apache.org/jira/browse/SPARK-1996 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Matei Zaharia Fix For: 1.0.1 According to http://doc.akka.io/docs/akka/2.3.3/intro/getting-started.html Akka is now published to Maven Central, so our documentation and POM files don't need to use the old Akka repo. It will be one less step for users to worry about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted
Baoxu Shi created SPARK-2228: Summary: onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted Key: SPARK-2228 URL: https://issues.apache.org/jira/browse/SPARK-2228 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Baoxu Shi We are using `saveAsObjectFile` and `objectFile` to cut off lineage during iterative computing, but after several hundred iterations there will be a `NoSuchElementsError`. We checked the code and located the problem at `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is called, the `stageId` cannot be found in `stageIdToPool`, but it does exist in other HashMaps. So we think `onStageSubmitted` was not properly called. Spark did add a stage but failed to send the message to listeners. When sending the `finish` message to listeners, the error occurs. This problem causes a huge number of `active stages` showing in the Spark UI, which is really annoying, but it may not affect the final result, according to my test code. I'm willing to help solve this problem, any idea which part I should change? I assume `org.apache.spark.scheduler.SparkListenerBus` has something to do with it but it looks fine to me. FYI, here is the test code that reproduces the problem. I do not see a code field in the system so I put the code in a gist: https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted
[ https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baoxu Shi updated SPARK-2228: - Description: We are using `saveAsObjectFile` and `objectFile` to cut off lineage during iterative computing, but after several hundred iterations there will be a `NoSuchElementsError`. We checked the code and located the problem at `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is called, the `stageId` cannot be found in `stageIdToPool`, but it does exist in other HashMaps. So we think `onStageSubmitted` was not properly called. Spark did add a stage but failed to send the message to listeners. When sending the `finish` message to listeners, the error occurs. This problem causes a huge number of `active stages` showing in the Spark UI, which is really annoying, but it may not affect the final result, according to my test code. I'm willing to help solve this problem, any idea which part I should change? I assume `org.apache.spark.scheduler.SparkListenerBus` has something to do with it but it looks fine to me. FYI, here is the test code that reproduces the problem. I do not know how to put code here with highlighting, so I put the code in a gist to keep the issue clean: https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd was: We are using `saveAsObjectFile` and `objectFile` to cut off lineage during iterative computing, but after several hundred iterations there will be a `NoSuchElementsError`. We checked the code and located the problem at `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is called, the `stageId` cannot be found in `stageIdToPool`, but it does exist in other HashMaps. So we think `onStageSubmitted` was not properly called. Spark did add a stage but failed to send the message to listeners. When sending the `finish` message to listeners, the error occurs.
This problem causes a huge number of `active stages` showing in the Spark UI, which is really annoying, but it may not affect the final result, according to my test code. I'm willing to help solve this problem, any idea which part I should change? I assume `org.apache.spark.scheduler.SparkListenerBus` has something to do with it but it looks fine to me. FYI, here is the test code that reproduces the problem. I do not see a code field in the system so I put the code in a gist: https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted - Key: SPARK-2228 URL: https://issues.apache.org/jira/browse/SPARK-2228 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Baoxu Shi We are using `saveAsObjectFile` and `objectFile` to cut off lineage during iterative computing, but after several hundred iterations there will be a `NoSuchElementsError`. We checked the code and located the problem at `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is called, the `stageId` cannot be found in `stageIdToPool`, but it does exist in other HashMaps. So we think `onStageSubmitted` was not properly called. Spark did add a stage but failed to send the message to listeners. When sending the `finish` message to listeners, the error occurs. This problem causes a huge number of `active stages` showing in the Spark UI, which is really annoying, but it may not affect the final result, according to my test code. I'm willing to help solve this problem, any idea which part I should change? I assume `org.apache.spark.scheduler.SparkListenerBus` has something to do with it but it looks fine to me. FYI, here is the test code that reproduces the problem. I do not know how to put code here with highlighting, so I put the code in a gist to keep the issue clean:
https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted
[ https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baoxu Shi updated SPARK-2228: - Summary: onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted (was: onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted) onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted - Key: SPARK-2228 URL: https://issues.apache.org/jira/browse/SPARK-2228 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Baoxu Shi We are using `saveAsObjectFile` and `objectFile` to cut off lineage during iterative computing, but after several hundred iterations there will be a `NoSuchElementsError`. We checked the code and located the problem at `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is called, the `stageId` cannot be found in `stageIdToPool`, but it does exist in other HashMaps. So we think `onStageSubmitted` was not properly called. Spark did add a stage but failed to send the message to listeners. When sending the `finish` message to listeners, the error occurs. This problem causes a huge number of `active stages` showing in the Spark UI, which is really annoying, but it may not affect the final result, according to my test code. I'm willing to help solve this problem, any idea which part I should change? I assume `org.apache.spark.scheduler.SparkListenerBus` has something to do with it but it looks fine to me. FYI, here is the test code that reproduces the problem. I do not know how to put code here with highlighting, so I put the code in a gist to keep the issue clean: https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd -- This message was sent by Atlassian JIRA (v6.2#6252)
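The failure mode reported above can be illustrated with a small self-contained sketch (hypothetical class and method names, not Spark's actual JobProgressListener code): if the stage-submitted event is never delivered to a listener, an unguarded map lookup in the completion handler fails, while a defensive lookup degrades gracefully.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only: a listener that tracks the pool for each
// submitted stage, mirroring the stageIdToPool map mentioned in the report.
public class StageListenerSketch {
    private final Map<Integer, String> stageIdToPool = new HashMap<>();

    // Called when a stage-submitted event is delivered.
    public void onStageSubmitted(int stageId, String pool) {
        stageIdToPool.put(stageId, pool);
    }

    // Called when a stage-completed event is delivered. A plain
    // stageIdToPool.get(stageId) would yield null (or a NoSuchElement
    // failure for a Scala Map) when the submitted event was dropped;
    // a defensive lookup with a fallback keeps the listener alive.
    public String onStageCompleted(int stageId) {
        return stageIdToPool.getOrDefault(stageId, "unknown-pool");
    }
}
```

Whether silently falling back like this is acceptable, or the dropped event itself must be fixed, is of course the real question in this issue.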
[jira] [Commented] (SPARK-2222) Add multiclass evaluation metrics
[ https://issues.apache.org/jira/browse/SPARK-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039988#comment-14039988 ] Jun Xie commented on SPARK-2222: Hi, Alexander. I would like to take up this task. Would you mind assigning this feature to me? Right now, I am doing multi-classification, so I have the concepts you mentioned and would like to implement this in Spark MLlib. Thanks very much. Add multiclass evaluation metrics - Key: SPARK-2222 URL: https://issues.apache.org/jira/browse/SPARK-2222 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Alexander Ulanov There is no class in Spark MLlib for measuring the performance of multiclass classifiers. This task involves adding such a class and unit tests. The following measures are to be implemented: per-class, micro-averaged and weighted-averaged Precision, Recall and F1-Measure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1316) Remove use of Commons IO
[ https://issues.apache.org/jira/browse/SPARK-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039991#comment-14039991 ] Sean Owen commented on SPARK-1316: -- PR: https://github.com/apache/spark/pull/1173 Actually, Commons IO is not even a dependency right now. Remove use of Commons IO Key: SPARK-1316 URL: https://issues.apache.org/jira/browse/SPARK-1316 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Sean Owen Priority: Minor (This follows from a side point on SPARK-1133, in discussion of the PR: https://github.com/apache/spark/pull/164 ) Commons IO is barely used in the project, and can easily be replaced with equivalent calls to Guava or the existing Spark Utils.scala class. Removing a dependency feels good, and this one in particular can get a little problematic since Hadoop uses it too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li closed SPARK-1698. -- Resolution: Won't Fix Improve spark integration - Key: SPARK-1698 URL: https://issues.apache.org/jira/browse/SPARK-1698 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Guoqiang Li Assignee: Guoqiang Li Fix For: 1.1.0 Using the shade plugin to create a big JAR with all the dependencies can cause a few problems: 1. Missing jar meta-information 2. Some files are overwritten, e.g. plugin.xml 3. Different versions of a jar may co-exist 4. The jar is too big; Java 6 does not support it -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2229) SizeBasedRollingPolicy throw an java.lang.IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2229: --- Summary: SizeBasedRollingPolicy throw an java.lang.IllegalArgumentException (was: SizeBasedRollingPolicy throw an java.lang.IllegalArgumentException) SizeBasedRollingPolicy throw an java.lang.IllegalArgumentException -- Key: SPARK-2229 URL: https://issues.apache.org/jira/browse/SPARK-2229 Project: Spark Issue Type: Bug Components: Spark Core Environment: JDK6 Mac OS X Reporter: Guoqiang Li Priority: Blocker [RollingPolicy.scala#L112|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/logging/RollingPolicy.scala#L112] fails on JDK 6: {code} import java.text.SimpleDateFormat import java.util.Calendar val formatter = new SimpleDateFormat("--YYYY-MM-dd--HH-mm-ss--") {code} throws {code} val formatter = new SimpleDateFormat("--YYYY-MM-dd--HH-mm-ss--") java.lang.IllegalArgumentException: Illegal pattern character 'Y' at java.text.SimpleDateFormat.compile(SimpleDateFormat.java:768) at java.text.SimpleDateFormat.initialize(SimpleDateFormat.java:575) at java.text.SimpleDateFormat.<init>(SimpleDateFormat.java:500) at java.text.SimpleDateFormat.<init>(SimpleDateFormat.java:475) at .<init>(<console>:9) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
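As a reference point, here is a minimal Java sketch (assumed class name, not Spark's actual RollingPolicy code) of the JDK-6-safe alternative: the week-year pattern letter 'Y' was only added to SimpleDateFormat in JDK 7, while the calendar-year letter 'y' is accepted by JDK 6 and later.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class RollingPatternDemo {
    // Lowercase 'y' (calendar year) works on JDK 6; uppercase 'Y'
    // (week year) throws IllegalArgumentException there because that
    // pattern letter was only introduced in JDK 7.
    static String format(long epochMillis) {
        SimpleDateFormat formatter = new SimpleDateFormat("--yyyy-MM-dd--HH-mm-ss--");
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        return formatter.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        // The epoch start renders as --1970-01-01--00-00-00-- in UTC.
        System.out.println(format(0L));
    }
}
```

For almost all dates 'y' and 'Y' produce the same digits; they differ only around year boundaries, so lowercase 'y' is likely what a rolling file name wants anyway.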
[jira] [Commented] (SPARK-2222) Add multiclass evaluation metrics
[ https://issues.apache.org/jira/browse/SPARK-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040034#comment-14040034 ] Jun Xie commented on SPARK-2222: Nice work. I am reading the implementation of MulticlassMetrics. According to your code, for the micro average you calculate the recall and then let precision and F1-measure be equal to the recall. I am not sure whether this makes sense. According to this post: http://rushdishams.blogspot.com/2011/08/micro-and-macro-average-of-precision.html Assume we have just three classes. For each class, we have three numbers: true positives (tp), false positives (fp), and false negatives (fn). Hence, we have tp1, fp1 and fn1 for class 1, and so on. Micro-average precision: (tp1 + tp2 + tp3) / (tp1 + tp2 + tp3 + fp1 + fp2 + fp3). Micro-average recall: (tp1 + tp2 + tp3) / (tp1 + tp2 + tp3 + fn1 + fn2 + fn3). Micro-average F1-measure: the harmonic mean of precision and recall. Based on the above definitions, recall and precision should not be the same. Is that correct? Add multiclass evaluation metrics - Key: SPARK-2222 URL: https://issues.apache.org/jira/browse/SPARK-2222 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Alexander Ulanov There is no class in Spark MLlib for measuring the performance of multiclass classifiers. This task involves adding such a class and unit tests. The following measures are to be implemented: per-class, micro-averaged and weighted-averaged Precision, Recall and F1-Measure. -- This message was sent by Atlassian JIRA (v6.2#6252)
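One observation that may resolve the question: in single-label multiclass classification every misclassified example is simultaneously a false positive for the predicted class and a false negative for the true class, so the summed fp and fn counts are always equal, and micro-averaged precision, recall and F1 all collapse to the same number (overall accuracy). A small illustrative sketch (hypothetical helper, not the MulticlassMetrics API):

```java
import java.util.Arrays;

public class MicroAverage {
    // Micro-averaged precision from per-class true/false positive counts.
    static double microPrecision(long[] tp, long[] fp) {
        long tpSum = Arrays.stream(tp).sum();
        long fpSum = Arrays.stream(fp).sum();
        return (double) tpSum / (tpSum + fpSum);
    }

    // Micro-averaged recall from per-class true positive / false negative counts.
    static double microRecall(long[] tp, long[] fn) {
        long tpSum = Arrays.stream(tp).sum();
        long fnSum = Arrays.stream(fn).sum();
        return (double) tpSum / (tpSum + fnSum);
    }

    public static void main(String[] args) {
        // Counts from a consistent single-label confusion matrix: each of the
        // 4 misclassified examples adds one fp (for the predicted class) and
        // one fn (for the true class), so the fp and fn totals are both 4.
        long[] tp = {5, 3, 2}, fp = {1, 2, 1}, fn = {2, 1, 1};
        System.out.println(microPrecision(tp, fp)); // 10 / 14
        System.out.println(microRecall(tp, fn));    // 10 / 14
    }
}
```

The formulas quoted in the comment above are correct in general; they only coincide under the single-label constraint, which is exactly the multiclass setting this JIRA targets.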