[jira] [Updated] (SPARK-2227) Support dfs command

2014-06-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2227:
---

Priority: Minor  (was: Major)

 Support dfs command
 -

 Key: SPARK-2227
 URL: https://issues.apache.org/jira/browse/SPARK-2227
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin
Priority: Minor

 Potentially just delegate to Hive. 
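As a sketch of what "delegate" could mean here, a dispatcher might recognize the dfs prefix and hand the rest of the line to whatever runs filesystem commands (in Hive's case, Hadoop's shell). The names below are hypothetical, not Spark SQL's actual code:

```scala
// Hypothetical sketch of a dfs-command dispatcher: recognize the "dfs"
// prefix and hand the remainder to a delegate. The delegate is injected
// so the sketch stays self-contained; in Spark SQL it would plausibly be
// Hive / Hadoop's filesystem shell.
object DfsDispatchSketch {
  // Returns Some(exitCode) when the statement is a dfs command, None otherwise.
  def dispatch(statement: String, runDfs: String => Int): Option[Int] = {
    val t = statement.trim
    if (t.length > 4 && t.substring(0, 4).equalsIgnoreCase("dfs "))
      Some(runDfs(t.substring(4).trim))
    else
      None
  }
}
```

Non-dfs statements fall through untouched, so the dispatcher composes with the rest of the SQL parser.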



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-704) ConnectionManager sometimes cannot detect loss of sending connections

2014-06-21 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039742#comment-14039742
 ] 

Mridul Muralidharan commented on SPARK-704:
---

If the remote node goes down, the SendingConnection would be notified, since it is also registered for read events (to handle precisely this case, actually). The ReceivingConnection would be notified anyway, since it is waiting on reads on that socket.

This, of course, assumes that the local node detects remote node failure at the TCP layer.
Problems come in when 

 ConnectionManager sometimes cannot detect loss of sending connections
 -

 Key: SPARK-704
 URL: https://issues.apache.org/jira/browse/SPARK-704
 Project: Spark
  Issue Type: Bug
Reporter: Charles Reiss
Assignee: Henry Saputra

 ConnectionManager currently does not detect when SendingConnections 
 disconnect except if it is trying to send through them. As a result, a node 
 failure just after a connection is initiated but before any acknowledgement 
 messages can be sent may result in a hang.
 ConnectionManager has code intended to detect this case by detecting the 
 failure of a corresponding ReceivingConnection, but this code assumes that 
 the remote host:port of the ReceivingConnection is the same as the 
 ConnectionManagerId, which is almost never true. Additionally, there does not 
 appear to be any reason to assume a corresponding ReceivingConnection will 
 exist.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-704) ConnectionManager sometimes cannot detect loss of sending connections

2014-06-21 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039742#comment-14039742
 ] 

Mridul Muralidharan edited comment on SPARK-704 at 6/21/14 9:10 AM:


If the remote node goes down, the SendingConnection would be notified, since it is also registered for read events (to handle precisely this case, actually). The ReceivingConnection would be notified, since it is waiting on reads on that socket.

This, of course, assumes that the local node detects remote node failure at the TCP layer.
Problems come in when this is not detected due to no activity on the socket (at the app and socket level; keepalive timeout, etc.).
Usually this is detected via application-level ping/keepalive messages; not sure if we want to introduce that into Spark ...
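The ping/keepalive idea can be sketched independently of Spark. A minimal sketch, assuming a registry of connections each carrying a last-activity timestamp (all names here are illustrative, not Spark's actual ConnectionManager API):

```scala
// Hypothetical sketch of application-level keepalive detection: a connection
// is considered lost when no traffic (including pings) has been observed
// within the timeout. None of these names come from Spark's ConnectionManager.
object KeepaliveSketch {
  final case class Conn(id: String, lastActivityMs: Long)

  // Ids of connections with no activity for longer than timeoutMs.
  def expired(conns: Seq[Conn], nowMs: Long, timeoutMs: Long): Seq[String] =
    conns.filter(c => nowMs - c.lastActivityMs > timeoutMs).map(_.id)

  // Record a ping (or any traffic) on a connection, refreshing its timestamp.
  def touch(conns: Seq[Conn], id: String, nowMs: Long): Seq[Conn] =
    conns.map(c => if (c.id == id) c.copy(lastActivityMs = nowMs) else c)
}
```

A reaper thread would call expired periodically and tear down the returned connections; any received ping would call touch.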


was (Author: mridulm80):
If the remote node goes down, the SendingConnection would be notified, since it is also registered for read events (to handle precisely this case, actually). The ReceivingConnection would be notified anyway, since it is waiting on reads on that socket.

This, of course, assumes that the local node detects remote node failure at the TCP layer.
Problems come in when 

 ConnectionManager sometimes cannot detect loss of sending connections
 -

 Key: SPARK-704
 URL: https://issues.apache.org/jira/browse/SPARK-704
 Project: Spark
  Issue Type: Bug
Reporter: Charles Reiss
Assignee: Henry Saputra

 ConnectionManager currently does not detect when SendingConnections 
 disconnect except if it is trying to send through them. As a result, a node 
 failure just after a connection is initiated but before any acknowledgement 
 messages can be sent may result in a hang.
 ConnectionManager has code intended to detect this case by detecting the 
 failure of a corresponding ReceivingConnection, but this code assumes that 
 the remote host:port of the ReceivingConnection is the same as the 
 ConnectionManagerId, which is almost never true. Additionally, there does not 
 appear to be any reason to assume a corresponding ReceivingConnection will 
 exist.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1568) Spark 0.9.0 hangs reading s3

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039797#comment-14039797
 ] 

Sean Owen commented on SPARK-1568:
--

Sam, did the other recent changes to S3 deps resolve this, do you think?

 Spark 0.9.0 hangs reading s3
 

 Key: SPARK-1568
 URL: https://issues.apache.org/jira/browse/SPARK-1568
 Project: Spark
  Issue Type: Bug
Reporter: sam

 I've tried several jobs now, and many of the tasks complete, then it gets stuck 
 and just hangs.  The exact same jobs function perfectly fine if I distcp to 
 hdfs first and read from hdfs.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039802#comment-14039802
 ] 

Sean Owen commented on SPARK-2223:
--

On a latest-generation Macbook Pro here, a full 'mvn clean install' takes 91:50 
without zinc. With zinc, it's 51:02.

 Building and running tests with maven is extremely slow
 ---

 Key: SPARK-2223
 URL: https://issues.apache.org/jira/browse/SPARK-2223
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Thomas Graves

 For some reason using maven with Spark is extremely slow.  Building and 
 running tests takes way longer than other projects I have used that use 
 maven.  We should investigate to see why.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1339) Build error: org.eclipse.paho:mqtt-client

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039804#comment-14039804
 ] 

Sean Owen commented on SPARK-1339:
--

(Just cruising some old issues) I can't reproduce this, and this is a general 
symptom of a repo not being accessible. It's actually nothing to do with 
mqtt-client per se. Also, we've fixed some repo issues along the way.

 Build error: org.eclipse.paho:mqtt-client
 -

 Key: SPARK-1339
 URL: https://issues.apache.org/jira/browse/SPARK-1339
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.0
Reporter: Ken Williams

 Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded.  
 The Maven error is:
 {code}
 [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
 resolve dependencies for project 
 org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
 artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
 {code}
 My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
 Is there an additional Maven repository I should add or something?
 If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
 {{examples}} modules, the build succeeds.  I'm fine without the MQTT stuff, 
 but I would really like to get the examples working because I haven't played 
 with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1138) Spark 0.9.0 does not work with Hadoop / HDFS

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039805#comment-14039805
 ] 

Sean Owen commented on SPARK-1138:
--

This is no longer observed in the unit tests. The comments here say it was a 
Netty dependency problem, and I know that has since been cleaned up. Suggest 
this is resolved, then?

 Spark 0.9.0 does not work with Hadoop / HDFS
 

 Key: SPARK-1138
 URL: https://issues.apache.org/jira/browse/SPARK-1138
 Project: Spark
  Issue Type: Bug
Reporter: Sam Abeyratne

 UPDATE: This problem is certainly related to trying to use Spark 0.9.0 and 
 the latest Cloudera Hadoop / HDFS in the same jar.  It seems no matter how I 
 fiddle with the deps, they do not play nicely together.
 I'm getting a java.util.concurrent.TimeoutException when trying to create a 
 spark context with 0.9.  I cannot, whatever I do, change the timeout.  I've 
 tried using System.setProperty, the SparkConf mechanism of creating a 
 SparkContext and the -D flags when executing my jar.  I seem to be able to 
 run simple jobs from the spark-shell OK, but my more complicated jobs require 
 external libraries so I need to build jars and execute them.
 Some code that causes this:
 println("Creating config")
 val conf = new SparkConf()
   .setMaster(clusterMaster)
   .setAppName("MyApp")
   .setSparkHome(sparkHome)
   .set("spark.akka.askTimeout", parsed.getOrElse("timeouts", "100"))
   .set("spark.akka.timeout", parsed.getOrElse("timeouts", "100"))
 println("Creating sc")
 implicit val sc = new SparkContext(conf)
 The output:
 Creating config
 Creating sc
 log4j:WARN No appenders could be found for logger 
 (akka.event.slf4j.Slf4jLogger).
 log4j:WARN Please initialize the log4j system properly.
 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
 info.
 [ERROR] [02/26/2014 11:05:25.491] [main] [Remoting] Remoting error: [Startup 
 timed out] [
 akka.remote.RemoteTransportException: Startup timed out
   at 
 akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129)
   at akka.remote.Remoting.start(Remoting.scala:191)
   at 
 akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
   at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
   at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
   at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
   at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
   at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
   at 
 org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96)
   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126)
   at org.apache.spark.SparkContext.init(SparkContext.scala:139)
   at 
 com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40)
   at 
 com.adbrain.accuracy.EvaluateAdtruthIDs.main(EvaluateAdtruthIDs.scala)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
 [1 milliseconds]
   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
   at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
   at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
   at scala.concurrent.Await$.result(package.scala:107)
   at akka.remote.Remoting.start(Remoting.scala:173)
   ... 11 more
 ]
 Exception in thread main java.util.concurrent.TimeoutException: Futures 
 timed out after [1 milliseconds]
   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
   at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
   at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
   at scala.concurrent.Await$.result(package.scala:107)
   at akka.remote.Remoting.start(Remoting.scala:173)
   at 
 akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
   at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
   at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
   at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
   at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
   at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
   at 
 org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96)
   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126)
   at org.apache.spark.SparkContext.init(SparkContext.scala:139)
   at 
 com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40)
   at 
 

[jira] [Commented] (SPARK-1568) Spark 0.9.0 hangs reading s3

2014-06-21 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039854#comment-14039854
 ] 

sam commented on SPARK-1568:


When we upgrade to 1.0.0 I'll test this.

This particular problem was from quite a while back, when our cluster was quite 
different from what it is now.  At the moment we get the jets3t thing, which is 
supposed to go away in 1.0.0.

 Spark 0.9.0 hangs reading s3
 

 Key: SPARK-1568
 URL: https://issues.apache.org/jira/browse/SPARK-1568
 Project: Spark
  Issue Type: Bug
Reporter: sam

 I've tried several jobs now, and many of the tasks complete, then it gets stuck 
 and just hangs.  The exact same jobs function perfectly fine if I distcp to 
 hdfs first and read from hdfs.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2212) Hash Outer Joins

2014-06-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2212:


Summary: Hash Outer Joins  (was: HashJoin)

 Hash Outer Joins
 

 Key: SPARK-2212
 URL: https://issues.apache.org/jira/browse/SPARK-2212
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2212) Hash Outer Joins

2014-06-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2212:


Priority: Major  (was: Critical)

 Hash Outer Joins
 

 Key: SPARK-2212
 URL: https://issues.apache.org/jira/browse/SPARK-2212
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-2214) Broadcast Join (aka map join)

2014-06-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-2214.
---

Resolution: Duplicate

 Broadcast Join (aka map join)
 -

 Key: SPARK-2214
 URL: https://issues.apache.org/jira/browse/SPARK-2214
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1800) Add broadcast hash join operator

2014-06-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1800:


Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-2211

 Add broadcast hash join operator
 

 Key: SPARK-1800
 URL: https://issues.apache.org/jira/browse/SPARK-1800
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2195) Parquet extraMetadata can contain key information

2014-06-21 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039915#comment-14039915
 ] 

Michael Armbrust commented on SPARK-2195:
-

Yeah, thanks for taking care of this so quickly!

 Parquet extraMetadata can contain key information
 -

 Key: SPARK-2195
 URL: https://issues.apache.org/jira/browse/SPARK-2195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.1, 1.1.0


 {code}
 14/06/19 01:52:05 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: 
 file:/Users/pat/Projects/spark-summit-training-2014/usb/data/wiki-parquet/part-r-1.parquet
  start: 0 length: 24971040 hosts: [localhost] blocks: 1 requestedSchema: same 
 as file fileSchema: message root {
   optional int32 id;
   optional binary title;
   optional int64 modified;
   optional binary text;
   optional binary username;
 }
  extraMetadata: 
 {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true),
  StructField(title,StringType,true), StructField(modified,LongType,true), 
 StructField(text,StringType,true), StructField(username,StringType,true))), 
 path= MY AWS KEYS!!! } 
 readSupportMetadata: 
 {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true),
  StructField(title,StringType,true), StructField(modified,LongType,true), 
 StructField(text,StringType,true), StructField(username,StringType,true))), 
 path= MY AWS KEYS 
 ***}}
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2195) Parquet extraMetadata can contain key information

2014-06-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2195.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1

 Parquet extraMetadata can contain key information
 -

 Key: SPARK-2195
 URL: https://issues.apache.org/jira/browse/SPARK-2195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.1, 1.1.0


 {code}
 14/06/19 01:52:05 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: 
 file:/Users/pat/Projects/spark-summit-training-2014/usb/data/wiki-parquet/part-r-1.parquet
  start: 0 length: 24971040 hosts: [localhost] blocks: 1 requestedSchema: same 
 as file fileSchema: message root {
   optional int32 id;
   optional binary title;
   optional int64 modified;
   optional binary text;
   optional binary username;
 }
  extraMetadata: 
 {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true),
  StructField(title,StringType,true), StructField(modified,LongType,true), 
 StructField(text,StringType,true), StructField(username,StringType,true))), 
 path= MY AWS KEYS!!! } 
 readSupportMetadata: 
 {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true),
  StructField(title,StringType,true), StructField(modified,LongType,true), 
 StructField(text,StringType,true), StructField(username,StringType,true))), 
 path= MY AWS KEYS 
 ***}}
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-2227) Support dfs command

2014-06-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-2227:
-


Sorry, I'll reopen this since you already have a PR with just this change.

 Support dfs command
 -

 Key: SPARK-2227
 URL: https://issues.apache.org/jira/browse/SPARK-2227
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Minor

 Potentially just delegate to Hive. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2220) Fix remaining Hive Commands

2014-06-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2220:


Description: 
None of the following have an execution plan:
{code}
private[hive] case class ShellCommand(cmd: String) extends Command
private[hive] case class SourceCommand(filePath: String) extends Command
private[hive] case class AddFile(filePath: String) extends Command
{code}

dfs is being fixed in a related PR.

  was:
None of the following have an execution plan:
{code}
private[hive] case class DfsCommand(cmd: String) extends Command
private[hive] case class ShellCommand(cmd: String) extends Command
private[hive] case class SourceCommand(filePath: String) extends Command
private[hive] case class AddFile(filePath: String) extends Command
{code}


 Fix remaining Hive Commands
 ---

 Key: SPARK-2220
 URL: https://issues.apache.org/jira/browse/SPARK-2220
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
 Fix For: 1.1.0


 None of the following have an execution plan:
 {code}
 private[hive] case class ShellCommand(cmd: String) extends Command
 private[hive] case class SourceCommand(filePath: String) extends Command
 private[hive] case class AddFile(filePath: String) extends Command
 {code}
 dfs is being fixed in a related PR.
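The missing piece described above is an execution strategy for each command. A toy sketch of the shape such a mapping could take, with local stand-in case classes rather than Spark's actual hive module:

```scala
// Local stand-ins mirroring the shape of the command classes quoted above;
// these are not Spark's hive module, just an illustration of attaching an
// "execution plan" (here, a plain description) to each command via a match.
sealed trait Command
final case class ShellCommand(cmd: String) extends Command
final case class AddFile(filePath: String) extends Command

object CommandRunnerSketch {
  // An execution plan in miniature: each command maps to a description of
  // what would run. A real planner would return a physical operator instead.
  def plan(c: Command): String = c match {
    case ShellCommand(cmd) => s"exec shell: $cmd"
    case AddFile(path)     => s"add file to session: $path"
  }
}
```

Because the trait is sealed, the compiler flags any command class that is added without a corresponding plan, which is exactly the gap this issue describes.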



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915

2014-06-21 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039937#comment-14039937
 ] 

Ted Malaska commented on SPARK-1478:


OK, I have made the changes requested, but I had to do it in a different pull 
request.  Here is the new pull request link:

https://github.com/apache/spark/pull/1168

 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
 ---

 Key: SPARK-1478
 URL: https://issues.apache.org/jira/browse/SPARK-1478
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor
 Fix For: 1.1.0


 Flume-1915 added support for compression over the wire from avro sink to avro 
 source.  I would like to add this functionality to the FlumeReceiver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1675) Make clear whether computePrincipalComponents centers data

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039944#comment-14039944
 ] 

Sean Owen commented on SPARK-1675:
--

Is this still valid? Looking at the code, PCA is computed as the SVD of the 
covariance matrix, so the means implicitly don't matter: they are not explicitly 
subtracted, and they do not affect the result. Or is there still a doc change desired?
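The observation can be checked numerically: sample covariance satisfies cov(x, y) = E[xy] - E[x]E[y], so shifting either variable by a constant leaves it unchanged, i.e. PCA on the covariance matrix centers the data implicitly. A plain-Scala sketch, not MLlib's RowMatrix code:

```scala
// Numeric check that covariance (and hence PCA on the covariance matrix)
// is invariant to shifting the data: cov(x, y) = E[xy] - E[x]E[y].
// Plain Scala, not MLlib's actual implementation.
object CovSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  // Population covariance of two equal-length samples.
  def cov(xs: Seq[Double], ys: Seq[Double]): Double =
    mean(xs.zip(ys).map { case (x, y) => x * y }) - mean(xs) * mean(ys)
}
```

Shifting x by +100 and y by -7 leaves cov(x, y) unchanged up to floating-point error, which is why no explicit centering step appears in the code.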

 Make clear whether computePrincipalComponents centers data
 --

 Key: SPARK-1675
 URL: https://issues.apache.org/jira/browse/SPARK-1675
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-06-21 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039946#comment-14039946
 ] 

Patrick Wendell commented on SPARK-1392:


I mentioned this on the pull request, but I think this was an instance of 
SPARK-1777. I'm running some tests locally on the pull request there to 
determine whether that was the case.

 Local spark-shell Runs Out of Memory With Default Settings
 --

 Key: SPARK-1392
 URL: https://issues.apache.org/jira/browse/SPARK-1392
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
 Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
Reporter: Pat McDonough

 Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
 the spark-shell locally in out of the box configuration, and attempting to 
 cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
 overhead limit exceeded
 You can work around the issue by either decreasing 
 spark.storage.memoryFraction or increasing SPARK_MEM
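For context on the workaround: the memory available for cached blocks is roughly heap size × spark.storage.memoryFraction × a safety factor, so shrinking the fraction or growing SPARK_MEM both raise headroom. A back-of-the-envelope sketch; the 0.6 and 0.9 defaults are assumptions based on the Spark configuration docs of that era, not values read from this report:

```scala
// Back-of-the-envelope cache capacity: heap * memoryFraction * safetyFraction.
// The 0.6 and 0.9 defaults are assumptions from 0.9-era Spark docs.
object CacheCapacitySketch {
  def cacheBytes(heapBytes: Long,
                 memoryFraction: Double = 0.6,
                 safetyFraction: Double = 0.9): Long =
    (heapBytes * memoryFraction * safetyFraction).toLong
}
```

With these defaults, only a little over half the heap is available for caching, so a dataset sized near the full heap will push the JVM into GC-overhead territory.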



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1846) RAT checks should exclude logs/ directory

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039945#comment-14039945
 ] 

Sean Owen commented on SPARK-1846:
--

Just looking over some old JIRAs. This appears to be resolved already; logs/ is 
excluded.

 RAT checks should exclude logs/ directory
 -

 Key: SPARK-1846
 URL: https://issues.apache.org/jira/browse/SPARK-1846
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Andrew Ash

 When there are logs in the logs/ directory, the rat check from 
 ./dev/check-license fails.
 ```
 aash@aash-mbp ~/git/spark$ find logs -type f
 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out
 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1
 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2
 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3
 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4
 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5
 logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out
 logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1
 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out
 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1
 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2
 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3
 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4
 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5
 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out
 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1
 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2
 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out
 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1
 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2
 aash@aash-mbp ~/git/spark$ ./dev/check-license
 Could not find Apache license headers in the following files:
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4
  !? 
 /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5
  !? 
 /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out
  !? 
 /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1
  !? 
 /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2
  !? 
 /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out
  !? 
 /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1
  !? 
 /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2
 aash@aash-mbp ~/git/spark$
 ```



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1804) Mark 0.9.1 as released in JIRA

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039947#comment-14039947
 ] 

Sean Owen commented on SPARK-1804:
--

Looks like this can be closed as resolved.
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:versions-panel

 Mark 0.9.1 as released in JIRA
 --

 Key: SPARK-1804
 URL: https://issues.apache.org/jira/browse/SPARK-1804
 Project: Spark
  Issue Type: Task
  Components: Documentation, Project Infra
Affects Versions: 0.9.1
Reporter: Stevo Slavic
Priority: Trivial

 0.9.1 has been released but is labeled as unreleased in the SPARK JIRA project. 
 Please have it marked as released. Also, please document that step in the 
 release process.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1803) Rename test resources to be compatible with Windows FS

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039948#comment-14039948
 ] 

Sean Owen commented on SPARK-1803:
--

The PR was committed, so this is another that seems closeable.

 Rename test resources to be compatible with Windows FS
 --

 Key: SPARK-1803
 URL: https://issues.apache.org/jira/browse/SPARK-1803
 Project: Spark
  Issue Type: Task
  Components: Windows
Affects Versions: 0.9.1
Reporter: Stevo Slavic
Priority: Trivial

 {{git clone}} of master branch and then {{git status}} on Windows reports 
 untracked files:
 {noformat}
 # Untracked files:
 #   (use "git add <file>..." to include in what will be committed)
 #
 #   sql/hive/src/test/resources/golden/Column pruning
 #   sql/hive/src/test/resources/golden/Partition pruning
 #   sql/hive/src/test/resources/golden/Partiton pruning
 {noformat}
 The actual issue is that several files under 
 {{sql/hive/src/test/resources/golden}} have a colon in the name, which is 
 an invalid character in file names on Windows.
 Please have these files renamed to Windows-compatible file names.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1046) Enable to build behind a proxy.

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039950#comment-14039950
 ] 

Sean Owen commented on SPARK-1046:
--

Is this stale / resolved? I don't see this in the code at this point.

 Enable to build behind a proxy.
 ---

 Key: SPARK-1046
 URL: https://issues.apache.org/jira/browse/SPARK-1046
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.8.1
Reporter: Kousuke Saruta
Priority: Minor

 I tried to build spark-0.8.1 behind a proxy and failed, although I set 
 http/https.proxyHost, proxyPort, proxyUser, and proxyPassword.
 I found it's caused by accessing GitHub using the git protocol (git://).
 The URL is hard-coded in SparkPluginBuild.scala as follows.
 {code}
 lazy val junitXmlListener = 
 uri("git://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce")
 {code}
 After I rewrite the URL as follows, I could build successfully.
 {code}
 lazy val junitXmlListener = 
 uri("https://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce")
 {code}
 I think we should be able to build whether we are behind a proxy or not.
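Until the hard-coded URL is changed, one common workaround (a hedged suggestion, not part of this issue's patch) is to have git itself rewrite git:// URLs to https://, which most proxies can pass:

```shell
# Rewrite git:// GitHub URLs to https:// for all git operations,
# including the fetches sbt performs when resolving source dependencies.
git config --global url."https://github.com/".insteadOf "git://github.com/"
```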





[jira] [Updated] (SPARK-1675) Make clear whether computePrincipalComponents requires centered data

2014-06-21 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-1675:
--

Summary: Make clear whether computePrincipalComponents requires centered 
data  (was: Make clear whether computePrincipalComponents centers data)

 Make clear whether computePrincipalComponents requires centered data
 

 Key: SPARK-1675
 URL: https://issues.apache.org/jira/browse/SPARK-1675
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza







[jira] [Commented] (SPARK-721) Fix remaining deprecation warnings

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039951#comment-14039951
 ] 

Sean Owen commented on SPARK-721:
-

This appears to be resolved as I don't think these warnings have been in the 
build for a while.

 Fix remaining deprecation warnings
 --

 Key: SPARK-721
 URL: https://issues.apache.org/jira/browse/SPARK-721
 Project: Spark
  Issue Type: Improvement
Affects Versions: 0.7.1
Reporter: Josh Rosen
Assignee: Gary Struthers
Priority: Minor
  Labels: Starter

 The recent patch to re-enable deprecation warnings fixed many of them, but 
 there's still a few left; it would be nice to fix them.
 For example, here's one in RDDSuite:
 {code}
 [warn] 
 /Users/joshrosen/Documents/spark/spark/core/src/test/scala/spark/RDDSuite.scala:32:
  method mapPartitionsWithSplit in class RDD is deprecated: use 
 mapPartitionsWithIndex
 [warn] val partitionSumsWithSplit = nums.mapPartitionsWithSplit {
 [warn]   ^
 [warn] one warning found
 {code}
 Also, it looks like Scala 2.9 added a second deprecatedSince parameter to 
 @deprecated. We didn't fill this in, which causes some additional warnings:
 {code}
 [warn] 
 /Users/joshrosen/Documents/spark/spark/core/src/main/scala/spark/RDD.scala:370:
  @deprecated now takes two arguments; see the scaladoc.
 [warn]   @deprecated("use mapPartitionsWithIndex")
 [warn]^
 [warn] one warning found
 {code}
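For reference, a minimal sketch of the two-argument form (the method name and version string here are illustrative, not Spark's actual code):

```scala
// Two-argument @deprecated, available since Scala 2.9: a message plus
// the version in which the symbol was deprecated.
@deprecated("use mapPartitionsWithIndex", "0.7.0")
def mapPartitionsWithSplit(n: Int): Int = n

// Calling it still works; the compiler merely emits a deprecation warning.
val result = mapPartitionsWithSplit(3)
```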





[jira] [Commented] (SPARK-1675) Make clear whether computePrincipalComponents requires centered data

2014-06-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039952#comment-14039952
 ] 

Sandy Ryza commented on SPARK-1675:
---

I think it still wouldn't hurt to add a remark that input data doesn't need to 
be centered.  Should have marked this trivial when I filed it - not a big deal 
either way.

 Make clear whether computePrincipalComponents requires centered data
 

 Key: SPARK-1675
 URL: https://issues.apache.org/jira/browse/SPARK-1675
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Trivial







[jira] [Updated] (SPARK-1675) Make clear whether computePrincipalComponents requires centered data

2014-06-21 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-1675:
--

Priority: Trivial  (was: Major)

 Make clear whether computePrincipalComponents requires centered data
 

 Key: SPARK-1675
 URL: https://issues.apache.org/jira/browse/SPARK-1675
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Trivial







[jira] [Commented] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-06-21 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039953#comment-14039953
 ] 

Patrick Wendell commented on SPARK-1392:


Okay great, I confirmed this is fixed by SPARK-1777. I tested as follows:

{code}
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean 
assembly/assembly
sc.textFile("/tmp/wiki_links").cache.count
{code}

The wiki_links file was downloaded and extracted from here:

This worked with the proposed patch but failed with the default build.

 Local spark-shell Runs Out of Memory With Default Settings
 --

 Key: SPARK-1392
 URL: https://issues.apache.org/jira/browse/SPARK-1392
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
 Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
Reporter: Pat McDonough

 Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
 the spark-shell locally in out of the box configuration, and attempting to 
 cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
 overhead limit exceeded
 You can work around the issue by either decreasing 
 spark.storage.memoryFraction or increasing SPARK_MEM
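The two workarounds above can be applied from the shell, e.g. (values are illustrative; in the 0.9 line, SPARK_JAVA_OPTS was the usual way to pass system properties):

```shell
# Give the local shell a bigger heap (pre-1.0 releases honored SPARK_MEM)...
SPARK_MEM=4g ./bin/spark-shell

# ...or reserve a smaller fraction of the heap for cached blocks
# (the default for spark.storage.memoryFraction is 0.6).
SPARK_JAVA_OPTS="-Dspark.storage.memoryFraction=0.5" ./bin/spark-shell
```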





[jira] [Comment Edited] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-06-21 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039953#comment-14039953
 ] 

Patrick Wendell edited comment on SPARK-1392 at 6/21/14 9:15 PM:
-

Okay great, I confirmed this is fixed by SPARK-1777. I tested as follows:

{code}
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean 
assembly/assembly
sc.textFile("/tmp/wiki_links").cache.count
{code}

The wiki_links file was downloaded and extracted from here:
https://drive.google.com/file/d/0BwrkCxCycBCyTmlWYXp0MmdEakk/edit?usp=sharing

This worked with the proposed patch but failed with the default build.


was (Author: pwendell):
Okay great, I confirmed this is fixed by SPARK-1777. I tested as follows:

{code}
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean 
assembly/assembly
sc.textFile("/tmp/wiki_links").cache.count
{code}

The wiki_links file was downloaded and extracted from here:

This worked with the proposed patch but failed with the default build.

 Local spark-shell Runs Out of Memory With Default Settings
 --

 Key: SPARK-1392
 URL: https://issues.apache.org/jira/browse/SPARK-1392
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
 Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
Reporter: Pat McDonough

 Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
 the spark-shell locally in out of the box configuration, and attempting to 
 cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
 overhead limit exceeded
 You can work around the issue by either decreasing 
 spark.storage.memoryFraction or increasing SPARK_MEM





[jira] [Resolved] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-06-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1392.


Resolution: Duplicate

 Local spark-shell Runs Out of Memory With Default Settings
 --

 Key: SPARK-1392
 URL: https://issues.apache.org/jira/browse/SPARK-1392
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
 Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
Reporter: Pat McDonough

 Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
 the spark-shell locally in out of the box configuration, and attempting to 
 cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
 overhead limit exceeded
 You can work around the issue by either decreasing 
 spark.storage.memoryFraction or increasing SPARK_MEM





[jira] [Commented] (SPARK-1996) Remove use of special Maven repo for Akka

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039965#comment-14039965
 ] 

Sean Owen commented on SPARK-1996:
--

PR: https://github.com/apache/spark/pull/1170

 Remove use of special Maven repo for Akka
 -

 Key: SPARK-1996
 URL: https://issues.apache.org/jira/browse/SPARK-1996
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Matei Zaharia
 Fix For: 1.0.1


 According to http://doc.akka.io/docs/akka/2.3.3/intro/getting-started.html 
 Akka is now published to Maven Central, so our documentation and POM files 
 don't need to use the old Akka repo. It will be one less step for users to 
 worry about.





[jira] [Created] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted

2014-06-21 Thread Baoxu Shi (JIRA)
Baoxu Shi created SPARK-2228:


 Summary: onStageSubmitted does not properly called so 
NoSuchElement will throw in onStageCompleted
 Key: SPARK-2228
 URL: https://issues.apache.org/jira/browse/SPARK-2228
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Baoxu Shi


We are using `saveAsObjectFile` and `objectFile` to cut off lineage during 
iterative computation, but after several hundred iterations a 
`NoSuchElementsError` is thrown. We checked the code and located the problem in 
`org.apache.spark.ui.jobs.JobProgressListener`: when `onStageCompleted` is 
called, the `stageId` cannot be found in `stageIdToPool`, but it does exist in 
the other HashMaps. So we think `onStageSubmitted` is not being called 
properly: Spark did add a stage but failed to send the message to listeners, 
and the error occurs when the `finish` message is sent to listeners.

This problem causes a huge number of active stages to show in the Spark UI, 
which is really annoying, but it may not affect the final result, according to 
my test code.

I'm willing to help solve this problem. Any idea about which part I should 
change? I assume `org.apache.spark.scheduler.SparkListenerBus` has something 
to do with it, but it looks fine to me.

FYI, here is the test code that reproduces the problem. I do not see a way to 
attach code in the system, so I put the code on a gist.

https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd





[jira] [Updated] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted

2014-06-21 Thread Baoxu Shi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baoxu Shi updated SPARK-2228:
-

Description: 
We are using `saveAsObjectFile` and `objectFile` to cut off lineage during 
iterative computation, but after several hundred iterations a 
`NoSuchElementsError` is thrown. We checked the code and located the problem in 
`org.apache.spark.ui.jobs.JobProgressListener`: when `onStageCompleted` is 
called, the `stageId` cannot be found in `stageIdToPool`, but it does exist in 
the other HashMaps. So we think `onStageSubmitted` is not being called 
properly: Spark did add a stage but failed to send the message to listeners, 
and the error occurs when the `finish` message is sent to listeners.

This problem causes a huge number of active stages to show in the Spark UI, 
which is really annoying, but it may not affect the final result, according to 
my test code.

I'm willing to help solve this problem. Any idea about which part I should 
change? I assume `org.apache.spark.scheduler.SparkListenerBus` has something 
to do with it, but it looks fine to me.

FYI, here is the test code that reproduces the problem. I do not know how to 
put code here with highlighting, so I put the code on a gist to keep the issue 
clean.

https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd

  was:
We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
iterative computing, but after several hundreds of iterations, there will be 
`NoSuchElementsError`. We check the code and locate the problem at 
`org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
in other HashMaps. So we think `onStageSubmitted` is not properly called. 
`Spark` did add a stage but failed to send the message to listeners. When 
sending `finish` message to listeners, the error occurs. 

This problem will cause a huge number of `active stages` showing in `SparkUI`, 
which is really annoying. But it may not affect the final result, according to 
the result of my testing code.

I'm willing to help solve this problem, any idea about which part should I 
change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
to do with it but it looks fine to me.

FYI, here is the test code that could reproduce the problem. I do not see code 
filed in the system so I put the code on gist.

https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd


 onStageSubmitted does not properly called so NoSuchElement will throw in 
 onStageCompleted
 -

 Key: SPARK-2228
 URL: https://issues.apache.org/jira/browse/SPARK-2228
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Baoxu Shi

 We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
 iterative computing, but after several hundreds of iterations, there will be 
 `NoSuchElementsError`. We check the code and locate the problem at 
 `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
 called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
 in other HashMaps. So we think `onStageSubmitted` is not properly called. 
 `Spark` did add a stage but failed to send the message to listeners. When 
 sending `finish` message to listeners, the error occurs. 
 This problem will cause a huge number of `active stages` showing in 
 `SparkUI`, which is really annoying. But it may not affect the final result, 
 according to the result of my testing code.
 I'm willing to help solve this problem, any idea about which part should I 
 change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
 to do with it but it looks fine to me.
 FYI, here is the test code that could reproduce the problem. I do not know 
 who to put code here with highlight, so I put the code on gist to make the 
 issue looks clean.
 https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd





[jira] [Updated] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted

2014-06-21 Thread Baoxu Shi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baoxu Shi updated SPARK-2228:
-

Summary: onStageSubmitted does not properly called so NoSuchElement will be 
thrown in onStageCompleted  (was: onStageSubmitted does not properly called so 
NoSuchElement will throw in onStageCompleted)

 onStageSubmitted does not properly called so NoSuchElement will be thrown in 
 onStageCompleted
 -

 Key: SPARK-2228
 URL: https://issues.apache.org/jira/browse/SPARK-2228
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.0
Reporter: Baoxu Shi

 We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
 iterative computing, but after several hundreds of iterations, there will be 
 `NoSuchElementsError`. We check the code and locate the problem at 
 `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
 called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
 in other HashMaps. So we think `onStageSubmitted` is not properly called. 
 `Spark` did add a stage but failed to send the message to listeners. When 
 sending `finish` message to listeners, the error occurs. 
 This problem will cause a huge number of `active stages` showing in 
 `SparkUI`, which is really annoying. But it may not affect the final result, 
 according to the result of my testing code.
 I'm willing to help solve this problem, any idea about which part should I 
 change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
 to do with it but it looks fine to me.
 FYI, here is the test code that could reproduce the problem. I do not know 
 who to put code here with highlight, so I put the code on gist to make the 
 issue looks clean.
 https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd





[jira] [Commented] (SPARK-2222) Add multiclass evaluation metrics

2014-06-21 Thread Jun Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039988#comment-14039988
 ] 

Jun Xie commented on SPARK-2222:


Hi, Alexander. I would like to take up this task. Would you mind assigning this 
feature to me? Right now I am working on multiclass classification, so I am 
familiar with the concepts you mentioned and would like to implement this in 
Spark MLlib.

Thanks very much.

 Add multiclass evaluation metrics
 -

 Key: SPARK-2222
 URL: https://issues.apache.org/jira/browse/SPARK-2222
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Alexander Ulanov

 There is no class in Spark MLlib for measuring the performance of multiclass 
 classifiers. This task involves adding such class and unit tests. The 
 following measures are to be implemented: per class, micro averaged and 
 weighted averaged Precision, Recall and F1-Measure.





[jira] [Commented] (SPARK-1316) Remove use of Commons IO

2014-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039991#comment-14039991
 ] 

Sean Owen commented on SPARK-1316:
--

PR: https://github.com/apache/spark/pull/1173

Actually, Commons IO is not even a dependency right now.

 Remove use of Commons IO
 

 Key: SPARK-1316
 URL: https://issues.apache.org/jira/browse/SPARK-1316
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Sean Owen
Priority: Minor

 (This follows from a side point on SPARK-1133, in discussion of the PR: 
 https://github.com/apache/spark/pull/164 )
 Commons IO is barely used in the project, and can easily be replaced with 
 equivalent calls to Guava or the existing Spark Utils.scala class.
 Removing a dependency feels good, and this one in particular can get a little 
 problematic since Hadoop uses it too.





[jira] [Closed] (SPARK-1698) Improve spark integration

2014-06-21 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li closed SPARK-1698.
--

Resolution: Won't Fix

 Improve spark integration
 -

 Key: SPARK-1698
 URL: https://issues.apache.org/jira/browse/SPARK-1698
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy
Reporter: Guoqiang Li
Assignee: Guoqiang Li
 Fix For: 1.1.0


 Using the shade plugin to create one big JAR with all the dependencies can 
 cause a few problems:
 1. The jars' meta information goes missing
 2. Some files get overwritten, e.g. plugin.xml
 3. Different versions of a jar may co-exist
 4. The JAR gets too big for Java 6 to handle





[jira] [Updated] (SPARK-2229) SizeBasedRollingPolicy throw an java.lang.IllegalArgumentException

2014-06-21 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2229:
---

Summary: SizeBasedRollingPolicy throw an java.lang.IllegalArgumentException 
 (was: SizeBasedRollingPolicy  throw an java.lang.IllegalArgumentException)

 SizeBasedRollingPolicy throw an java.lang.IllegalArgumentException
 --

 Key: SPARK-2229
 URL: https://issues.apache.org/jira/browse/SPARK-2229
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
 Environment: JDK6 Mac OS x
Reporter: Guoqiang Li
Priority: Blocker

 [RollingPolicy.scala#L112|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/logging/RollingPolicy.scala#L112]
 fails in JDK 6:
 {code}
 import java.text.SimpleDateFormat
 import java.util.Calendar
   val formatter = new SimpleDateFormat("--YYYY-MM-dd--HH-mm-ss--")
 {code}
 fails with:
 {code}
 val formatter = new SimpleDateFormat("--YYYY-MM-dd--HH-mm-ss--")
 java.lang.IllegalArgumentException: Illegal pattern character 'Y'
   at java.text.SimpleDateFormat.compile(SimpleDateFormat.java:768)
   at java.text.SimpleDateFormat.initialize(SimpleDateFormat.java:575)
   at java.text.SimpleDateFormat.<init>(SimpleDateFormat.java:500)
   at java.text.SimpleDateFormat.<init>(SimpleDateFormat.java:475)
   at .<init>(<console>:9)
 {code}
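A portable pattern uses the lowercase 'y' (calendar year) letter instead: 'Y' (week year) only exists since Java 7, which is why the original pattern throws on JDK 6. A minimal sketch:

```scala
import java.text.SimpleDateFormat
import java.util.Date

// 'y' (calendar year) is accepted by every JDK; 'Y' (week year) was
// only added in Java 7, so this variant also works on JDK 6.
val formatter = new SimpleDateFormat("--yyyy-MM-dd--HH-mm-ss--")
val stamp = formatter.format(new Date())
```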





[jira] [Commented] (SPARK-2222) Add multiclass evaluation metrics

2014-06-21 Thread Jun Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14040034#comment-14040034
 ] 

Jun Xie commented on SPARK-2222:


Nice work.

I am reading the implementation of MulticlassMetrics. According to your code, 
for the micro average you calculate the recall and then set precision and F1 
measure equal to the recall. I am not sure whether this makes sense.

According to this post: 
http://rushdishams.blogspot.com/2011/08/micro-and-macro-average-of-precision.html

Assume we have just three classes. For each class we have three numbers: true 
positives (tp), false positives (fp), and false negatives (fn). Hence we have 
tp1, fp1, and fn1 for class 1, and so forth.

Micro-average precision: (tp1 + tp2 + tp3) / (tp1 + tp2 + tp3 + fp1 + fp2 + fp3)
Micro-average recall: (tp1 + tp2 + tp3) / (tp1 + tp2 + tp3 + fn1 + fn2 + fn3)
Micro-average F1 measure: the harmonic mean of micro-average precision and recall.

Based on the above definitions, precision and recall should not in general be 
the same. Is that correct?
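One reason precision and recall can coincide in the micro average: for single-label multiclass classification, every misclassified instance contributes exactly one false positive (to the predicted class) and one false negative (to the true class), so the summed fp and fn totals are always equal. A minimal Scala sketch with made-up per-class counts (all values illustrative):

```scala
// Hypothetical per-class (tp, fp, fn) counts for a 3-class problem.
// In single-label classification the fp and fn columns always sum to
// the same total, since each error is one fp for some class and one fn
// for another -- hence micro precision, recall, and F1 all coincide.
val counts = Seq((4, 1, 2), (3, 2, 1), (5, 1, 1))

val tp = counts.map(_._1).sum.toDouble   // 12
val fp = counts.map(_._2).sum.toDouble   // 4
val fn = counts.map(_._3).sum.toDouble   // 4

val microPrecision = tp / (tp + fp)      // 0.75
val microRecall    = tp / (tp + fn)      // 0.75
val microF1 = 2 * microPrecision * microRecall / (microPrecision + microRecall)
```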

 Add multiclass evaluation metrics
 -

 Key: SPARK-
 URL: https://issues.apache.org/jira/browse/SPARK-
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Alexander Ulanov

 There is no class in Spark MLlib for measuring the performance of multiclass 
 classifiers. This task involves adding such class and unit tests. The 
 following measures are to be implemented: per class, micro averaged and 
 weighted averaged Precision, Recall and F1-Measure.


