[jira] [Commented] (SPARK-2468) Netty-based block server / client module

2014-11-08 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203316#comment-14203316
 ] 

Lianhui Wang commented on SPARK-2468:
-

OK, thanks [~adav]. I will try to do as you say.

 Netty-based block server / client module
 

 Key: SPARK-2468
 URL: https://issues.apache.org/jira/browse/SPARK-2468
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Fix For: 1.2.0


 Right now shuffle send goes through the block manager. This is inefficient 
 because it requires loading a block from disk into a kernel buffer, then into 
 a user-space buffer, and then back into a kernel send buffer before it reaches 
 the NIC. It makes multiple copies of the data and context-switches between 
 kernel and user space. It also creates unnecessary buffers in the JVM, which 
 increases GC pressure.
 Instead, we should use FileChannel.transferTo, which handles this in kernel 
 space with zero-copy. See 
 http://www.ibm.com/developerworks/library/j-zerocopy/
 One potential solution is to use Netty. Spark already has a Netty-based 
 network module implemented (org.apache.spark.network.netty). However, it 
 lacks some functionality and is turned off by default. 
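
For illustration, a minimal Scala sketch (not taken from the Spark code base; the
file and target channel are assumptions) of what a zero-copy send with
FileChannel.transferTo looks like, compared with the copy-through-user-space path
described above:
{code}
import java.io.{File, FileInputStream}
import java.nio.channels.WritableByteChannel

// Hypothetical helper: send `file` to an already-connected writable channel.
def sendZeroCopy(file: File, target: WritableByteChannel): Unit = {
  val channel = new FileInputStream(file).getChannel
  try {
    var position = 0L
    // transferTo lets the kernel move bytes from the file straight to the
    // target channel, without copying them through a user-space buffer.
    while (position < channel.size()) {
      position += channel.transferTo(position, channel.size() - position, target)
    }
  } finally {
    channel.close()
  }
}
{code}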






[jira] [Created] (SPARK-4305) yarn-alpha profile won't build due to network/yarn module

2014-11-08 Thread Sean Owen (JIRA)
Sean Owen created SPARK-4305:


 Summary: yarn-alpha profile won't build due to network/yarn module
 Key: SPARK-4305
 URL: https://issues.apache.org/jira/browse/SPARK-4305
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Sean Owen
Priority: Minor


SPARK-3797 introduced the {{network/yarn}} module, but its YARN code depends on 
YARN APIs not present in the older versions covered by the {{yarn-alpha}} 
profile. As a result, builds like {{mvn -Pyarn-alpha -Phadoop-0.23 
-Dhadoop.version=0.23.7 -DskipTests clean package}} fail.

The solution is simply to not build {{network/yarn}} under the {{yarn-alpha}} profile.






[jira] [Commented] (SPARK-4305) yarn-alpha profile won't build due to network/yarn module

2014-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203331#comment-14203331
 ] 

Apache Spark commented on SPARK-4305:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3167

 yarn-alpha profile won't build due to network/yarn module
 -

 Key: SPARK-4305
 URL: https://issues.apache.org/jira/browse/SPARK-4305
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Sean Owen
Priority: Minor

 SPARK-3797 introduced the {{network/yarn}} module, but its YARN code depends 
 on YARN APIs not present in the older versions covered by the {{yarn-alpha}} 
 profile. As a result, builds like {{mvn -Pyarn-alpha -Phadoop-0.23 
 -Dhadoop.version=0.23.7 -DskipTests clean package}} fail.
 The solution is simply to not build {{network/yarn}} under the {{yarn-alpha}} profile.






[jira] [Resolved] (SPARK-899) Outdated Bagel documentation

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-899.
-
Resolution: Won't Fix

Trawling old issues again... I assume this is a WontFix because GraphX has 
superseded Bagel.

 Outdated Bagel documentation
 

 Key: SPARK-899
 URL: https://issues.apache.org/jira/browse/SPARK-899
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 0.7.3
Reporter: Matteo Ceccarello

 The documentation for Bagel at 
 http://spark.incubator.apache.org/docs/latest/bagel-programming-guide.html 
 seems to be outdated.
 In the code example it refers to an Edge class that does not exist in Bagel.






[jira] [Resolved] (SPARK-952) Python version of Gaussian Mixture Model

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-952.
-
Resolution: Duplicate

I assume this is superseded, if anything, by SPARK-3588

 Python version of Gaussian Mixture Model
 

 Key: SPARK-952
 URL: https://issues.apache.org/jira/browse/SPARK-952
 Project: Spark
  Issue Type: Story
  Components: Examples
Affects Versions: 0.7.3
Reporter: caizhua
Priority: Minor
  Labels: Learning

 This piece of code was written by Shangyu Luo at Rice University. The code 
 learns a Gaussian Mixture Model.






[jira] [Resolved] (SPARK-951) Gaussian Mixture Model

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-951.
-
Resolution: Duplicate

I assume this is superseded, if anything, by SPARK-3588, so resolving as a duplicate.

 Gaussian Mixture Model
 --

 Key: SPARK-951
 URL: https://issues.apache.org/jira/browse/SPARK-951
 Project: Spark
  Issue Type: Story
  Components: Examples
Affects Versions: 0.7.3
Reporter: caizhua
Priority: Critical
  Labels: Learning, Machine, Model

 This includes the code for a Gaussian Mixture Model. The input file for this 
 program is named Gmm_spark.tbl.






[jira] [Resolved] (SPARK-956) The Spark python program for Lasso

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-956.
-
Resolution: Won't Fix

I assume this is WontFix as there was no followup. Spark has L1 regularization 
implemented already anyway.

 The Spark python program for Lasso
 --

 Key: SPARK-956
 URL: https://issues.apache.org/jira/browse/SPARK-956
 Project: Spark
  Issue Type: Story
  Components: Examples
Affects Versions: 0.7.3
Reporter: caizhua

 The code describes the Spark Python implementation of Lasso.






[jira] [Resolved] (SPARK-954) One repeated sampling, and I am not sure if it is correct.

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-954.
-
Resolution: Won't Fix

From the discussion, and later ones about guarantees of determinism in RDDs, 
sounds like this is working as intended.

 One repeated sampling, and I am not sure if it is correct.
 --

 Key: SPARK-954
 URL: https://issues.apache.org/jira/browse/SPARK-954
 Project: Spark
  Issue Type: Story
Affects Versions: 0.7.3
Reporter: caizhua

 This piece of code reads the dataset and then performs two operations on it. 
 If I consider the RDD as a view definition, I think the result is correct. 
 However, since the first iteration calls result_sample.count(), I was 
 wondering whether we should repeat the computation in the 
 initialize_doc_topic_word_count(.) function when we run the second 
 result_sample.map(lambda (block_id, doc_prob): doc_prob).count(). Since 
 people write Spark as a program, not as a database view, this is sometimes 
 confusing. For example, considering that initialize_doc_topic_word_count(.) 
 is a statistical function with runtime seeds, I am not sure whether this has 
 an impact on the result.
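
A small Scala illustration of the behaviour being discussed (an assumed example,
not code from the issue; {{sc}} is a SparkContext): each action re-runs the RDD's
lineage unless the RDD is persisted, so a transformation that uses runtime
randomness can yield different values per action.
{code}
import scala.util.Random

// Without cache(), every action re-evaluates the map, so the random values
// can differ between the two collects.
val sampled = sc.parallelize(1 to 10).map(_ => Random.nextInt())
sampled.collect()   // runs the map once
sampled.collect()   // runs the map again, producing fresh random numbers
sampled.cache()     // persist so later actions reuse one materialisation
sampled.collect()   // computes and caches
sampled.collect()   // served from the cache; same values as the previous line
{code}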






[jira] [Resolved] (SPARK-976) WikipediaPageRank doesn't work anymore

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-976.
-
Resolution: Won't Fix

I assume this is also WontFix as it is a Bagel example.

 WikipediaPageRank doesn't work anymore
 --

 Key: SPARK-976
 URL: https://issues.apache.org/jira/browse/SPARK-976
 Project: Spark
  Issue Type: Bug
  Components: Examples
Affects Versions: 0.7.3
Reporter: Konstantin Boudnik

 It looks like Wikipedia doesn't publish the page info in WEX format anymore, 
 but instead does page dumps in XML format.
 Because of that, the example fails with an IndexOutOfBoundsException, since it 
 expects tab-separated input strings.






[jira] [Resolved] (SPARK-987) Cannot start workers successfully with hadoop 2.2.0

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-987.
-
Resolution: Not a Problem

This just looks like a classic version mismatch between client and server. The 
app perhaps has embedded Hadoop libs instead of using 'provided' libs from the 
server installation.

 Cannot start workers successfully with hadoop 2.2.0
 ---

 Key: SPARK-987
 URL: https://issues.apache.org/jira/browse/SPARK-987
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.8.0
Reporter: 刘旭
Priority: Minor
  Labels: 2.2.0, hadoop

 Cannot start workers successfully with hadoop 2.2.0.
 I build with:
 $ make-distribution.sh -hadoop 2.2.0
 P.S.
 It works well with hadoop 2.0.5-alpha,
 but cannot connect to hadoop 2.2.0 successfully, failing with this exception:
 scala> var lines = 
 sc.textFile("hdfs://localhost:9000/user/hadoop/hadoop/hadoop-hadoop-jobtracker-master.log")
 lines: org.apache.spark.rdd.RDD[String] = MappedRDD[3] at textFile at 
 <console>:12
 scala> lines.count
 java.io.IOException: Failed on local exception: 
 com.google.protobuf.InvalidProtocolBufferException: Message missing required 
 fields: callId, status; Host Details : local host is: "master/192.168.3.103"; 
 destination host is: "localhost":9000; 






[jira] [Resolved] (SPARK-819) netty: ChannelInboundByteHandlerAdapter no longer exist in 4.0.3.Final

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-819.
-
Resolution: Fixed

This must have been fixed along the way, as the class 
ChannelInboundByteHandlerAdapter  is not used in the code now, nor is that 
version of Netty.

 netty: ChannelInboundByteHandlerAdapter no longer exist in 4.0.3.Final
 --

 Key: SPARK-819
 URL: https://issues.apache.org/jira/browse/SPARK-819
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 0.8.0
Reporter: Thomas Graves

 It appears the netty shuffle code uses netty version 4.0.0.Beta2, which by 
 the tag was in beta. They now have 4.0.2.Final, which doesn't include the 
 ChannelInboundByteHandlerAdapter API that is used by the FileClientHandler. 
 We should move to a stable API. It looks like it was replaced with 
 ChannelInboundHandlerAdapter.
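
As a rough sketch (assuming Netty 4.0.x final APIs; this is not Spark's actual
handler, and the class name is hypothetical), a handler written against the
stable API extends ChannelInboundHandlerAdapter instead:
{code}
import io.netty.buffer.ByteBuf
import io.netty.channel.{ChannelHandlerContext, ChannelInboundHandlerAdapter}

// Minimal handler against the stable Netty 4.0.x API, standing in for the
// old ChannelInboundByteHandlerAdapter-based FileClientHandler.
class FileClientLikeHandler extends ChannelInboundHandlerAdapter {
  override def channelRead(ctx: ChannelHandlerContext, msg: AnyRef): Unit = {
    val buf = msg.asInstanceOf[ByteBuf]
    try {
      println(s"received ${buf.readableBytes()} bytes")
    } finally {
      buf.release() // the handler is responsible for releasing the message
    }
  }
}
{code}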






[jira] [Resolved] (SPARK-905) Not able to run Job on remote machine

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-905.
-
Resolution: Cannot Reproduce

This looks like something that was either long since fixed, or just a matter of 
not having the Spark installation set up on each machine. The 
compute-classpath.sh script does exist in bin/ in the tree and distro.

 Not able to run Job on  remote machine
 --

 Key: SPARK-905
 URL: https://issues.apache.org/jira/browse/SPARK-905
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.7.3
Reporter: Ayush

 I have two machines, A and B. On machine A, I run
  ./run spark.deploy.master.Master 
 to start the Master.
 The Master URL is spark://abc-vostro.local:7077. 
 Now on machine B, I run 
 ./run spark.deploy.worker.Worker spark://abc-vostro.local:7077
 and the worker is registered with the master. 
 Now I want to run a simple job on the cluster. 
 Here is SimpleJob.scala:
 package spark.examples
  
 import spark.SparkContext
 import SparkContext._
  
 object SimpleJob {
   def main(args: Array[String]) {
 val logFile = "s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY/File 
 Name"
 val sc = new SparkContext("spark://abc-vostro.local:7077", "Simple Job",
   System.getenv("SPARK_HOME"), 
 Seq("/home/abc/spark-scala-2.10/examples/target/scala-2.10/spark-examples_2.10-0.8.0-SNAPSHOT.jar"))
 val logData = sc.textFile(logFile)
 val numsa = logData.filter(line => line.contains("a")).count
 val numsb = logData.filter(line => line.contains("b")).count
 println("total a : %s, total b : %s".format(numsa, numsb))
   }
 }  
 This file is located at 
 /home/abc/spark-scala-2.10/examples/src/main/scala/spark/examples on 
 machine A.
 Now on machine A, I run sbt/sbt package.
 When I run
  MASTER=spark://abc-vostro.local:7077 ./run spark.examples.SimpleJob
 to run my job, I get the exception below on both machines A and B:
 (class java.io.IOException: Cannot run program 
 "/home/abc/spark-scala-2.10/bin/compute-classpath.sh" (in directory "."): 
 error=2, No such file or directory)
 Could you please help me resolve this? I am probably missing some 
 configuration on my end. 
 Thanks in advance.






[jira] [Resolved] (SPARK-1074) JavaPairRDD as Object File

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1074.
--
Resolution: Not a Problem

Am I right in thinking that if you want to save a JavaPairRDD to HDFS, you have 
key-value pairs, and so you want to use JavaPairRDD.saveAsNewAPIHadoopFile, and 
SparkContext.sequenceFile to read it? This works. objectFile doesn't seem like 
the right approach anyway.
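
A hedged Scala sketch of that suggestion, shown with the Scala saveAsSequenceFile
shortcut rather than the Java saveAsNewAPIHadoopFile call mentioned above (the
pair RDD contents and path are assumptions, and {{sc}} is a SparkContext):
{code}
import org.apache.spark.SparkContext._  // implicit Writable conversions for pair RDDs

// Assumed example data and output path.
val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))
pairs.saveAsSequenceFile("/tmp/pairs")                     // write key/value pairs as a SequenceFile
val restored = sc.sequenceFile[Int, String]("/tmp/pairs")  // read them back as an RDD[(Int, String)]
{code}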

 JavaPairRDD as Object File
 --

 Key: SPARK-1074
 URL: https://issues.apache.org/jira/browse/SPARK-1074
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Java API
Affects Versions: 0.9.0
Reporter: Kevin Mader
Priority: Minor

 So I can perform a save command on a JavaPairRDD:
 {code:java}
 static public void HSave(JavaPairRDD<D3int, int[]> baseImg, String path) {
   final String outpath = (new File(path)).getAbsolutePath();
   baseImg.saveAsObjectFile(outpath);
 }
 {code}
 When I use the objectFile command from the JavaSparkContext 
 {code:java}
 static public void ReadObjectFile(JavaSparkContext jsc, final String path) {
   JavaPairRDD<D3int, int[]> newImage = (JavaPairRDD<D3int, int[]>) 
 jsc.objectFile(path);
 }
 {code}
 I get an error: cannot cast from JavaRDD to JavaPairRDD. Is there a way to get 
 back to a JavaPairRDD, or will I need to map my data to a JavaRDD, save, load, 
 and then remap the JavaRDD back to a JavaPairRDD?






[jira] [Commented] (SPARK-1227) Diagnostics for ClassificationRegression

2014-11-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203354#comment-14203354
 ] 

Sean Owen commented on SPARK-1227:
--

Is this still relevant now that conventional classifier and regressor metrics 
are implemented in MLlib? You wouldn't be able to compare models by their loss 
function in general anyway.

 Diagnostics for ClassificationRegression
 -

 Key: SPARK-1227
 URL: https://issues.apache.org/jira/browse/SPARK-1227
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Martin Jaggi
Assignee: Martin Jaggi

 Currently, the attained objective function is not computed (for efficiency 
 reasons, as one evaluation requires one full pass through the data).
 For diagnostics and comparing different algorithms, we should however provide 
 this as a separate function (one MR).
 Doing this requires the loss and regularizer functions themselves, not only 
 their gradients (which are currently in the Gradient class). How about adding 
 the new function directly on the corresponding models in classification/* and 
 regression/* ? Any thoughts?
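
For illustration, a rough Scala sketch of computing the attained objective in a
single pass over the data. The squared loss and L2 regularizer below are assumed
examples standing in for the real loss/regularizer functions discussed above, and
are not MLlib code:
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// Assumed example loss and regularizer: squared loss and L2 regularization.
def loss(weights: Vector, p: LabeledPoint): Double = {
  val pred = weights.toArray.zip(p.features.toArray).map { case (w, x) => w * x }.sum
  0.5 * math.pow(pred - p.label, 2)
}
def regularizer(weights: Vector): Double = 0.5 * weights.toArray.map(w => w * w).sum

// One map-reduce pass: accumulate (total loss, count) together, then add the
// regularization term.
def objective(data: RDD[LabeledPoint], weights: Vector, regParam: Double): Double = {
  val (totalLoss, n) = data.map(p => (loss(weights, p), 1L))
                           .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
  totalLoss / n + regParam * regularizer(weights)
}
{code}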






[jira] [Resolved] (SPARK-1196) val variables not available within RDD map on cluster app; are on shell or local

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1196.
--
Resolution: Cannot Reproduce

 val variables not available within RDD map on cluster app; are on shell or 
 local
 

 Key: SPARK-1196
 URL: https://issues.apache.org/jira/browse/SPARK-1196
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Andrew Kerr

 When this code
 {code}
 def foo = "foo"
   val bar = "bar"
   val data = sc.parallelize(Seq("a"))
   data.map{a => print(1,foo,bar);a}.map{a => print(2,foo,bar);a}.map{a => 
 print(3,foo,bar);a}.collect()
 {code}
 is run on a cluster in the spark shell, a slave's stdout is
 {code}
 (1,foo,bar)(2,foo,bar)(3,foo,bar)
 {code}
 as expected.
 However, when the code
 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.SparkContext._
 object twitterAggregation extends App {
   val conf = new SparkConf()
 .setMaster("spark://xx.compute-1.amazonaws.com:7077")
 .setAppName("testCase")
 .setJars(List("target/scala-2.10/spark-test-case_2.10-1.0.jar"))
 .setSparkHome("/root/spark/")
   val sc = new SparkContext(conf)
   def foo = "foo"
   val bar = "bar"
   val data = sc.parallelize(Seq("a"))
   data.map{a => print(1,foo,bar);a}.map{a => print(2,foo,bar);a}.map{a => 
 print(3,foo,bar);a}.collect()
 }
 {code}
 is run against a cluster as an application via sbt, the stdout on a slave is
 {code}
 (1,foo,null)(2,foo,null)(3,foo,null)
 {code}
 The variable declared with val is now null when the anonymous functions in the 
 map are executed.
 When the application is run in local mode, the output is 
 {code}
 (1,foo,bar)(2,foo,bar)(3,foo,bar)
 {code}
 as wanted.
 build.sbt is 
 {code}
 name := "spark-test-case"
 version := "1.0"
 scalaVersion := "2.10.3"
 resolvers ++= Seq("Akka Repository" at "http://repo.akka.io/releases/")
 libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % 
 "0.9.0-incubating")
 {code}
 To avoid firewall and NAT issues, the project directory is rsynced onto the 
 master, where it is built with SBT 0.13.1:
 {code}
 wget 
 http://repo.scala-sbt.org/scalasbt/sbt-native-packages/org/scala-sbt/sbt/0.13.1/sbt.rpm
 rpm --install sbt.rpm
 sbt package && sbt run
 {code}
 Cluster created with scripts in the hadoop2 0.9.0 download.
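
One likely explanation (an assumption, not confirmed in the issue): {{extends App}}
relies on Scala's DelayedInit, so {{bar}} is assigned during delayed initialization
and the closures shipped to executors can capture the field before it has been set.
A sketch of the usual workaround is to define an explicit main method instead of
extending App:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the workaround: vals are plain locals of main(), so they are
// initialized before the closures that reference them are serialized.
object TwitterAggregation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("testCase")
    val sc = new SparkContext(conf)
    val bar = "bar"
    val data = sc.parallelize(Seq("a"))
    data.map { a => print((1, bar)); a }.collect()
    sc.stop()
  }
}
{code}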






[jira] [Commented] (SPARK-1344) Scala API docs for top methods

2014-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203370#comment-14203370
 ] 

Apache Spark commented on SPARK-1344:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3168

 Scala API docs for top methods
 --

 Key: SPARK-1344
 URL: https://issues.apache.org/jira/browse/SPARK-1344
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 0.9.0
Reporter: Diana Carroll

 The RDD.top() methods are documented as follows:
 bq. Returns the top *K* elements from this RDD using the natural ordering for 
 *T*.
 bq. Returns the top *K* elements from this RDD as defined by the specified 
 Comparator[[T]].
 I believe those should read
 bq. Returns the top *num* elements from this RDD using the natural ordering 
 for *K*.
 bq. Returns the top *num* elements from this RDD as defined by the specified 
 Comparator[[K]].
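
For reference, a tiny spark-shell example of the method the wording refers to
(assuming a SparkContext {{sc}}); num is the number of elements returned:
{code}
val rdd = sc.parallelize(Seq(3, 1, 5, 2, 4))
rdd.top(2)                         // Array(5, 4): top num elements by natural ordering
rdd.top(2)(Ordering[Int].reverse)  // Array(1, 2): top num elements under a custom ordering
{code}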






[jira] [Commented] (SPARK-971) Link to Confluence wiki from project website / documentation

2014-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203372#comment-14203372
 ] 

Apache Spark commented on SPARK-971:


User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3169

 Link to Confluence wiki from project website / documentation
 

 Key: SPARK-971
 URL: https://issues.apache.org/jira/browse/SPARK-971
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Josh Rosen

 Spark's Confluence wiki 
 (https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage) is really 
 hard to find; try a Google search for "apache spark wiki", for example.
 We should link to the wiki from the Spark project website and documentation.






[jira] [Closed] (SPARK-1267) Add a pip installer for PySpark

2014-11-08 Thread Prabin Banka (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabin Banka closed SPARK-1267.
---
Resolution: Unresolved

Closing this PR for now. 

 Add a pip installer for PySpark
 ---

 Key: SPARK-1267
 URL: https://issues.apache.org/jira/browse/SPARK-1267
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Prabin Banka
Priority: Minor
  Labels: pyspark

 Please refer to this mail archive,
 http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3CCAOEPXP7jKiw-3M8eh2giBcs8gEkZ1upHpGb=fqoucvscywj...@mail.gmail.com%3E






[jira] [Updated] (SPARK-1509) add zipWithIndex zipWithUniqueId methods to java api

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1509:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

 add zipWithIndex zipWithUniqueId methods to java api
 

 Key: SPARK-1509
 URL: https://issues.apache.org/jira/browse/SPARK-1509
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Reporter: Guoqiang Li
Priority: Minor








[jira] [Resolved] (SPARK-1509) add zipWithIndex zipWithUniqueId methods to java api

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1509.
--
Resolution: Fixed

This was actually fixed in about 1.1

 add zipWithIndex zipWithUniqueId methods to java api
 

 Key: SPARK-1509
 URL: https://issues.apache.org/jira/browse/SPARK-1509
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Reporter: Guoqiang Li
Priority: Minor








[jira] [Resolved] (SPARK-1530) Streaming UI test can hang indefinitely

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1530.
--
Resolution: Fixed

I assume this got fixed along the way, either by [~tdas]'s changes or 
[~shaneknapp]'s changes to Jenkins?

 Streaming UI test can hang indefinitely
 ---

 Key: SPARK-1530
 URL: https://issues.apache.org/jira/browse/SPARK-1530
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Tathagata Das

 This has been causing Jenkins to hang recently:
 {code}
 pool-1-thread-1 prio=10 tid=0x7f4b9449f000 nid=0x6c37 runnable 
 [0x7f4b8a26c000]
java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:152)
 at java.net.SocketInputStream.read(SocketInputStream.java:122)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
 at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
 - locked 0x0007cad700d0 (a java.io.BufferedInputStream)
 at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
 at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
 at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
 - locked 0x0007cad662b8 (a 
 sun.net.www.protocol.http.HttpURLConnection)
 at java.net.URL.openStream(URL.java:1037)
 at scala.io.Source$.fromURL(Source.scala:140)
 at scala.io.Source$.fromURL(Source.scala:130)
 at 
 org.apache.spark.ui.UISuite$$anonfun$2$$anonfun$apply$mcV$sp$2$$anonfun$apply$2.apply$mcV$sp(UISuite.scala:57)
 at 
 org.apache.spark.ui.UISuite$$anonfun$2$$anonfun$apply$mcV$sp$2$$anonfun$apply$2.apply(UISuite.scala:56)
 at 
 org.apache.spark.ui.UISuite$$anonfun$2$$anonfun$apply$mcV$sp$2$$anonfun$apply$2.apply(UISuite.scala:56)
 at 
 org.scalatest.concurrent.Eventually$class.makeAValiantAttempt$1(Eventually.scala:394)
 at 
 org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:408)
 at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:437)
 at 
 org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:477)
 at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
 at 
 org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:477)
 at 
 org.apache.spark.ui.UISuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(UISuite.scala:56)
 at 
 org.apache.spark.ui.UISuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(UISuite.scala:54)
 at 
 org.apache.spark.LocalSparkContext$.withSpark(LocalSparkContext.scala:60)
 at 
 org.apache.spark.ui.UISuite$$anonfun$2.apply$mcV$sp(UISuite.scala:54)
 at org.apache.spark.ui.UISuite$$anonfun$2.apply(UISuite.scala:54)
 at org.apache.spark.ui.UISuite$$anonfun$2.apply(UISuite.scala:54)
 at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265)
 at org.scalatest.Suite$class.withFixture(Suite.scala:1974)
 at org.apache.spark.ui.UISuite.withFixture(UISuite.scala:37)
 at 
 org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262)
 at 
 org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
 at 
 org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
 at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198)
 at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271)
 at org.apache.spark.ui.UISuite.runTest(UISuite.scala:37)
 at 
 org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304)
 at 
 org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304)
 at 
 org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:260)
 at 
 org.scalatest.SuperEngine$$anonfun$org$scalatest$SuperEngine$$runTestsInBranch$1.apply(Engine.scala:249)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:249)
 at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:326)
 at org.scalatest.FunSuite$class.runTests(FunSuite.scala:1304)
 at org.apache.spark.ui.UISuite.runTests(UISuite.scala:37)
 at org.scalatest.Suite$class.run(Suite.scala:2303)
 at 
 org.apache.spark.ui.UISuite.org$scalatest$FunSuite$$super$run(UISuite.scala:37)
 at org.scalatest.FunSuite$$anonfun$run$1.apply(FunSuite.scala:1310)
 at 

[jira] [Resolved] (SPARK-1229) train on array (in addition to RDD)

2014-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1229.
--
Resolution: Not a Problem

If I may be so bold: generally, you train on lots of data, so an RDD makes more 
sense than an Array as training input. That said, you can always parallelize an 
Array as an RDD and train on it, which is most of the use case here. If you 
really mean you need a sample method that returns an RDD, yes that exists as 
you see.
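
A brief sketch of that workflow (assuming MLlib's LogisticRegressionWithSGD as an
example algorithm; the array contents are made up and {{sc}} is a SparkContext):
{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A local Array of training points, parallelized into an RDD before training.
val points = Array(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.5)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -0.5)))
val model = LogisticRegressionWithSGD.train(sc.parallelize(points), 20) // 20 iterations
{code}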

 train on array (in addition to RDD)
 ---

 Key: SPARK-1229
 URL: https://issues.apache.org/jira/browse/SPARK-1229
 Project: Spark
  Issue Type: Story
  Components: MLlib
Reporter: Arshak Navruzyan

 Since the predict method accepts either an RDD or an Array, for consistency 
 train should too (particularly since RDD.takeSample() returns an Array).






[jira] [Commented] (SPARK-1887) Update Contributors Guide with useful data from past threads

2014-11-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203409#comment-14203409
 ] 

Sean Owen commented on SPARK-1887:
--

[~sujeetv] do you plan to work on this? you just need to open a PR, and don't 
need this assigned to you.

 Update Contributors Guide with useful data from past threads
 --

 Key: SPARK-1887
 URL: https://issues.apache.org/jira/browse/SPARK-1887
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
 Environment: Development
Reporter: Sujeet Varakhedi
Assignee: Sujeet Varakhedi
Priority: Minor
  Labels: documentation, newbie
   Original Estimate: 168h
  Remaining Estimate: 168h

 The goal here is to mine through dev email threads and add useful data to 
 the Contributors Page. This will save quite a bit of time for newcomers like 
 me to ramp up.






[jira] [Created] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-08 Thread Varadharajan (JIRA)
Varadharajan created SPARK-4306:
---

 Summary: LogisticRegressionWithLBFGS support for PySpark MLlib 
 Key: SPARK-4306
 URL: https://issues.apache.org/jira/browse/SPARK-4306
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Reporter: Varadharajan


Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
interface. This task is to add support for the LogisticRegressionWithLBFGS 
algorithm, as well as to include examples.
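
For context, a short sketch of the existing Scala API that a PySpark wrapper
would expose (the training data below is an assumed example; the Python-side
API is what this issue would add):
{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Tiny assumed training set; in practice this comes from real data.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.2)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -0.2))))
val model = new LogisticRegressionWithLBFGS().run(training)
println(model.predict(Vectors.dense(0.8, 0.1)))
{code}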






[jira] [Comment Edited] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-08 Thread Varadharajan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203480#comment-14203480
 ] 

Varadharajan edited comment on SPARK-4306 at 11/8/14 4:01 PM:
--

I would like to work on this. Please assign this task to me


was (Author: srinathsmn):
Please assign this task to me.

 LogisticRegressionWithLBFGS support for PySpark MLlib 
 --

 Key: SPARK-4306
 URL: https://issues.apache.org/jira/browse/SPARK-4306
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Reporter: Varadharajan
  Labels: newbie
   Original Estimate: 48h
  Remaining Estimate: 48h

 Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
 interface. This task is to add support for the LogisticRegressionWithLBFGS 
 algorithm, as well as to include examples.






[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-08 Thread Varadharajan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203480#comment-14203480
 ] 

Varadharajan commented on SPARK-4306:
-

Please assign this task to me.

 LogisticRegressionWithLBFGS support for PySpark MLlib 
 --

 Key: SPARK-4306
 URL: https://issues.apache.org/jira/browse/SPARK-4306
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Reporter: Varadharajan
  Labels: newbie
   Original Estimate: 48h
  Remaining Estimate: 48h

 Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
 interface. This task is to add support for the LogisticRegressionWithLBFGS 
 algorithm, as well as to include examples.






[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-08 Thread Varadharajan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varadharajan updated SPARK-4306:

Issue Type: New Feature  (was: Task)

 LogisticRegressionWithLBFGS support for PySpark MLlib 
 --

 Key: SPARK-4306
 URL: https://issues.apache.org/jira/browse/SPARK-4306
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Varadharajan
  Labels: newbie
   Original Estimate: 48h
  Remaining Estimate: 48h

 Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
 interface. This task is to add support for the LogisticRegressionWithLBFGS 
 algorithm, as well as to include examples.






[jira] [Commented] (SPARK-2624) Datanucleus jars not accessible in yarn-cluster mode

2014-11-08 Thread Jim Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203631#comment-14203631
 ] 

Jim Lim commented on SPARK-2624:


Line 223 in {{org/apache/spark/deploy/yarn/ClientBase.scala}} seems like a good 
spot.

{quote}
def prepareLocalResources(...) = \{
val cachedSecondaryJarLinks = ListBuffer.empty\[String\]
...
\}
{quote}

The datanucleus jars that need to be added are: datanucleus-api-jdo-3.2.1.jar, 
datanucleus-core-3.2.2.jar, datanucleus-rdbms-3.2.1.jar
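
A rough Scala sketch of that idea (the names follow the quoted snippet above; the
actual ClientBase code and method signature may differ):
{code}
import scala.collection.mutable.ListBuffer

// Hypothetical: alongside the user's secondary jars, also register the
// datanucleus jars so YARN distributes them to the application containers.
val cachedSecondaryJarLinks = ListBuffer.empty[String]
cachedSecondaryJarLinks ++= Seq(
  "datanucleus-api-jdo-3.2.1.jar",
  "datanucleus-core-3.2.2.jar",
  "datanucleus-rdbms-3.2.1.jar")
{code}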

 Datanucleus jars not accessible in yarn-cluster mode
 

 Key: SPARK-2624
 URL: https://issues.apache.org/jira/browse/SPARK-2624
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.1
Reporter: Andrew Or
Assignee: Jim Lim
 Fix For: 1.2.0


 This is because we add it to the class path of the command that launches 
 spark submit, but the containers never get it.






[jira] [Updated] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()

2014-11-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-4301:
-
Affects Version/s: (was: 1.2.0)
   1.0.1

 StreamingContext should not allow start() to be called after calling stop()
 ---

 Key: SPARK-4301
 URL: https://issues.apache.org/jira/browse/SPARK-4301
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.1.1, 1.2.0


 In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been 
 started is a no-op which has no side-effects.  This allows users to call 
 {{stop()}} on a fresh StreamingContext followed by {{start()}}.  I believe 
 that this almost always indicates an error and is not behavior that we should 
 support.  Since we don't allow {{start() stop() start()}} then I don't think 
 it makes sense to allow {{stop() start()}}.
 The current behavior can lead to resource leaks when StreamingContext 
 constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, 
 then I expect StreamingContext's underlying SparkContext to be stopped 
 irrespective of whether the StreamingContext has been started.  This is 
 useful when writing unit test fixtures.
 Prior discussions:
 - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490
 - https://github.com/apache/spark/pull/3121#issuecomment-61927353
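
A minimal sketch of the sequence in question (the conf and batch interval are
assumptions; with this change, the final start() would fail fast rather than be
silently accepted):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("stop-then-start")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.stop(stopSparkContext = true) // stop() on a never-started context still stops the SparkContext
ssc.start()                       // should now be rejected instead of allowed
{code}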






[jira] [Resolved] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()

2014-11-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-4301.
--
Resolution: Fixed

 StreamingContext should not allow start() to be called after calling stop()
 ---

 Key: SPARK-4301
 URL: https://issues.apache.org/jira/browse/SPARK-4301
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.1.1, 1.2.0


 In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been 
 started is a no-op which has no side-effects.  This allows users to call 
 {{stop()}} on a fresh StreamingContext followed by {{start()}}.  I believe 
 that this almost always indicates an error and is not behavior that we should 
 support.  Since we don't allow {{start() stop() start()}} then I don't think 
 it makes sense to allow {{stop() start()}}.
 The current behavior can lead to resource leaks when StreamingContext 
 constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, 
 then I expect StreamingContext's underlying SparkContext to be stopped 
 irrespective of whether the StreamingContext has been started.  This is 
 useful when writing unit test fixtures.
 Prior discussions:
 - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490
 - https://github.com/apache/spark/pull/3121#issuecomment-61927353






[jira] [Updated] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()

2014-11-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-4301:
-
Fix Version/s: 1.2.0
   1.1.1

 StreamingContext should not allow start() to be called after calling stop()
 ---

 Key: SPARK-4301
 URL: https://issues.apache.org/jira/browse/SPARK-4301
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.1.1, 1.2.0


 In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been 
 started is a no-op which has no side-effects.  This allows users to call 
 {{stop()}} on a fresh StreamingContext followed by {{start()}}.  I believe 
 that this almost always indicates an error and is not behavior that we should 
 support.  Since we don't allow {{start() stop() start()}} then I don't think 
 it makes sense to allow {{stop() start()}}.
 The current behavior can lead to resource leaks when StreamingContext 
 constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, 
 then I expect StreamingContext's underlying SparkContext to be stopped 
 irrespective of whether the StreamingContext has been started.  This is 
 useful when writing unit test fixtures.
 Prior discussions:
 - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490
 - https://github.com/apache/spark/pull/3121#issuecomment-61927353






[jira] [Updated] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()

2014-11-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-4301:
-
Fix Version/s: 1.0.3

 StreamingContext should not allow start() to be called after calling stop()
 ---

 Key: SPARK-4301
 URL: https://issues.apache.org/jira/browse/SPARK-4301
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.1.1, 1.2.0, 1.0.3


 In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been 
 started is a no-op which has no side-effects.  This allows users to call 
 {{stop()}} on a fresh StreamingContext followed by {{start()}}.  I believe 
 that this almost always indicates an error and is not behavior that we should 
 support.  Since we don't allow {{start() stop() start()}} then I don't think 
 it makes sense to allow {{stop() start()}}.
 The current behavior can lead to resource leaks when StreamingContext 
 constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, 
 then I expect StreamingContext's underlying SparkContext to be stopped 
 irrespective of whether the StreamingContext has been started.  This is 
 useful when writing unit test fixtures.
 Prior discussions:
 - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490
 - https://github.com/apache/spark/pull/3121#issuecomment-61927353






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2014-11-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203757#comment-14203757
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

[~nchammas] Thanks for putting this together -- this is looking great! I just 
had a couple of quick questions and clarifications.

1. My preference would be to just have a single AMI across Spark versions, for a 
couple of reasons. First, it reduces steps for every release (even though 
creating AMIs is definitely much simpler now!). Also, the number of AMIs we 
maintain could get large if we do this for every minor and major release like 
1.1.1. [~pwendell] could probably comment more on the release process etc.

2. Could you clarify whether Hadoop is pre-installed in the new AMIs or whether 
it is still installed on startup? The flexibility we currently have of switching 
between Hadoop 1, Hadoop 2, YARN etc. is useful for testing. (Related Packer 
question: are the [init scripts| 
https://github.com/nchammas/spark-ec2/blob/packer/packer/spark-packer.json#L129]
 run during AMI creation or during startup?)

3. Do you have some benchmarks for the new AMI without Spark 1.1.0 
pre-installed? [We right now have old AMI vs. new AMI with 
Spark|https://github.com/nchammas/spark-ec2/blob/packer/packer/proposal.md#new-amis---latest-os-updates-and-spark-110-pre-installed-single-run]
 . I see a couple of huge wins in the new AMI (from SSH wait time, ganglia init 
etc.), which I guess we should get even without Spark being pre-installed.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.






[jira] [Created] (SPARK-4307) Initialize FileDescriptor lazily in FileRegion

2014-11-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-4307:
--

 Summary: Initialize FileDescriptor lazily in FileRegion
 Key: SPARK-4307
 URL: https://issues.apache.org/jira/browse/SPARK-4307
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Reporter: Reynold Xin
Assignee: Reynold Xin


We use Netty's DefaultFileRegion to do zero copy send. However, 
DefaultFileRegion requires a FileDescriptor, which results in a large number of 
opened files in larger workloads.
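
An illustrative Scala sketch of the idea (this is not the actual patch, and
Netty's real FileRegion interface has more methods to implement): open the
underlying file lazily, so queued messages do not each hold an open file
descriptor until their bytes are actually transferred.
{code}
import java.io.{File, FileInputStream}
import java.nio.channels.WritableByteChannel

// Illustration only: the file channel (and its descriptor) is opened on the
// first transfer rather than when the region object is constructed.
class LazilyOpenedFileRegion(file: File, position: Long, count: Long) {
  private lazy val channel = new FileInputStream(file).getChannel

  def transferTo(target: WritableByteChannel, transferred: Long): Long =
    channel.transferTo(position + transferred, count - transferred, target)
}
{code}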






[jira] [Commented] (SPARK-4307) Initialize FileDescriptor lazily in FileRegion

2014-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203778#comment-14203778
 ] 

Apache Spark commented on SPARK-4307:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/3172

 Initialize FileDescriptor lazily in FileRegion
 --

 Key: SPARK-4307
 URL: https://issues.apache.org/jira/browse/SPARK-4307
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Reporter: Reynold Xin
Assignee: Reynold Xin

 We use Netty's DefaultFileRegion to do zero copy send. However, 
 DefaultFileRegion requires a FileDescriptor, which results in a large number 
 of opened files in larger workloads.






[jira] [Commented] (SPARK-2213) Sort Merge Join

2014-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203783#comment-14203783
 ] 

Apache Spark commented on SPARK-2213:
-

User 'Ishiihara' has created a pull request for this issue:
https://github.com/apache/spark/pull/3173

 Sort Merge Join
 ---

 Key: SPARK-2213
 URL: https://issues.apache.org/jira/browse/SPARK-2213
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Assignee: Liquan Pei








[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2014-11-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203786#comment-14203786
 ] 

Nicholas Chammas commented on SPARK-3821:
-

Thanks for the feedback [~shivaram].

{quote}
1. My preference would be to just have a single AMI across Spark versions for a 
couple of reasons. 
{quote}

I agree. Maintaining images for specific versions of Spark is worth it only if 
you're really crazy about getting the lowest cluster launch times possible. 
Well, that was my [original motivation | 
http://apache-spark-developers-list.1001551.n3.nabble.com/EC2-clusters-ready-in-launch-time-30-seconds-td7262.html]
 for doing this work, but ultimately I agree the complexity is not worth it at 
the moment. I'll take this out unless someone wants to advocate for leaving it 
in.

{quote}
2. Could you clarify whether Hadoop is pre-installed in the new AMIs or whether 
it is still installed on startup?
{quote}

Currently, I have it set to install Hadoop 2 on the AMIs with Spark 
pre-installed. Again, this was done with the intention of aiming for the lowest 
launch time possible, but if we'd like to do away with the Spark-pre-installed 
AMIs then this is not an issue.

{quote}
Are the init scripts run during AMI creation or during startup ?
{quote}

For the AMIs with Spark pre-installed, they are run during AMI creation. That's 
why the [init runtimes in the second benchmark | 
https://github.com/nchammas/spark-ec2/blob/214d5e4cac392a0eac21f949fe25c0075044411f/packer/proposal.md#new-amis---latest-os-updates-and-spark-110-pre-installed-single-run]
 are all 0 ms; the init script sees that such and such is already installed and 
just exits.

{quote}
3. Do you have some benchmarks for the new AMI without Spark 1.1.0 
pre-installed ?
{quote}

Nope, but I can run one and get back to you on Monday or Tuesday with those 
numbers.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.


