[jira] [Resolved] (SPARK-5366) check for mode of private key file
[ https://issues.apache.org/jira/browse/SPARK-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5366. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4162 [https://github.com/apache/spark/pull/4162] check for mode of private key file -- Key: SPARK-5366 URL: https://issues.apache.org/jira/browse/SPARK-5366 Project: Spark Issue Type: Improvement Components: EC2 Reporter: liu chang Priority: Minor Fix For: 1.4.0 Check the mode of the private key file. The user should set it to 600. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
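For illustration only: the actual check lives in the Python spark_ec2 script, but a minimal sketch of the kind of permission check described above, written in Scala to match the code samples elsewhere in this digest, could look like the following. The helper name and the exact set of accepted permissions are assumptions, not taken from the pull request.
{code}
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.PosixFilePermission._
import scala.collection.JavaConverters._

// Hypothetical helper: refuse a private key whose permissions are broader
// than owner-only read/write (i.e. anything other than mode 600 or 400).
def checkKeyFileMode(path: String): Unit = {
  val perms = Files.getPosixFilePermissions(Paths.get(path)).asScala
  val tooOpen = perms.exists(p => p != OWNER_READ && p != OWNER_WRITE)
  require(!tooOpen, s"Private key file $path is too permissive; run: chmod 600 $path")
}
{code}
Failing fast with a clear message like this is preferable to letting ssh reject the key later with a less obvious error.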
[jira] [Resolved] (SPARK-5656) NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large n and/or large k
[ https://issues.apache.org/jira/browse/SPARK-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5656. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4433 [https://github.com/apache/spark/pull/4433] NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large n and/or large k -- Key: SPARK-5656 URL: https://issues.apache.org/jira/browse/SPARK-5656 Project: Spark Issue Type: Bug Components: MLlib Reporter: Mark Bittmann Priority: Minor Fix For: 1.4.0 Large values of n or k in EigenValueDecomposition.symmetricEigs will fail with a NegativeArraySizeException. Specifically, this occurs when 2*n*k > Integer.MAX_VALUE. These values are currently unchecked and allow for the array to be initialized to a value greater than Integer.MAX_VALUE. I have written the below 'require' to fail this condition gracefully. I will submit a pull request. require(ncv * n.toLong < Integer.MAX_VALUE, s"Product of 2*k*n must be smaller than Integer.MAX_VALUE. Found required eigenvalues k = $k and matrix dimension n = $n") Here is the exception that occurs from computeSVD with large k and/or n: Exception in thread "main" java.lang.NegativeArraySizeException at org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:85) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:258) at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:190) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
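A self-contained sketch of the guard quoted above, with illustrative values for n and k (and ncv taken as 2*k, capped at n, which is the usual ARPACK choice); with these values the product overflows Int, so the guard fires, which is exactly the graceful failure the issue asks for:
{code}
// Sketch of the overflow guard: ncv * n must fit in an Int because ARPACK
// allocates a working array of that length. Computing the product as a Long
// avoids the Int overflow that would otherwise make it look negative.
val n = 100000                 // matrix dimension (illustrative)
val k = 20000                  // requested eigenvalues (illustrative)
val ncv = math.min(2 * k, n)
require(ncv * n.toLong < Int.MaxValue,
  s"Product of 2*k*n must be smaller than ${Int.MaxValue}. " +
  s"Found required eigenvalues k = $k and matrix dimension n = $n")
{code}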
[jira] [Resolved] (SPARK-5672) Don't return `ERROR 500` when args are missing
[ https://issues.apache.org/jira/browse/SPARK-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5672. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4239 [https://github.com/apache/spark/pull/4239] Don't return `ERROR 500` when args are missing --- Key: SPARK-5672 URL: https://issues.apache.org/jira/browse/SPARK-5672 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kirill A. Korinskiy Fix For: 1.3.0 The Spark web UI returns HTTP ERROR 500 when a GET argument is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
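Not the actual Spark UI code, but a generic sketch of the fix pattern: validate the query parameter and answer with 400 and a helpful message instead of letting an exception bubble up as ERROR 500. The parameter name and handler shape are illustrative.
{code}
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

// Illustrative handler: reply with 400 Bad Request when a required query
// parameter is absent, rather than throwing and letting the servlet
// container answer with ERROR 500.
def render(request: HttpServletRequest, response: HttpServletResponse): Unit = {
  Option(request.getParameter("id")) match {
    case Some(id) =>
      response.getWriter.println(s"Details for $id")
    case None =>
      response.sendError(HttpServletResponse.SC_BAD_REQUEST,
        "Missing required parameter: id")
  }
}
{code}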
[jira] [Updated] (SPARK-5672) Don't return `ERROR 500` when args are missing
[ https://issues.apache.org/jira/browse/SPARK-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5672: - Priority: Minor (was: Major) Assignee: Kirill A. Korinskiy Don't return `ERROR 500` when args are missing --- Key: SPARK-5672 URL: https://issues.apache.org/jira/browse/SPARK-5672 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kirill A. Korinskiy Assignee: Kirill A. Korinskiy Priority: Minor Fix For: 1.3.0 The Spark web UI returns HTTP ERROR 500 when a GET argument is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5673) Implement Streaming wrapper for all linear methods
[ https://issues.apache.org/jira/browse/SPARK-5673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311163#comment-14311163 ] Apache Spark commented on SPARK-5673: - User 'catap' has created a pull request for this issue: https://github.com/apache/spark/pull/4456 Implement Streaming wrapper for all linear methods - Key: SPARK-5673 URL: https://issues.apache.org/jira/browse/SPARK-5673 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Kirill A. Korinskiy Spark currently has streaming wrappers only for logistic regression and linear regression. Implementing wrappers for SVM, Lasso and Ridge Regression as well would make the streaming API more useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5673) Implement Streaming wrapper for all linear methods
[ https://issues.apache.org/jira/browse/SPARK-5673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill A. Korinskiy updated SPARK-5673: --- Component/s: MLlib Implement Streaming wrapper for all linear methods - Key: SPARK-5673 URL: https://issues.apache.org/jira/browse/SPARK-5673 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Kirill A. Korinskiy Spark currently has streaming wrappers only for logistic regression and linear regression. Implementing wrappers for SVM, Lasso and Ridge Regression as well would make the streaming API more useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
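This is not the MLlib implementation (which extends an internal streaming base class); it is a rough sketch of what such a wrapper does, built only from the public batch trainer applied to each micro-batch. The class name, the retrain-from-scratch behaviour and the default prediction for an untrained model are simplifications for illustration.
{code}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

// Sketch of a streaming SVM wrapper: refit on every micro-batch and keep the
// latest model for prediction. MLlib's real streaming wrappers update one
// model incrementally; this simplified version just retrains.
class StreamingSVMSketch(numIterations: Int = 100) extends Serializable {
  @volatile private var model: Option[SVMModel] = None

  def trainOn(data: DStream[LabeledPoint]): Unit = {
    data.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        model = Some(SVMWithSGD.train(rdd, numIterations))
      }
    }
  }

  def predictOn(data: DStream[Vector]): DStream[Double] = {
    data.map(features => model.map(_.predict(features)).getOrElse(0.0))
  }
}
{code}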
[jira] [Commented] (SPARK-3431) Parallelize Scala/Java test execution
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311270#comment-14311270 ] Sean Owen commented on SPARK-3431: -- Haven't tried anything recently, no. Parallelize Scala/Java test execution - Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: SPARK-3431-srowen-attempt.patch Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not include Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311272#comment-14311272 ] Sean Owen commented on SPARK-5625: -- I think you may be running into problems with an older version of zip that can't uncompress a zip file with more than 65535 files. Spark binaries do not include Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-761: Description: Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia? was:I don't know how possible this is, as incompatibilities manifest in many and low-level ways. I don't know how possible this is, as incompatibilities manifest in many and low-level ways. Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5140: - Component/s: Spark Core Two RDDs which are scheduled concurrently should be able to wait on parent in all cases --- Key: SPARK-5140 URL: https://issues.apache.org/jira/browse/SPARK-5140 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Corey J. Nolet Labels: features Not sure if this would change too much of the internals to be included in the 1.2.1 but it would be very helpful if it could be. This ticket is from a discussion between myself and [~ilikerps]. Here's the result of some testing that [~ilikerps] did: bq. I did some testing as well, and it turns out the "wait for other guy to finish caching" logic is on a per-task basis, and it only works on tasks that happen to be executing on the same machine. bq. Once a partition is cached, we will schedule tasks that touch that partition on that executor. The problem here, though, is that the cache is in progress, and so the tasks are still scheduled randomly (or with whatever locality the data source has), so tasks which end up on different machines will not see that the cache is already in progress.
{code}
Here was my test, by the way:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._
val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i }).cache()
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.second)
{code}
bq. Note that I run the future 4 times in parallel. I found that the first run has all tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait for the first two stages to finish. What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend on the same parent so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on the parent. When the parent is done, they will both be able to finish their work concurrently. We are trying to use this pattern by having the parent cache results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5334) NullPointerException when getting files from S3 (hadoop 2.3+)
[ https://issues.apache.org/jira/browse/SPARK-5334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5334: - Component/s: Input/Output Related to, or resolved by, SPARK-5671? NullPointerException when getting files from S3 (hadoop 2.3+) - Key: SPARK-5334 URL: https://issues.apache.org/jira/browse/SPARK-5334 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.2.0 Environment: Spark 1.2 built with Hadoop 2.3+ Reporter: Kevin (Sangwoo) Kim In Spark 1.2 built with Hadoop 2.3+, unable to get files from AWS S3. Same codes works well with same setup in Spark built with Hadoop 2.2-. I saw that jets3t version changed in profile with Hadoop 2.3+, I guess there might be an issue with it. === scala sc.textFile(s3n://logs/log.2014-12-05.gz).count 15/01/20 11:22:40 INFO MemoryStore: ensureFreeSpace(104533) called with curMem=0, maxMem=27783541555 15/01/20 11:22:40 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 102.1 KB, free 25.9 GB) java.lang.NullPointerException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433) at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57) at org.apache.hadoop.fs.Globber.glob(Globber.java:248) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1157) at org.apache.spark.rdd.RDD.count(RDD.scala:904) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625) at 
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Updated] (SPARK-4229) Create hadoop configuration in a consistent way
[ https://issues.apache.org/jira/browse/SPARK-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4229: - Component/s: Spark Core Create hadoop configuration in a consistent way --- Key: SPARK-4229 URL: https://issues.apache.org/jira/browse/SPARK-4229 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Cody Koeninger Priority: Minor Some places use SparkHadoopUtil.get.conf, some create a new hadoop config. Prefer SparkHadoopUtil so that spark.hadoop.* properties are pulled in. http://apache-spark-developers-list.1001551.n3.nabble.com/Hadoop-configuration-for-checkpointing-td9084.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4540) Improve Executor ID Logging
[ https://issues.apache.org/jira/browse/SPARK-4540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4540: - Component/s: Spark Core Improve Executor ID Logging --- Key: SPARK-4540 URL: https://issues.apache.org/jira/browse/SPARK-4540 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Arun Ahuja Priority: Minor A few things that would be useful here: - An executor should log what executor it is running, AFAICT this does not help and only the driver reports that executor 10 is running on xyz.host.com - For YARN, when an executor fails, in addition to reporting the executor ID of the lost executor, report the container ID as well The latter is useful for multiple executors running on the same machine where it may be more useful to find the container directly than the executor ID or host. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4321) Make Kryo serialization work for closures
[ https://issues.apache.org/jira/browse/SPARK-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4321: - Component/s: Spark Core Make Kryo serialization work for closures - Key: SPARK-4321 URL: https://issues.apache.org/jira/browse/SPARK-4321 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jeff Hammerbacher -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4563) Allow spark driver to bind to a different ip than the advertised ip
[ https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4563: - Component/s: Deploy Allow spark driver to bind to a different ip than the advertised ip Key: SPARK-4563 URL: https://issues.apache.org/jira/browse/SPARK-4563 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Long Nguyen Priority: Minor The Spark driver's bind IP and advertised IP are not separately configurable: spark.driver.host only sets the bind IP, and SPARK_PUBLIC_DNS does not work for the driver. Allow an option to set the advertised IP/hostname. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5412) Cannot bind Master to a specific hostname as per the documentation
[ https://issues.apache.org/jira/browse/SPARK-5412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5412: - Component/s: Deploy Cannot bind Master to a specific hostname as per the documentation -- Key: SPARK-5412 URL: https://issues.apache.org/jira/browse/SPARK-5412 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.2.0 Reporter: Alexis Seigneurin Documentation on http://spark.apache.org/docs/latest/spark-standalone.html indicates: {quote} You can start a standalone master server by executing: ./sbin/start-master.sh ... the following configuration options can be passed to the master and worker: ... -h HOST, --host HOST Hostname to listen on {quote} The \-h or --host parameter actually doesn't work with the start-master.sh script. Instead, one has to set the SPARK_MASTER_IP variable prior to executing the script. Either the script or the documentation should be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5360) For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task
[ https://issues.apache.org/jira/browse/SPARK-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5360: - Component/s: Spark Core For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task Key: SPARK-5360 URL: https://issues.apache.org/jira/browse/SPARK-5360 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is where I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to fix this old and central part of the code, so hoping to get input from others here. I did some simple experiments to see how much this affects closure size. For this example: $ val a = sc.parallelize(1 to 10).map((_, 1)) $ val b = sc.parallelize(1 to 2).map(x => (x, 2*x)) $ a.cogroup(b).collect() the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle. For this example: $ val sortedA = a.sortByKey() $ val sortedB = b.sortByKey() $ sortedA.cogroup(sortedB).collect() the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies. The ShuffleHandle includes the ShuffleDependency, so this difference will get larger if a ShuffleDependency includes a serializer, a key ordering, or an aggregator (all set to None by default). However, the difference is not affected by the size of the function the user specifies, which (based on my understanding) is typically the source of large task closures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5423) ExternalAppendOnlyMap won't delete the temp spilled file if an exception occurs while using it
[ https://issues.apache.org/jira/browse/SPARK-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5423: - Component/s: Shuffle ExternalAppendOnlyMap won't delete the temp spilled file if an exception occurs while using it --- Key: SPARK-5423 URL: https://issues.apache.org/jira/browse/SPARK-5423 Project: Spark Issue Type: Improvement Components: Shuffle Reporter: Shixiong Zhu Priority: Minor ExternalAppendOnlyMap won't delete the temp spilled file if an exception occurs while using it. There is already a TODO in the comment:
{code}
// TODO: Ensure this gets called even if the iterator isn't drained.
private def cleanup() {
  batchIndex = batchOffsets.length // Prevent reading any other batch
  val ds = deserializeStream
  deserializeStream = null
  fileStream = null
  ds.close()
  file.delete()
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
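One common way to make such cleanup robust, sketched generically below (not the actual Spark code), is to wrap the spill-file iterator so the cleanup runs on normal exhaustion and can also be invoked explicitly, for example from a task-completion callback, when iteration is abandoned early.
{code}
// Generic sketch: wrap an iterator so a cleanup action (e.g. deleting the
// spill file) runs when the iterator is exhausted, and can also be called
// explicitly if iteration is abandoned partway through.
class CleanupIterator[A](underlying: Iterator[A], cleanup: () => Unit)
  extends Iterator[A] {

  private var cleaned = false

  def close(): Unit = {
    if (!cleaned) {
      cleaned = true
      cleanup()
    }
  }

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) close() // normal exhaustion
    more
  }

  override def next(): A = underlying.next()
}
{code}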
[jira] [Updated] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3132: - Component/s: Spark Core Avoid serialization for Array[Byte] in TorrentBroadcast --- Key: SPARK-3132 URL: https://issues.apache.org/jira/browse/SPARK-3132 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Davies Liu If the input data is a byte array, we should allow TorrentBroadcast to skip serializing and compressing the input. To do this, we should add a new parameter (shortCircuitByteArray) to TorrentBroadcast, and then avoid serialization in if the input is byte array and shortCircuitByteArray is true. We should then also do compression in task serialization itself instead of doing it in TorrentBroadcast. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1866) Closure cleaner does not null shadowed fields when outer scope is referenced
[ https://issues.apache.org/jira/browse/SPARK-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1866: - Component/s: Spark Core Closure cleaner does not null shadowed fields when outer scope is referenced Key: SPARK-1866 URL: https://issues.apache.org/jira/browse/SPARK-1866 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Kan Zhang Priority: Critical Take the following example:
{code}
val x = 5
val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
sc.parallelize(0 until 10).map { _ =>
  val instances = 3
  (instances, x)
}.collect
{code}
This produces a java.io.NotSerializableException: org.apache.hadoop.fs.Path, despite the fact that the outer "instances" is not actually used within the closure. If you change the name of the outer variable "instances" to something else, the code executes correctly, indicating that it is the fact that the two variables share a name that causes the issue. Additionally, if the outer scope is not used (i.e., we do not reference x in the above example), the issue does not appear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3039. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4315 [https://github.com/apache/spark/pull/4315] Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API -- Key: SPARK-3039 URL: https://issues.apache.org/jira/browse/SPARK-3039 Project: Spark Issue Type: Bug Components: Build, Input/Output, Spark Core Affects Versions: 0.9.1, 1.0.0, 1.1.0, 1.2.0 Environment: hadoop2, hadoop-2.4.0, HDP-2.1 Reporter: Bertrand Bossy Assignee: Bertrand Bossy Priority: Critical Fix For: 1.3.0 The spark assembly contains the artifact org.apache.avro:avro-mapred as a dependency of org.spark-project.hive:hive-serde. The avro-mapred package provides a hadoop FileInputFormat to read and write avro files. There are two versions of this package, distinguished by a classifier. avro-mapred for the new Hadoop API uses the classifier "hadoop2". avro-mapred for the old Hadoop API uses no classifier. E.g. when reading avro files using
{code}
sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
{code}
The following error occurs:
{code}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
  at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
  at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
{code}
This error usually is a hint that there was a mix up of the old and the new Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to appear before the version that is bundled with Spark, reading avro files works fine. Also, if Spark is built using avro-mapred for hadoop2, it works fine as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
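As a build-side illustration of the work-around (forcing the hadoop2-classified artifact ahead of the transitively bundled hadoop1 one), an sbt fragment might look like this; the Avro version shown is an assumption, not taken from the issue:
{code}
// build.sbt fragment (illustrative): depend explicitly on the hadoop2 build of
// avro-mapred so it wins over the hadoop1 artifact pulled in transitively.
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"
{code}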
[jira] [Commented] (SPARK-5668) spark_ec2.py region parameter could be either mandatory or its value displayed
[ https://issues.apache.org/jira/browse/SPARK-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311264#comment-14311264 ] Apache Spark commented on SPARK-5668: - User 'MiguelPeralvo' has created a pull request for this issue: https://github.com/apache/spark/pull/4457 spark_ec2.py region parameter could be either mandatory or its value displayed -- Key: SPARK-5668 URL: https://issues.apache.org/jira/browse/SPARK-5668 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Miguel Peralvo Priority: Minor Labels: starter If the region parameter is not specified when invoking spark-ec2 (spark-ec2.py behind the scenes) it defaults to us-east-1. When the cluster doesn't belong to that region, after showing the "Searching for existing cluster Spark..." message, it causes an "ERROR: Could not find any existing cluster" exception because it doesn't find your cluster in the default region. As it doesn't tell you anything about the region, it can be a small headache for new users. In http://stackoverflow.com/questions/21171576/why-does-spark-ec2-fail-with-error-could-not-find-any-existing-cluster, Dmitriy Selivanov explains it. I propose that: 1. Either we make the search message a little bit more informative with something like "Searching for existing cluster Spark in region " + opts.region. 2. Or we remove us-east-1 as the default and make the --region parameter mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4440) Enhance the job progress API to expose more information
[ https://issues.apache.org/jira/browse/SPARK-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4440: - Component/s: Spark Core Enhance the job progress API to expose more information --- Key: SPARK-4440 URL: https://issues.apache.org/jira/browse/SPARK-4440 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Rui Li The progress API introduced in SPARK-2321 provides a new way for user to monitor job progress. However the information exposed in the API is relatively limited. It'll be much more useful if we can enhance the API to expose more data. Some improvement for example may include but not limited to: 1. Stage submission and completion time. 2. Task metrics. The requirement is initially identified for the hive on spark project(HIVE-7292), other application should benefit as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-761: Component/s: Spark Core Description: I don't know how possible this is, as incompatibilities manifest in many and low-level ways. Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia I don't know how possible this is, as incompatibilities manifest in many and low-level ways. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-672) Executor gets stuck in a zombie state after running out of memory
[ https://issues.apache.org/jira/browse/SPARK-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-672: Component/s: Spark Core Executor gets stuck in a zombie state after running out of memory --- Key: SPARK-672 URL: https://issues.apache.org/jira/browse/SPARK-672 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Mikhail Bautin Attachments: executor_jstack.txt, executor_stderr.txt, standalone_worker_jstack.txt As a result of running a workload, an executor ran out of memory, but the executor process stayed up. Also (not sure this is related) the standalone worker process stayed up but disappeared from the master web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-761: Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-672) Executor gets stuck in a zombie state after running out of memory
[ https://issues.apache.org/jira/browse/SPARK-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-672. - Resolution: Duplicate The right-er answer is to fail for lack of memory faster, per SPARK-1989. Executor gets stuck in a zombie state after running out of memory --- Key: SPARK-672 URL: https://issues.apache.org/jira/browse/SPARK-672 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Mikhail Bautin Attachments: executor_jstack.txt, executor_stderr.txt, standalone_worker_jstack.txt As a result of running a workload, an executor ran out of memory, but the executor process stayed up. Also (not sure this is related) the standalone worker process stayed up but disappeared from the master web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-704) ConnectionManager sometimes cannot detect loss of sending connections
[ https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-704: Component/s: Spark Core ConnectionManager sometimes cannot detect loss of sending connections - Key: SPARK-704 URL: https://issues.apache.org/jira/browse/SPARK-704 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Charles Reiss Assignee: Henry Saputra ConnectionManager currently does not detect when SendingConnections disconnect except if it is trying to send through them. As a result, a node failure just after a connection is initiated but before any acknowledgement messages can be sent may result in a hang. ConnectionManager has code intended to detect this case by detecting the failure of a corresponding ReceivingConnection, but this code assumes that the remote host:port of the ReceivingConnection is the same as the ConnectionManagerId, which is almost never true. Additionally, there does not appear to be any reason to assume a corresponding ReceivingConnection will exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5065) Broadcast can still work after sc has been stopped.
[ https://issues.apache.org/jira/browse/SPARK-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5065: - Component/s: Spark Core Priority: Minor (was: Major) Broadcast can still work after sc has been stopped. --- Key: SPARK-5065 URL: https://issues.apache.org/jira/browse/SPARK-5065 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: SaintBacchus Priority: Minor Code as follows:
{code:borderStyle=solid}
val sc1 = new SparkContext
val sc2 = new SparkContext
sc1.stop
sc1.broadcast(1)
{code}
This still works, because sc1.broadcast will reuse the BlockManager in sc2. To fix it, throw a SparkException when the BroadcastManager has been stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
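A minimal sketch of the guard the reporter suggests, with an illustrative class and flag name rather than the actual BroadcastManager internals:
{code}
import org.apache.spark.SparkException

// Illustrative guard: once the manager is stopped, refuse new broadcasts
// instead of silently reusing another context's BlockManager.
class BroadcastManagerSketch {
  @volatile private var stopped = false

  def stop(): Unit = { stopped = true }

  def newBroadcast[T](value: T): T = {
    if (stopped) {
      throw new SparkException("Cannot create a broadcast: BroadcastManager has been stopped")
    }
    value // a real implementation would hand the value to the broadcast factory
  }
}
{code}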
[jira] [Updated] (SPARK-5332) Efficient way to deal with ExecutorLost
[ https://issues.apache.org/jira/browse/SPARK-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5332: - Component/s: Spark Core Priority: Minor (was: Major) Efficient way to deal with ExecutorLost --- Key: SPARK-5332 URL: https://issues.apache.org/jira/browse/SPARK-5332 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Liang-Chi Hsieh Priority: Minor Currently, the handler for the case when an executor is lost in DAGScheduler (handleExecutorLost) is not efficient. This PR adds a bit of extra information to the Stage class to improve that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4895) Support a shared RDD store among different Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-4895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4895. -- Resolution: Duplicate Since I don't see additional work here, and this covers almost exactly the same ground as SPARK-2389, and I don't imagine Spark Core will do anything to share RDDs that Tachyon isn't already providing, I think this should be closed. Support a shared RDD store among different Spark contexts - Key: SPARK-4895 URL: https://issues.apache.org/jira/browse/SPARK-4895 Project: Spark Issue Type: New Feature Reporter: Zane Hu It seems a valid requirement to allow jobs from different Spark contexts to share RDDs. It would be limited if we only allow sharing RDDs within a SparkContext, as in Ooyala (SPARK-818). A more generic way for collaboration among jobs from different Spark contexts is to support a shared RDD store managed by a RDD store master and workers running in separate processes from SparkContext and executor JVMs. This shared RDD store doesn't do any RDD transformations, but accepts requests from jobs of different Spark contexts to read and write shared RDDs in memory or on disks on distributed machines, and manages the life cycle of these RDDs. Tachyon might be used for sharing data in this case. But I think Tachyon is more designed as an in-memory distributed file system for any applications, not only for RDDs and Spark. If people agree, I may draft out a design document for further discussions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4681) Turn on host level blacklisting by default
[ https://issues.apache.org/jira/browse/SPARK-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4681: - Component/s: Scheduler Turn on host level blacklisting by default -- Key: SPARK-4681 URL: https://issues.apache.org/jira/browse/SPARK-4681 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Patrick Wendell Assignee: Davies Liu Per discussion in https://github.com/apache/spark/pull/3541. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5311) EventLoggingListener throws exception if log directory does not exist
[ https://issues.apache.org/jira/browse/SPARK-5311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5311: - Component/s: Spark Core EventLoggingListener throws exception if log directory does not exist - Key: SPARK-5311 URL: https://issues.apache.org/jira/browse/SPARK-5311 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Josh Rosen Priority: Blocker If the log directory does not exist, EventLoggingListener throws an IllegalArgumentException. Here's a simple reproduction (using the master branch (1.3.0)): {code} ./bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/nonexistent-dir {code} where /tmp/nonexistent-dir is a directory that doesn't exist and /tmp exists. This results in the following exception: {code} 15/01/18 17:10:44 INFO HttpServer: Starting HTTP Server 15/01/18 17:10:44 INFO Utils: Successfully started service 'HTTP file server' on port 62729. 15/01/18 17:10:44 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 15/01/18 17:10:44 INFO Utils: Successfully started service 'SparkUI' on port 4041. 15/01/18 17:10:44 INFO SparkUI: Started SparkUI at http://joshs-mbp.att.net:4041 15/01/18 17:10:45 INFO Executor: Using REPL class URI: http://192.168.1.248:62726 15/01/18 17:10:45 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkdri...@joshs-mbp.att.net:62728/user/HeartbeatReceiver 15/01/18 17:10:45 INFO NettyBlockTransferService: Server created on 62730 15/01/18 17:10:45 INFO BlockManagerMaster: Trying to register BlockManager 15/01/18 17:10:45 INFO BlockManagerMasterActor: Registering block manager localhost:62730 with 265.4 MB RAM, BlockManagerId(driver, localhost, 62730) 15/01/18 17:10:45 INFO BlockManagerMaster: Registered BlockManager java.lang.IllegalArgumentException: Log directory /tmp/nonexistent-dir does not exist. 
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:90) at org.apache.spark.SparkContext.init(SparkContext.scala:363) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:123) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:270) at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:147) at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106) at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:60) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:962) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at
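Regarding SPARK-5311 above, a friendlier startup check using the Hadoop FileSystem API could either create the missing directory or fail with a clearer message; whether to create or fail is a design choice, and this sketch is illustrative rather than what Spark ultimately adopted:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative startup check for the event-log directory: create it if it is
// missing rather than failing later with an IllegalArgumentException.
def ensureLogDir(logDir: String, hadoopConf: Configuration): Unit = {
  val path = new Path(logDir)
  val fs = path.getFileSystem(hadoopConf)
  if (!fs.exists(path)) {
    if (!fs.mkdirs(path)) {
      throw new IllegalArgumentException(
        s"Log directory $logDir does not exist and could not be created.")
    }
  } else if (!fs.getFileStatus(path).isDirectory) {
    throw new IllegalArgumentException(s"Log directory $logDir is not a directory.")
  }
}
{code}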
[jira] [Updated] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4783: - Component/s: Spark Core System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running SparkContexts. A typical server is created with the following pseudo code:
var continue = true
while (continue) {
  try {
    server.run()
  } catch (e) {
    continue = log_and_examine_error(e)
  }
}
The problem is that SparkContext frequently calls System.exit when it encounters a problem, which means the server can only be re-spawned at the process level, which is much messier than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in SparkContext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
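A runnable rendering of that pseudo code, assuming hypothetical runServer() and logAndExamineError() helpers; the key point is that this loop can only recover if failures surface as exceptions rather than System.exit() calls inside SparkContext:
{code}
object GatewaySketch {
  // Hypothetical stand-ins for the application's long-running Spark gateway.
  def runServer(): Unit = { /* create a SparkContext, serve jobs ... */ }
  def logAndExamineError(e: Throwable): Boolean = { e.printStackTrace(); true }

  def main(args: Array[String]): Unit = {
    var keepRunning = true
    while (keepRunning) {
      try {
        runServer()
        keepRunning = false // clean shutdown
      } catch {
        case e: Throwable =>
          // Only reachable if SparkContext signals fatal errors by throwing
          // instead of calling System.exit(), which is what the issue asks for.
          keepRunning = logAndExamineError(e)
      }
    }
  }
}
{code}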
[jira] [Updated] (SPARK-4723) Abort stages which have been attempted too many times
[ https://issues.apache.org/jira/browse/SPARK-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4723: - Component/s: Scheduler Abort stages which have been attempted too many times --- Key: SPARK-4723 URL: https://issues.apache.org/jira/browse/SPARK-4723 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: YanTang Zhai Priority: Minor For some reason, some stages may be attempted many times. A threshold could be added, and stages which have been attempted more times than the threshold could be aborted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1346) Backport SPARK-1210 into 0.9 branch
[ https://issues.apache.org/jira/browse/SPARK-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1346: - Component/s: Spark Core Backport SPARK-1210 into 0.9 branch --- Key: SPARK-1346 URL: https://issues.apache.org/jira/browse/SPARK-1346 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Tathagata Das Labels: backport-needed We should backport this after the 0.9.1 release happens. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2958) FileClientHandler should not be shared in the pipeline
[ https://issues.apache.org/jira/browse/SPARK-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2958: - Component/s: Spark Core FileClientHandler should not be shared in the pipeline -- Key: SPARK-2958 URL: https://issues.apache.org/jira/browse/SPARK-2958 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Netty module creates a single FileClientHandler and shares it in all threads. We should create a new one for each pipeline thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-839) Bug in how failed executors are removed by ID from standalone cluster
[ https://issues.apache.org/jira/browse/SPARK-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-839: Component/s: Spark Core Bug in how failed executors are removed by ID from standalone cluster - Key: SPARK-839 URL: https://issues.apache.org/jira/browse/SPARK-839 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.0, 0.7.3 Reporter: Mark Hamstra Priority: Critical ClearStory data reported the following issue, where some hashmaps are indexed by executorId and some by appId/executorId, and we use the wrong string to search for an executor: https://github.com/clearstorydata/spark/pull/9. This affects FT on the standalone mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3965) Spark assembly for hadoop2 contains avro-mapred for hadoop1
[ https://issues.apache.org/jira/browse/SPARK-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3965. -- Resolution: Duplicate Spark assembly for hadoop2 contains avro-mapred for hadoop1 --- Key: SPARK-3965 URL: https://issues.apache.org/jira/browse/SPARK-3965 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.0.2, 1.1.0, 1.2.0 Environment: hadoop2, HDP2.1 Reporter: David Jacot When building the Spark assembly for hadoop2, org.apache.avro:avro-mapred for hadoop1 is picked up and added to the assembly, which leads to the following exception at runtime.
{code}
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
  at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
  at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
  ...
{code}
The patch for SPARK-3039 works well at compile time, but the artifact's classifier is not applied when the assembly is built. I'm not a Maven expert, but I don't think that classifiers are applied to transitive dependencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5334) NullPointerException when getting files from S3 (hadoop 2.3+)
[ https://issues.apache.org/jira/browse/SPARK-5334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311324#comment-14311324 ] Kevin (Sangwoo) Kim commented on SPARK-5334: [~srowen] Oh thanks! I'll test it. NullPointerException when getting files from S3 (hadoop 2.3+) - Key: SPARK-5334 URL: https://issues.apache.org/jira/browse/SPARK-5334 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.2.0 Environment: Spark 1.2 built with Hadoop 2.3+ Reporter: Kevin (Sangwoo) Kim In Spark 1.2 built with Hadoop 2.3+, unable to get files from AWS S3. Same codes works well with same setup in Spark built with Hadoop 2.2-. I saw that jets3t version changed in profile with Hadoop 2.3+, I guess there might be an issue with it. === scala sc.textFile(s3n://logs/log.2014-12-05.gz).count 15/01/20 11:22:40 INFO MemoryStore: ensureFreeSpace(104533) called with curMem=0, maxMem=27783541555 15/01/20 11:22:40 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 102.1 KB, free 25.9 GB) java.lang.NullPointerException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433) at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57) at org.apache.hadoop.fs.Globber.glob(Globber.java:248) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1157) at org.apache.spark.rdd.RDD.count(RDD.scala:904) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625) 
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Updated] (SPARK-5225) Support coalesced Input Metrics from different sources
[ https://issues.apache.org/jira/browse/SPARK-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5225: - Component/s: Spark Core Support coalesced Input Metrics from different sources - Key: SPARK-5225 URL: https://issues.apache.org/jira/browse/SPARK-5225 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Kostas Sakellis Currently, if a task reads data from more than one block and the blocks are read through different read methods, we ignore the bytes from the second read method. For example:
{noformat}
   CoalescedRDD
        |
      Task1
     /  |  \
hadoop hadoop cached
{noformat}
If Task1 starts reading from the hadoop blocks first, then the input metrics for Task1 will only contain the input metrics from the hadoop blocks and ignore the input metrics from the cached blocks. We need to change the way we collect input metrics so that it is not a single value but rather a collection of input metrics for a task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
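To make the SPARK-5225 proposal concrete, here is a minimal sketch of what per-read-method input metrics could look like; ReadMethod and TaskInputMetrics are hypothetical names, not the actual Spark metrics API.
{code}
// Sketch only: per-read-method byte counts instead of a single value per task.
object ReadMethod extends Enumeration {
  val Hadoop, Memory, Disk, Network = Value
}

class TaskInputMetrics {
  private val bytesByMethod = scala.collection.mutable.Map.empty[ReadMethod.Value, Long]

  // Called for every block the task reads, whichever source it came from.
  def recordRead(method: ReadMethod.Value, bytes: Long): Unit =
    bytesByMethod(method) = bytesByMethod.getOrElse(method, 0L) + bytes

  // The task's total input is the sum over all read methods.
  def totalBytesRead: Long = bytesByMethod.values.sum

  def byMethod: Map[ReadMethod.Value, Long] = bytesByMethod.toMap
}
{code}
With something like this, a task that reads two hadoop blocks and one cached block would report bytes under both read methods instead of dropping the second source.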
[jira] [Updated] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage
[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2666: - Component/s: Spark Core when task is FetchFailed cancel running tasks of failedStage Key: SPARK-2666 URL: https://issues.apache.org/jira/browse/SPARK-2666 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Lianhui Wang In DAGScheduler's handleTaskCompletion, when the failure reason of a task is FetchFailed, cancel the running tasks of the failed stage before adding it to the failedStages queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
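A simplified sketch of the ordering SPARK-2666 asks for, using stand-in types rather than the real DAGScheduler: on a FetchFailed completion, the stage's still-running tasks are cancelled before the stage is queued for resubmission.
{code}
// Simplified stand-in types; the point is the ordering: cancel before queueing.
case class Stage(id: Int)

sealed trait TaskEndReason
case object Success extends TaskEndReason
case class FetchFailed(failedStage: Stage) extends TaskEndReason

class MiniScheduler(cancelRunningTasks: Stage => Unit) {
  val failedStages = scala.collection.mutable.Queue.empty[Stage]

  def handleTaskCompletion(reason: TaskEndReason): Unit = reason match {
    case FetchFailed(failedStage) =>
      // Cancel the stage's still-running tasks first so they stop consuming
      // executors while the stage waits to be resubmitted.
      cancelRunningTasks(failedStage)
      if (!failedStages.contains(failedStage)) failedStages.enqueue(failedStage)
    case Success =>
      () // nothing to do in this sketch
  }
}
{code}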
[jira] [Updated] (SPARK-4087) Only use broadcast for large tasks
[ https://issues.apache.org/jira/browse/SPARK-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4087: - Component/s: Spark Core Only use broadcast for large tasks -- Key: SPARK-4087 URL: https://issues.apache.org/jira/browse/SPARK-4087 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu Priority: Critical After we switched to broadcasting every task, some regressions were introduced because broadcast is not stable enough. So we would like to use broadcast only for large tasks, which keeps the same behaviour as 1.0 for most cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
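A hedged sketch of the size-threshold decision SPARK-4087 describes; the threshold value and the surrounding names are assumptions, not Spark's actual task-launch internals.
{code}
// Sketch of a size-threshold check; names and the 100 KB default are assumptions.
object TaskLauncher {
  val taskBroadcastThreshold: Long = 100L * 1024 // e.g. 100 KB, an assumed default

  sealed trait TaskPayload
  case class DirectPayload(bytes: Array[Byte]) extends TaskPayload
  case class BroadcastPayload(bytes: Array[Byte]) extends TaskPayload // stand-in for a Broadcast handle

  def prepare(serializedTask: Array[Byte]): TaskPayload =
    if (serializedTask.length >= taskBroadcastThreshold)
      BroadcastPayload(serializedTask) // large task: worth the broadcast overhead
    else
      DirectPayload(serializedTask)    // small task: ship directly, as in Spark 1.0
}
{code}
Small serialized tasks are shipped directly as in 1.0, and only tasks above the threshold pay the broadcast path.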
[jira] [Updated] (SPARK-4059) spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST
[ https://issues.apache.org/jira/browse/SPARK-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4059: - Priority: Minor (was: Major) It's not clear what this means. spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST -- Key: SPARK-4059 URL: https://issues.apache.org/jira/browse/SPARK-4059 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Guo Ruijing Priority: Minor spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST Existing implementation: spark-master uses SPARK_MASTER_IP and spark-worker uses STANDALONE_SPARK_MASTER_HOST. Proposed implementation: spark-master/spark-worker may both use SPARK_MASTER_IP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4059) spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST
[ https://issues.apache.org/jira/browse/SPARK-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4059: - Component/s: Deploy spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST -- Key: SPARK-4059 URL: https://issues.apache.org/jira/browse/SPARK-4059 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Guo Ruijing Priority: Minor spark-master/spark-worker may use SPARK_MASTER_IP/STANDALONE_SPARK_MASTER_HOST Existing implementation: spark-master uses SPARK_MASTER_IP and spark-worker uses STANDALONE_SPARK_MASTER_HOST. Proposed implementation: spark-master/spark-worker may both use SPARK_MASTER_IP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4539) History Server counts incomplete applications against the retainedApplications total, fails to show eligible completed applications
[ https://issues.apache.org/jira/browse/SPARK-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4539: - Component/s: Spark Core History Server counts incomplete applications against the retainedApplications total, fails to show eligible completed applications - Key: SPARK-4539 URL: https://issues.apache.org/jira/browse/SPARK-4539 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams I have observed the history server return 0 or 1 applications from a directory that contains many complete and incomplete applications (the latter being application directories that are missing the {{APPLICATION_COMPLETE}} file). Without having dug too much, my theory is that HistoryServer is seeing the incomplete directories and counting them against the {{retainedApplications}} maximum but not displaying them. One supporting anecdote for this is that I loaded HS against a directory that had one complete application and nothing else, and HS worked as expected (I saw the one application in the web UI). I then copied ~100 other application directories in, the majority of which were incomplete (in particular, most of the ones that had the earliest timestamps), and still only saw the one original completed application via the web UI. Finally, I restarted the same server with the {{retainedApplications}} set to 1000 (instead of 50; the directory at this point had ~10 completed applications and 90 incomplete ones), and saw all/exactly the completed applications, leading me to believe that they were being boxed out of the maximum-50-retained-applications iteration of the history server. Silently failing on incomplete directories while still docking the count, if that is indeed what is happening, is a pretty confusing failure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
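If the theory in SPARK-4539 is right, the fix amounts to filtering incomplete application directories out before applying the retainedApplications cap. A minimal sketch, with a hypothetical AppInfo type rather than the real HistoryServer code:
{code}
// Hypothetical AppInfo, not the real HistoryServer types. Incomplete applications
// are dropped before the retainedApplications limit is applied, so they no longer
// "dock the count" for completed ones.
case class AppInfo(id: String, completed: Boolean, endTime: Long)

def selectRetained(scanned: Seq[AppInfo], retainedApplications: Int): Seq[AppInfo] =
  scanned
    .filter(_.completed)        // ignore directories missing APPLICATION_COMPLETE
    .sortBy(-_.endTime)         // keep the most recently finished applications
    .take(retainedApplications)
{code}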
[jira] [Updated] (SPARK-4560) Lambda deserialization error
[ https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4560: - Component/s: Spark Core Lambda deserialization error Key: SPARK-4560 URL: https://issues.apache.org/jira/browse/SPARK-4560 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.1.1 Environment: Java 8.0.25 Reporter: Alexis Seigneurin Attachments: IndexTweets.java, pom.xml I'm getting an error saying a lambda could not be deserialized. Here is the code:
{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
    .map(t -> t.getText())
    .foreachRDD(tweets -> {
        tweets.foreach(x -> System.out.println(x));
        return null;
    });
{code}
Here is the exception: {noformat} java.io.IOException: unexpected exception type at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104) ... 27 more Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization at com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1) ... 37 more {noformat} The weird thing is, if I write the following code (the map operation is inside the foreachRDD), it works without problem.
{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
    .foreachRDD(tweets -> {
        tweets.map(t -> t.getText())
              .foreach(x -> System.out.println(x));
        return null;
    });
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5594: - Component/s: Spark Core SparkException: Failed to get broadcast (TorrentBroadcast) -- Key: SPARK-5594 URL: https://issues.apache.org/jira/browse/SPARK-5594 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: John Sandiford Priority: Critical I am uncertain whether this is a bug, however I am getting the error below when running on a cluster (works locally), and have no idea what is causing it, or where to look for more information. Any help is appreciated. Others appear to experience the same issue, but I have not found any solutions online. Please note that this only happens with certain code and is repeatable, all my other spark jobs work fine. ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: Lost task 3.3 in stage 6.0 (TID 24, removed): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6 at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008) ... 
11 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
[jira] [Updated] (SPARK-2319) Number of tasks on executors become negative after executor failures
[ https://issues.apache.org/jira/browse/SPARK-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2319: - Component/s: Web UI Number of tasks on executors become negative after executor failures Key: SPARK-2319 URL: https://issues.apache.org/jira/browse/SPARK-2319 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Andrew Or Attachments: num active tasks become negative (-16).jpg See attached screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2520) the executor is thrown java.io.StreamCorruptedException
[ https://issues.apache.org/jira/browse/SPARK-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2520: - Component/s: Shuffle the executor is thrown java.io.StreamCorruptedException --- Key: SPARK-2520 URL: https://issues.apache.org/jira/browse/SPARK-2520 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.0.0 Reporter: Guoqiang Li Priority: Critical This issue occurs with a very small probability. I can not reproduce it. The executor log: {code} 14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:34429 14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:31934 14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:30557 14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:42606 14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:37314 14/07/15 21:54:50 INFO scheduler.TaskSetManager: Starting task 0.0:166 as TID 4948 on executor 20: tuan221 (PROCESS_LOCAL) 14/07/15 21:54:50 INFO scheduler.TaskSetManager: Serialized task 0.0:166 as 3129 bytes in 1 ms 14/07/15 21:54:50 WARN scheduler.TaskSetManager: Lost TID 4868 (task 0.0:86) 14/07/15 21:54:50 WARN scheduler.TaskSetManager: Loss was due to java.io.StreamCorruptedException java.io.StreamCorruptedException: invalid type code: AC at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:87) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:101) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:100) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/07/15 21:54:50 INFO scheduler.TaskSetManager: Starting task 0.0:86 as TID 4949 on executor 20: tuan221 (PROCESS_LOCAL) 14/07/15 21:54:50 INFO 
scheduler.TaskSetManager: Serialized task 0.0:86 as 3129 bytes in 0 ms 14/07/15 21:54:50 WARN scheduler.TaskSetManager: Lost TID 4785 (task 0.0:3) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5377) Dynamically add jar into Spark Driver's classpath.
[ https://issues.apache.org/jira/browse/SPARK-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5377: - Component/s: Spark Core Dynamically add jar into Spark Driver's classpath. -- Key: SPARK-5377 URL: https://issues.apache.org/jira/browse/SPARK-5377 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Chengxiang Li Spark supports dynamically adding a jar to the executor classpath through SparkContext::addJar(), but it does not support dynamically adding a jar to the driver classpath. In most cases (if not all), a user dynamically adds a jar with SparkContext::addJar() because some classes from the jar will be referenced in an upcoming Spark job, which means the classes need to be loaded on the Spark driver side as well, e.g. during serialization. I think it makes sense to add an API that adds a jar to the driver classpath, or to just make this part of SparkContext::addJar(). HIVE-9410 is a real case from Hive on Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
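For illustration of what "adding a jar to the driver classpath" means at the JVM level, here is a sketch using a plain URLClassLoader; this is not Spark's API, just the mechanism SPARK-5377 asks Spark to expose.
{code}
import java.net.{URL, URLClassLoader}

// Illustration of the JVM-level mechanism only, not Spark's API: wrap the current
// context class loader in a URLClassLoader that also knows about the new jar, so
// classes from it become loadable in this (driver) process.
def addJarToDriverClasspath(jarPath: String): Unit = {
  val current = Thread.currentThread().getContextClassLoader
  val withJar = new URLClassLoader(Array(new URL("file:" + jarPath)), current)
  Thread.currentThread().setContextClassLoader(withJar)
}
{code}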
[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
[ https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5113: - Component/s: Spark Core Audit and document use of hostnames and IP addresses in Spark - Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. The best outcome would be to have three configs that can be set on each machine: {code} SPARK_LOCAL_IP # Ip address we bind to for all services SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within the cluster SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the cluster (e.g. the UI) {code} It's not clear how easily we can support that scheme while providing backwards compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - it's just an alias for what is now SPARK_PUBLIC_DNS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
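The lookup behaviour described in SPARK-5113 (pick the first non-loopback interface, then reverse-resolve a hostname) can be sketched with plain java.net calls; this is an illustration only, not Spark's actual Utils code.
{code}
import java.net.{InetAddress, NetworkInterface}
import scala.jdk.CollectionConverters._

// Plain java.net sketch of the lookup described above, not Spark's Utils code:
// take the first non-loopback address on any interface, then use its reverse-DNS
// name as the hostname advertised to other processes.
def findLocalHostname(): String = {
  val candidates = for {
    iface <- NetworkInterface.getNetworkInterfaces.asScala
    addr  <- iface.getInetAddresses.asScala
    if !addr.isLoopbackAddress
  } yield addr
  val chosen = if (candidates.hasNext) candidates.next() else InetAddress.getLocalHost
  chosen.getCanonicalHostName // reverse DNS lookup for the advertised name
}
{code}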
[jira] [Updated] (SPARK-2913) Spark's log4j.properties should always appear ahead of Hadoop's on classpath
[ https://issues.apache.org/jira/browse/SPARK-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2913: - Component/s: Deploy Spark's log4j.properties should always appear ahead of Hadoop's on classpath Key: SPARK-2913 URL: https://issues.apache.org/jira/browse/SPARK-2913 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.2, 1.1.0 Reporter: Josh Rosen Assignee: Josh Rosen In the current {{compute-classpath}} scripts, the Hadoop conf directory may appear before Spark's conf directory in the computed classpath. This leads to Hadoop's log4j.properties being used instead of Spark's, preventing users from easily changing Spark's logging settings. To fix this, we should add a new classpath entry for Spark's log4j.properties file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4351) Record cacheable RDD reads and display RDD miss rates
[ https://issues.apache.org/jira/browse/SPARK-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4351: - Component/s: Spark Core Record cacheable RDD reads and display RDD miss rates - Key: SPARK-4351 URL: https://issues.apache.org/jira/browse/SPARK-4351 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Charles Reiss Priority: Minor Currently, when Spark fails to keep an RDD cached, there is little visibility to the user (beyond performance effects), especially if the user is not reading executor logs. We could expose this information to the Web UI and the event log like we do for RDD storage information by reporting RDD reads and their results with task metrics. From this, live computation of RDD miss rates is straightforward, and information in the event log would enable more complicated post-hoc analyses. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
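A sketch of the reporting SPARK-4351 proposes, with hypothetical names rather than Spark's task-metrics API: each cacheable read records whether the block was a cache hit, and a miss rate is computed from those events.
{code}
// Hypothetical event shape, not Spark's task-metrics API: one record per cacheable
// RDD read, noting whether the block was found in the cache.
case class RddReadEvent(rddId: Int, partition: Int, cacheHit: Boolean)

// Miss rate for one RDD, computable live in the UI or post hoc from the event log.
def missRate(events: Seq[RddReadEvent], rddId: Int): Double = {
  val reads = events.filter(_.rddId == rddId)
  if (reads.isEmpty) 0.0
  else reads.count(!_.cacheHit).toDouble / reads.size
}
{code}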
[jira] [Updated] (SPARK-4605) Proposed Contribution: Spark Kernel to enable interactive Spark applications
[ https://issues.apache.org/jira/browse/SPARK-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4605: - Component/s: Project Infra Proposed Contribution: Spark Kernel to enable interactive Spark applications Key: SPARK-4605 URL: https://issues.apache.org/jira/browse/SPARK-4605 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Chip Senkbeil Attachments: Kernel Architecture Widescreen.pdf, Kernel Architecture.pdf Project available on Github: https://github.com/ibm-et/spark-kernel This architecture is describing running kernel code that was demonstrated at the StrataConf in Barcelona, Spain. Enables applications to interact with a Spark cluster using Scala in several ways: * Defining and running core Spark Tasks * Collecting results from a cluster without needing to write to external data store ** Ability to stream results using well-defined protocol * Arbitrary Scala code definition and execution (without submitting heavy-weight jars) Applications can be hosted and managed separate from the Spark cluster using the kernel as a proxy to communicate requests. The Spark Kernel implements the server side of the IPython Kernel protocol, the rising “de-facto” protocol for language (Python, Haskell, etc.) execution. Inherits a suite of industry adopted clients such as the IPython Notebook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
[ https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5581: - Component/s: Shuffle When writing sorted map output file, avoid open / close between each partition -- Key: SPARK-5581 URL: https://issues.apache.org/jira/browse/SPARK-5581 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.3.0 Reporter: Sandy Ryza
{code}
// Bypassing merge-sort; get an iterator by partition and just write everything directly.
for ((id, elements) <- this.partitionedIterator) {
  if (elements.hasNext) {
    val writer = blockManager.getDiskWriter(
      blockId, outputFile, ser, fileBufferSize, context.taskMetrics.shuffleWriteMetrics.get)
    for (elem <- elements) {
      writer.write(elem)
    }
    writer.commitAndClose()
    val segment = writer.fileSegment()
    lengths(id) = segment.length
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
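One possible shape for the improvement in the SPARK-5581 title, assuming a hypothetical SegmentWriter rather than Spark's real disk writer: keep a single writer open across partitions and derive each partition's segment length from byte offsets, instead of an open / commitAndClose cycle per partition.
{code}
// Hypothetical SegmentWriter, not Spark's BlockObjectWriter: one writer stays open
// for the whole output file, and per-partition segment lengths come from byte offsets.
trait SegmentWriter {
  def write(elem: Any): Unit
  def bytesWritten: Long
  def close(): Unit
}

def writeAllPartitions(
    partitionedIterator: Iterator[(Int, Iterator[Any])],
    writer: SegmentWriter,
    lengths: Array[Long]): Unit = {
  try {
    for ((id, elements) <- partitionedIterator) {
      val start = writer.bytesWritten
      elements.foreach(writer.write)
      lengths(id) = writer.bytesWritten - start // segment length for this partition
    }
  } finally {
    writer.close() // a single close instead of one commitAndClose per partition
  }
}
{code}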
[jira] [Updated] (SPARK-5607) NullPointerException in objenesis
[ https://issues.apache.org/jira/browse/SPARK-5607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5607: - Component/s: Tests NullPointerException in objenesis - Key: SPARK-5607 URL: https://issues.apache.org/jira/browse/SPARK-5607 Project: Spark Issue Type: Bug Components: Tests Reporter: Reynold Xin Assignee: Patrick Wendell Fix For: 1.3.0 Tests are sometimes failing with the following exception. The problem might be that Kryo is using a different version of objenesis from Mockito. {code} [info] - Process succeeds instantly *** FAILED *** (107 milliseconds) [info] java.lang.NullPointerException: [info] at org.objenesis.strategy.StdInstantiatorStrategy.newInstantiatorOf(StdInstantiatorStrategy.java:52) [info] at org.objenesis.ObjenesisBase.getInstantiatorOf(ObjenesisBase.java:90) [info] at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73) [info] at org.mockito.internal.creation.jmock.ClassImposterizer.createProxy(ClassImposterizer.java:111) [info] at org.mockito.internal.creation.jmock.ClassImposterizer.imposterise(ClassImposterizer.java:51) [info] at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:52) [info] at org.mockito.internal.MockitoCore.mock(MockitoCore.java:41) [info] at org.mockito.Mockito.mock(Mockito.java:1014) [info] at org.mockito.Mockito.mock(Mockito.java:909) [info] at org.apache.spark.deploy.worker.DriverRunnerTest$$anonfun$1.apply$mcV$sp(DriverRunnerTest.scala:50) [info] at org.apache.spark.deploy.worker.DriverRunnerTest$$anonfun$1.apply(DriverRunnerTest.scala:47) [info] at org.apache.spark.deploy.worker.DriverRunnerTest$$anonfun$1.apply(DriverRunnerTest.scala:47) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) [info] at org.scalatest.Suite$class.run(Suite.scala:1424) 
[info] at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:545) [info] at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuite.run(FunSuite.scala:1555) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:294) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:284) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:262) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) {code} More
[jira] [Updated] (SPARK-5654) Integrate SparkR into Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5654: - Component/s: Project Infra Integrate SparkR into Apache Spark -- Key: SPARK-5654 URL: https://issues.apache.org/jira/browse/SPARK-5654 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Shivaram Venkataraman The SparkR project [1] provides a light-weight frontend to launch Spark jobs from R. The project was started at the AMPLab around a year ago and has been incubated as its own project to make sure it can be easily merged into upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s goals are similar to PySpark and shares a similar design pattern as described in our meetup talk[2], Spark Summit presentation[3]. Integrating SparkR into the Apache project will enable R users to use Spark out of the box and given R’s large user base, it will help the Spark project reach more users. Additionally, work in progress features like providing R integration with ML Pipelines and Dataframes can be better achieved by development in a unified code base. SparkR is available under the Apache 2.0 License and does not have any external dependencies other than requiring users to have R and Java installed on their machines. SparkR’s developers come from many organizations including UC Berkeley, Alteryx, Intel and we will support future development, maintenance after the integration. [1] https://github.com/amplab-extras/SparkR-pkg [2] http://files.meetup.com/3138542/SparkR-meetup.pdf [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5319) Choosing partition size instead of count
[ https://issues.apache.org/jira/browse/SPARK-5319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5319: - Component/s: Spark Core Choosing partition size instead of count Key: SPARK-5319 URL: https://issues.apache.org/jira/browse/SPARK-5319 Project: Spark Issue Type: Brainstorming Components: Spark Core Reporter: Idan Zalzberg With the current API, there are multiple places where you can set the partition count when reading from sources. However, in my experience it is sometimes more useful to set the partition size (in MB) and infer the count from that. In my experience Spark is sensitive to the partition size: if partitions are too big, the amount of memory needed per core goes up, and if they are too small, stage times increase significantly. So I'd like to stay in the sweet spot of partition size without changing the partition count around until I find it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
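A sketch of the SPARK-5319 idea under the current API: pick a target partition size, derive a partition count from the total input size, and pass it to an existing call such as SparkContext.textFile(path, minPartitions). The Hadoop FileSystem size lookup is just one assumed way to get the total size; it is not part of the proposal itself.
{code}
import org.apache.spark.SparkContext

// Sketch: derive a partition count from a target partition size and hand it to the
// existing textFile(path, minPartitions) API.
def readWithTargetPartitionSize(sc: SparkContext, path: String, targetPartitionMB: Int) = {
  val fs = org.apache.hadoop.fs.FileSystem.get(sc.hadoopConfiguration)
  val totalBytes = fs.getContentSummary(new org.apache.hadoop.fs.Path(path)).getLength
  val targetBytes = targetPartitionMB.toLong * 1024 * 1024
  val numPartitions = math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)
  sc.textFile(path, numPartitions)
}
{code}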
[jira] [Resolved] (SPARK-5340) Spark startup in local mode should not always create HTTP file server
[ https://issues.apache.org/jira/browse/SPARK-5340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5340. -- Resolution: Won't Fix Per PR discussion, WontFix. Spark startup in local mode should not always create HTTP file server - Key: SPARK-5340 URL: https://issues.apache.org/jira/browse/SPARK-5340 Project: Spark Issue Type: Improvement Reporter: Paul R. Brown In particular, I don't want the HTTP file server. The ui and other components can be disabled via configuration parameters, and the HTTP file server should receive similar treatment (IMHO). Created PR to just never create it in local mode: https://github.com/apache/spark/pull/4125 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5616) Add examples for PySpark API
[ https://issues.apache.org/jira/browse/SPARK-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dongxu updated SPARK-5616: -- Description: PySpark API examples are less than Spark scala API. For example: 1.Broadcast: how to use broadcast operation API 2.Module: how to import a other python file in zip file. Add more examples for freshman who wanna use PySpark. was: PySpark API examples are less than Spark scala API. For example: 1.Boardcast: how to use boardcast operation APi 2.Module: how to import a other python file in zip file. Add more examples for freshman who wanna use PySpark. Add examples for PySpark API Key: SPARK-5616 URL: https://issues.apache.org/jira/browse/SPARK-5616 Project: Spark Issue Type: Improvement Components: PySpark Reporter: dongxu Priority: Minor Labels: examples, pyspark, python PySpark API examples are less than Spark scala API. For example: 1.Broadcast: how to use broadcast operation API 2.Module: how to import a other python file in zip file. Add more examples for freshman who wanna use PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311447#comment-14311447 ] DeepakVohra commented on SPARK-5625: Extracting/opening with WinZip is only to verify the archive is valid. The following indicate that the spark assembly jar is not a valid archive. 1. Even though the assembly jar is in the classpath, a Spark application does not find the classes in the assembly jar. 2. The assembly jar does not get opened/extracted with WinZip which generates the error: http://s763.photobucket.com/user/dvohra10/media/SparkAssembly_zps4319294c.jpg.html?o=0 All indicators suggest the assembly jar is not a valid archive. Adding a Spark core artifact jar to the same directory, the lib directory of Spark binaries, adds the classes from the Spark Core to the classpath. Could it be verified: 1. The assembly jar gets extracted and is a valid archive? 2. Adding the jar in the classpath adds the classes to classpath? Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311447#comment-14311447 ] DeepakVohra edited comment on SPARK-5625 at 2/8/15 5:56 PM: The jar tf does list the Spark classes, which verifies the Binaries include the Spark artifact classes. The issue subject should be modified to: Is the Spark Assembly a Valid Archive? Extracting/opening with WinZip is only to verify the archive is valid. The following indicate that the spark assembly jar is not a valid archive. 1. Even though the assembly jar is in the classpath, a Spark application does not find the classes in the assembly jar. 2. The assembly jar does not get opened/extracted with WinZip which generates the error: http://s763.photobucket.com/user/dvohra10/media/SparkAssembly_zps4319294c.jpg.html?o=0 All indicators suggest the assembly jar is not a valid archive. Adding a Spark core artifact jar to the same directory, the lib directory of Spark binaries, adds the classes from the Spark Core to the classpath. Could it be verified: 1. The assembly jar gets extracted and is a valid archive? 2. Adding the jar in the classpath adds the classes to classpath? was (Author: dvohra): Extracting/opening with WinZip is only to verify the archive is valid. The following indicate that the spark assembly jar is not a valid archive. 1. Even though the assembly jar is in the classpath, a Spark application does not find the classes in the assembly jar. 2. The assembly jar does not get opened/extracted with WinZip which generates the error: http://s763.photobucket.com/user/dvohra10/media/SparkAssembly_zps4319294c.jpg.html?o=0 All indicators suggest the assembly jar is not a valid archive. Adding a Spark core artifact jar to the same directory, the lib directory of Spark binaries, adds the classes from the Spark Core to the classpath. Could it be verified: 1. The assembly jar gets extracted and is a valid archive? 2. Adding the jar in the classpath adds the classes to classpath? Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311475#comment-14311475 ] Patrick Wendell commented on SPARK-761: --- I think the main thing to catch would be Akka. I.e. try connecting different versions and seeing what happens as an exploratory step. For instance, if akka has a standard exception which says you had an incompatible message type, we can wrap that and give an outer exception explaining that the spark version is likely wrong. So maybe we can see if someone wants to explore this a bit as a starter task. Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4687) SparkContext#addFile doesn't keep file folder information
[ https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4687: --- Component/s: Spark Core SparkContext#addFile doesn't keep file folder information - Key: SPARK-4687 URL: https://issues.apache.org/jira/browse/SPARK-4687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Jimmy Xiang Assignee: Sandy Ryza Fix For: 1.3.0, 1.4.0 Files added with SparkContext#addFile are loaded with Utils#fetchFile before a task starts. However, Utils#fetchFile puts all files under the Spark root on the worker node. We should have an option to keep the folder information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5299) Is http://www.apache.org/dist/spark/KEYS out of date?
[ https://issues.apache.org/jira/browse/SPARK-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5299: --- Component/s: (was: Deploy) Build Is http://www.apache.org/dist/spark/KEYS out of date? - Key: SPARK-5299 URL: https://issues.apache.org/jira/browse/SPARK-5299 Project: Spark Issue Type: Question Components: Build Reporter: David Shaw Assignee: Patrick Wendell The keys contained in http://www.apache.org/dist/spark/KEYS do not appear to match the keys used to sign the releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3033) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal
[ https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3033: --- Component/s: (was: Spark Core) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal Key: SPARK-3033 URL: https://issues.apache.org/jira/browse/SPARK-3033 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: pengyanhong run a complex HiveQL via yarn-cluster, got error as below: {quote} 14/08/14 15:05:24 WARN org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to java.lang.ClassCastException java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82) at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-761: -- Labels: starter (was: ) Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-761: -- Description: As a starter task, it would be good to audit the current behavior for different client - server pairs with respect to how exceptions occur. (was: Not sure what component this falls under, or if this is still an issue. Patrick Wendell / Matei Zaharia?) Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter As a starter task, it would be good to audit the current behavior for different client - server pairs with respect to how exceptions occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311485#comment-14311485 ] Andrew Ash commented on SPARK-761: -- Another thing could be a basic check for version number mismatches. E.g. a warning log from both server and client could say: Version mismatch between server (1.2.0) and client (1.1.1); proceeding anyway Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter As a starter task, it would be good to audit the current behavior for different client - server pairs with respect to how exceptions occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311490#comment-14311490 ] Patrick Wendell commented on SPARK-761: --- [~aash] right now we don't explicitly encode the spark version anywhere in the RPC. The best possible thing is to give an explicit version number like you said, but we don't have the plumbing to do that at the moment and IMO that's worth punting until we decide to standardize the RPC format. Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter As a starter task, it would be good to audit the current behavior for different client - server pairs with respect to how exceptions occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
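A sketch of the warning [~aash] suggests for SPARK-761, assuming a future handshake carried a version string on both sides (which, per the comment above, the RPC does not do today); the names here are hypothetical.
{code}
import org.slf4j.LoggerFactory

// Hypothetical helper: compares version strings that a future RPC handshake would
// need to carry on both sides, and logs a warning instead of failing outright.
object VersionCheck {
  private val log = LoggerFactory.getLogger(getClass)

  def warnIfMismatched(serverVersion: String, clientVersion: String): Unit =
    if (serverVersion != clientVersion) {
      log.warn(s"Version mismatch between server ($serverVersion) and client ($clientVersion); " +
        "proceeding anyway")
    }
}
{code}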
[jira] [Commented] (SPARK-3242) Spark 1.0.2 ec2 scripts creates clusters with Spark 1.0.1 installed by default
[ https://issues.apache.org/jira/browse/SPARK-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311328#comment-14311328 ] Apache Spark commented on SPARK-3242: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4458 Spark 1.0.2 ec2 scripts creates clusters with Spark 1.0.1 installed by default -- Key: SPARK-3242 URL: https://issues.apache.org/jira/browse/SPARK-3242 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3033) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal
[ https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3033: - Priority: Major (was: Blocker) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal Key: SPARK-3033 URL: https://issues.apache.org/jira/browse/SPARK-3033 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.2 Reporter: pengyanhong run a complex HiveQL via yarn-cluster, got error as below: {quote} 14/08/14 15:05:24 WARN org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to java.lang.ClassCastException java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82) at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311348#comment-14311348 ] DeepakVohra commented on SPARK-5625: The other jars in the Spark binaries lib directory get opened/extracted except the assembly jar. Could it be verified that the assembly jar gets extracted? And which extraction tool is used? Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4869) The variable names in IF statement of Spark SQL doesn't resolve to its value.
[ https://issues.apache.org/jira/browse/SPARK-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4869: - Component/s: (was: Spark Core) SQL Priority: Major (was: Blocker) The variable names in IF statement of Spark SQL doesn't resolve to its value. -- Key: SPARK-4869 URL: https://issues.apache.org/jira/browse/SPARK-4869 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1 Reporter: Ajay We got stuck with the “IF-THEN” statement in Spark SQL. Per our use case, we need nested “if” statements, but Spark SQL cannot resolve column names in the final evaluation, while literal values work; an Unresolved Attributes error is thrown. Please fix this bug. This works: sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, 1) as ROLL_BACKWARD FROM OUTER_RDD") This doesn’t: sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
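For context, a self-contained sketch of the reported behavior, assuming the Spark 1.1/1.2-era SQL API; the case class, sample rows, app name, and local master are invented here for illustration, while the table and column names follow the queries in the ticket.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical row type matching the columns referenced in the report.
case class PastDueRow(UNIT: String, PAST_DUE: String, DAYS_30: Int)

object IfStatementRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("if-repro").setMaster("local[2]"))
    val sqlSC = new SQLContext(sc)
    import sqlSC.createSchemaRDD // implicit RDD -> SchemaRDD conversion in Spark 1.x

    val rows = sc.parallelize(Seq(
      PastDueRow("U1", "CURRENT_MONTH", 30),
      PastDueRow("U2", "PAST_DUE_30", 30)))
    rows.registerTempTable("OUTER_RDD")

    // Works: literal value in the else branch of IF.
    sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, 1) as ROLL_BACKWARD FROM OUTER_RDD")
      .collect().foreach(println)

    // Reported to fail with an Unresolved Attributes error: column name in the else branch.
    sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD")
      .collect().foreach(println)

    sc.stop()
  }
}
{code}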
[jira] [Updated] (SPARK-5139) select table_alias.* with joins and selecting column names from inner queries not supported
[ https://issues.apache.org/jira/browse/SPARK-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5139: - Priority: Major (was: Blocker) Issue Type: Improvement (was: Bug) select table_alias.* with joins and selecting column names from inner queries not supported Key: SPARK-5139 URL: https://issues.apache.org/jira/browse/SPARK-5139 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Environment: Eclipse + SBT as well as linux cluster Reporter: Sunita Koppar There are 2 issues here: 1. select table_alias.* on a joined query is not supported The exception thrown is as below: at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260) at croevss.WfPlsRej$.plsrej(WfPlsRej.scala:80) at croevss.WfPlsRej$.main(WfPlsRej.scala:40) at croevss.WfPlsRej.main(WfPlsRej.scala) 2. Multilevel nesting chokes up with messages like this: Exception in thread main org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: Below is a sample query which runs on hive, but fails due to the above reasons with Spark SQL. SELECT sq.* ,r.* FROM (SELECT cs.*, w.primary_key, w.id AS s_id1, w.d_cd, w.d_name, w.rd, w.completion_date AS completion_date1, w.sales_type AS sales_type1 FROM (SELECT stg.s_id, stg.c_id, stg.v, stg.flg1, stg.flg2, comstg.d1, comstg.d2, comstg.d3, FROM croe_rej_stage_pq stg JOIN croe_rej_stage_comments_pq comstg ON ( stg.s_id = comstg.s_id ) WHERE comstg.valid_flg_txt = 'Y' AND stg.valid_flg_txt = 'Y' ORDER BY stg.s_id) cs JOIN croe_rej_work_pq w ON ( cs.s_id = w.s_id )) sq JOIN CROE_rdr_pq r ON ( sq.d_cd = r.d_number ) This is very cumbersome to deal with and we end up creating StructTypes for every level. If there is a better way to deal with this, please let us know regards Sunita -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311378#comment-14311378 ] Sean Owen commented on SPARK-5625: -- As I've said, the assembly is a JAR file. You do not extract it in order to use it; you don't extract any JAR file to use it. However it is just a zip file. {{jar xf}} and {{unzip}} both successfully extract it. But to be clear, you do not need to do so. Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311347#comment-14311347 ] DeepakVohra commented on SPARK-5625: The WinZip version is the latest, 18.5. Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311348#comment-14311348 ] DeepakVohra edited comment on SPARK-5625 at 2/8/15 3:21 PM: The error is not too many files. The error is the archive is not valid as in the screenshot. http://s763.photobucket.com/user/dvohra10/media/SparkAssembly_zps4319294c.jpg.html?o=0 The other jars in the Spark binaries lib directory get opened/extracted except the assembly jar. Could it be verified that the assembly jar gets extracted? And which extraction tool is used? was (Author: dvohra): The other jars in the Spark binaries lib directory get opened/extracted except the assembly jar. Could it be verified that the assembly jar gets extracted? And which extraction tool is used? Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5659) Flaky Test: org.apache.spark.streaming.ReceiverSuite.block
[ https://issues.apache.org/jira/browse/SPARK-5659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5659: --- Component/s: Tests Flaky Test: org.apache.spark.streaming.ReceiverSuite.block -- Key: SPARK-5659 URL: https://issues.apache.org/jira/browse/SPARK-5659 Project: Spark Issue Type: Bug Components: Streaming, Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Critical Labels: flaky-test {code} Error Message recordedBlocks.drop(1).dropRight(1).forall(((block: scala.collection.mutable.ArrayBuffer[Int]) = block.size.=(minExpectedMessagesPerBlock).(block.size.=(maxExpectedMessagesPerBlock was false # records in received blocks = [11,10,10,10,10,10,10,10,10,10,10,4,16,10,10,10,10,10,10,10], not between 7 and 11 Stacktrace sbt.ForkMain$ForkError: recordedBlocks.drop(1).dropRight(1).forall(((block: scala.collection.mutable.ArrayBuffer[Int]) = block.size.=(minExpectedMessagesPerBlock).(block.size.=(maxExpectedMessagesPerBlock was false # records in received blocks = [11,10,10,10,10,10,10,10,10,10,10,4,16,10,10,10,10,10,10,10], not between 7 and 11 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.streaming.ReceiverSuite$$anonfun$3.apply$mcV$sp(ReceiverSuite.scala:200) at org.apache.spark.streaming.ReceiverSuite$$anonfun$3.apply(ReceiverSuite.scala:158) at org.apache.spark.streaming.ReceiverSuite$$anonfun$3.apply(ReceiverSuite.scala:158) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.ReceiverSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceiverSuite.scala:39) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.ReceiverSuite.runTest(ReceiverSuite.scala:39) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at 
org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.streaming.ReceiverSuite.org$scalatest$BeforeAndAfter$$super$run(ReceiverSuite.scala:39) at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at org.apache.spark.streaming.ReceiverSuite.run(ReceiverSuite.scala:39) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) at sbt.ForkMain$Run$2.call(ForkMain.java:294) at
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311500#comment-14311500 ] DeepakVohra commented on SPARK-5625: On re-test Spark classes get found in Spark application. But the following error is still generated with RunRecommender. Exception in thread main org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.call(Client.java:1113) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229) at com.sun.proxy.$Proxy6.getProtocolVersion(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62) at com.sun.proxy.$Proxy6.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422) at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:281) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:245) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1351) at org.apache.spark.rdd.RDD.reduce(RDD.scala:867) at org.apache.spark.rdd.DoubleRDDFunctions.stats(DoubleRDDFunctions.scala:43) at com.cloudera.datascience.recommender.RunRecommender$.preparation(RunRecommender.scala:63) at 
com.cloudera.datascience.recommender.RunRecommender$.main(RunRecommender.scala:29) at com.cloudera.datascience.recommender.RunRecommender.main(RunRecommender.scala) Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
[ https://issues.apache.org/jira/browse/SPARK-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311506#comment-14311506 ] DeepakVohra commented on SPARK-5631: {quote}This means you have mismatched Hadoop versions, either between your Spark and Hadoop deployment,{quote} The Hadoop version is hadoop-2.0.0-cdh4.2.0.tar.gz and the Spark binaries are compiled with the same version: spark-1.2.0-bin-cdh4.tgz. {quote}or because you included Hadoop code in your app.{quote} The Spark application is the RunRecommender application. Server IPC version 7 cannot communicate with client version 4 -- Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311530#comment-14311530 ] Apache Spark commented on SPARK-5021: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/4459 GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
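As a side note, a small self-contained sketch of the dense-versus-sparse cost this ticket is about, using only the MLlib vector API; the dimension and values are made up, and the GaussianMixtureEM call itself is omitted since this only illustrates what densifying a SparseVector implies.
{code}
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

object SparseVsDense {
  def main(args: Array[String]): Unit = {
    val n = 10000
    // A single non-zero entry out of 10,000 dimensions.
    val sv = Vectors.sparse(n, Array(42), Array(1.0)).asInstanceOf[SparseVector]

    // Converting to a dense representation allocates and touches all n entries ...
    val dense = Vectors.dense(sv.toArray)

    // ... whereas the sparse form stores only the non-zero values, which is why
    // an implementation linear in the number of non-zeros would be cheaper
    // for high-dimensional data.
    println(s"values stored sparsely: ${sv.values.length}")
    println(s"values stored densely:  ${dense.size}")
  }
}
{code}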
[jira] [Commented] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
[ https://issues.apache.org/jira/browse/SPARK-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311531#comment-14311531 ] Sean Owen commented on SPARK-5631: -- So, one problem is that the {{cdh4}} binary is compiled vs {{2.0.0-mr1-cdh4.2.0}}. This may be the problem, that the build you downloaded is for a different flavor of CDH4. Although none of those are officially supported, I don't see why it wouldn't work to build Spark with {{-Pyarn -Phive -Phive-thriftserver -Dhadoop.version=2.0.0-cdh4.2.0}}. That would rule out that difference. The second potential difference, your app vs server, is avoided if you do not bundle Spark or Hadoop with your app, and run it with spark-submit. It doesn't matter what your app is. Server IPC version 7 cannot communicate with client version 4 -- Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311533#comment-14311533 ] Sean Owen commented on SPARK-5625: -- You asked this in a separate issue and it is discussed in SPARK-5631. Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311532#comment-14311532 ] Manoj Kumar commented on SPARK-5021: I have created a working pull request. Let us please take the discussion there. GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5273) Improve documentation examples for LinearRegression
[ https://issues.apache.org/jira/browse/SPARK-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dev Lakhani updated SPARK-5273: --- Affects Version/s: (was: 1.2.0) Improve documentation examples for LinearRegression Key: SPARK-5273 URL: https://issues.apache.org/jira/browse/SPARK-5273 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Dev Lakhani Priority: Minor In the document: https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html Under Linear least squares, Lasso, and ridge regression The suggested method to use LinearRegressionWithSGD.train() // Building the model val numIterations = 100 val model = LinearRegressionWithSGD.train(parsedData, numIterations) is not ideal even for simple examples such as y=x. This should be replaced with more real world parameters with step size: val lr = new LinearRegressionWithSGD() lr.optimizer.setStepSize(0.0001) lr.optimizer.setNumIterations(100) or LinearRegressionWithSGD.train(input,100,0.0001) To create a reasonable MSE. It took me a while using the dev forum to learn that the step size should be really small. Might help save someone the same effort when learning mllib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5273) Improve documentation examples for LinearRegression
[ https://issues.apache.org/jira/browse/SPARK-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dev Lakhani updated SPARK-5273: --- Affects Version/s: 1.2.0 Improve documentation examples for LinearRegression Key: SPARK-5273 URL: https://issues.apache.org/jira/browse/SPARK-5273 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Dev Lakhani Priority: Minor In the document: https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html Under Linear least squares, Lasso, and ridge regression The suggested method to use LinearRegressionWithSGD.train() // Building the model val numIterations = 100 val model = LinearRegressionWithSGD.train(parsedData, numIterations) is not ideal even for simple examples such as y=x. This should be replaced with more real world parameters with step size: val lr = new LinearRegressionWithSGD() lr.optimizer.setStepSize(0.0001) lr.optimizer.setNumIterations(100) or LinearRegressionWithSGD.train(input,100,0.0001) To create a reasonable MSE. It took me a while using the dev forum to learn that the step size should be really small. Might help save someone the same effort when learning mllib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
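For reference, a self-contained sketch of the kind of complete example the ticket asks for; the synthetic y = x data, app name, and local master are placeholders, and the step size of 0.0001 follows the description above.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // double-RDD implicits (needed on older 1.x releases)
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object LinearRegressionStepSize {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr-stepsize").setMaster("local[2]"))

    // Toy y = x data; with feature values this large the default step size of 1.0
    // diverges, which is what the ticket is pointing out.
    val data = sc.parallelize((1 to 1000).map { i =>
      LabeledPoint(i.toDouble, Vectors.dense(i.toDouble))
    }).cache()

    // Explicit, small step size as suggested in the description.
    val lr = new LinearRegressionWithSGD()
    lr.optimizer.setStepSize(0.0001).setNumIterations(100)
    val model = lr.run(data)
    // Equivalent shorthand: LinearRegressionWithSGD.train(data, 100, 0.0001)

    val mse = data.map { p =>
      val err = model.predict(p.features) - p.label
      err * err
    }.mean()
    println(s"training MSE = $mse")

    sc.stop()
  }
}
{code}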
[jira] [Comment Edited] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310130#comment-14310130 ] Sandy Ryza edited comment on SPARK-4550 at 2/8/15 9:07 PM: --- I got a working prototype and benchmarked the ExternalSorter changes on my laptop. Each run inserts a bunch of records, each a (Int, (10-character string, Int)) tuple, into an ExternalSorter and then calls writePartitionedFile. The reported memory size is the sum of the shuffle bytes spilled (mem) metric and the remaining size of the collection after insertion has completed. Results are averaged over three runs. Keep in mind that the primary goal here is to reduce GC pressure, so any speed improvements are icing. ||Number of Records||Storing as Serialized||Memory Size||Number of Spills||Insert Time (ms)||Write Time (ms)||Total Time|| |1 million|false|194923217|0|1123|3442|4566| |1 million|true|48694072|0|1315|2652|3967| |10 million|false|2050514159|3|26723|17418|44141| |10 million|true|613614392|1|16501|17151|33652| |50 million|false|10166122563|17|101831|89960|191791| |50 million|true|3067937592|5|76801|78361|155161| was (Author: sandyr): I got a working prototype and benchmarked the ExternalSorter changes on my laptop. Each run inserts a bunch of records, each a (Int, (10-character string, Int)) tuple, into an ExternalSorter and then calls writePartitionedFile. The reported memory size is the sum of the shuffle bytes spilled (mem) metric and the remaining size of the collection after insertion has completed. Results are averaged over three runs. Keep in mind that the primary goal here is to reduce GC pressure, so any speed improvements are icing. ||Number of Records||Storing as Serialized||Memory Size||Number of Spills||Insert Time (ms)||Write Time (ms)||Total Time|| |1 million|false|194923217|0|1123|3442|4566| |1 million|true|48694072|0|1315|2652|3967| |10 million|false|2050514159|3|26723|17418|44141| |10 million|true|613614392|1|16501|17151|33652| |10 million|false|10166122563|17|101831|89960|191791| |10 million|true|3067937592|5|76801|78361|155161| In sort-based shuffle, store map outputs in serialized form --- Key: SPARK-4550 URL: https://issues.apache.org/jira/browse/SPARK-4550 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical Attachments: SPARK-4550-design-v1.pdf One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could * spill less often because the serialized form is more compact * reduce GC pressure This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
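As an aside, a toy sketch of the property this design relies on, not Spark's actual ExternalSorter code: records are kept as opaque serialized byte chunks paired with their partition ids, so sorting by partition relocates bytes without deserializing anything; plain Java serialization and the (String, Int) payload shape are used here only to keep the example self-contained and roughly match the benchmark records above.
{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

object SerializedSortSketch {
  // Serialize one record into its own independent byte chunk.
  def serialize(record: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(record)
    out.close()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    // (partitionId, serializedRecord) pairs, inserted in arbitrary order.
    val buffer = ArrayBuffer[(Int, Array[Byte])]()
    for (i <- 1 to 20) {
      val record = (i.toString.padTo(10, '0'), i) // a (10-character String, Int) tuple
      buffer += ((i % numPartitions, serialize(record)))
    }

    // Sorting moves only (partitionId, bytes) pairs; the payloads stay as byte
    // arrays, so no per-record Java objects are re-created during the sort.
    val sorted = buffer.sortBy(_._1)
    sorted.groupBy(_._1).toSeq.sortBy(_._1).foreach { case (p, recs) =>
      println(s"partition $p: ${recs.size} records, ${recs.map(_._2.length).sum} serialized bytes")
    }
  }
}
{code}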
[jira] [Commented] (SPARK-4588) Add API for feature attributes
[ https://issues.apache.org/jira/browse/SPARK-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311586#comment-14311586 ] Apache Spark commented on SPARK-4588: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4460 Add API for feature attributes -- Key: SPARK-4588 URL: https://issues.apache.org/jira/browse/SPARK-4588 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Sean Owen Feature attributes, e.g., continuous/categorical, feature names, feature dimension, number of categories, number of nonzeros (support) could be useful for ML algorithms. In SPARK-3569, we added metadata to schema, which can be used to store feature attributes along with the dataset. We need to provide a wrapper over the Metadata class for ML usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5674) Spark Job Explain Plan Proof of Concept
Kostas Sakellis created SPARK-5674: -- Summary: Spark Job Explain Plan Proof of Concept Key: SPARK-5674 URL: https://issues.apache.org/jira/browse/SPARK-5674 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis This is just a prototype of creating an explain plan for a job. Code can be found here: https://github.com/ksakellis/spark/tree/kostas-explainPlan-poc The code was written very quickly and so doesn't have any comments or tests and is probably buggy - hence it being a proof of concept. *How to Use* # {code}sc.explainOn / sc.explainOff{code} This will generate the explain plan and print it in the logs # {code}sc.enableExecution / sc.disableExecution{code} This will disable execution of the job. Using these two knobs a user can choose to print the explain plan and/or disable the running of the job if they only want to see the plan. *Implementation* This is only a prototype and it is by no means production ready. The code is pretty hacky in places and a few shortcuts were made just to get the prototype working. The most interesting part of this commit is in the ExecutionPlanner.scala class. This class creates its own private instance of the DAGScheduler and passes into it a NoopTaskScheduler. The NoopTaskScheduler receives the created TaskSets from the DAGScheduler and records the stages and their TaskSets. The NoopTaskScheduler also creates fake CompletionEvents and sends them to the DAGScheduler to move the scheduling along. It is done this way so that we can use the DAGScheduler unmodified, thus reducing code divergence. The rest of the code is about processing the information produced by the ExecutionPlanner, creating a DAG with a bunch of metadata and printing it as a pretty ascii drawing. For drawing the DAG, https://github.com/mdr/ascii-graphs is used. Again, this was just easier for prototyping. *How is this different from RDD#toDebugString?* The execution planner runs the job through the entire DAGScheduler, so we can collect some metrics that are not presently available in the debug string. For example, we can report the binary size of the task, which might be important if the closures are referencing large objects. In addition, because we execute the scheduler code from an action, we can get a more accurate picture of the stage boundaries and dependencies. An action such as treeReduce will generate a number of stages that you can't get just by doing .toDebugString on the rdd. *Limitations of this Implementation* Because this is a prototype there is a lot of lame stuff in this commit. # All of the code in SparkContext in particular sucks. This adds some code in the runJob() call, and when it gets the plan it just writes it to the INFO log. We need to find a better way of exposing the plan to the caller so that they can print it, analyze it, etc. Maybe we can use implicits or something? Not sure how best to do this yet. # Some of the actions will return through exceptions because we are basically faking a runJob(). If you want to try this, it is best to just use count() instead of, say, collect(). This will get fixed when we fix 1). # Because the ExplainPlanner creates its own DAGScheduler, there currently is no way to map the real stages to the explain-plan stages. So if a user turns on explain plan and doesn't disable execution, we can't automatically add more metrics to the explain plan as they become available. The stageId in the plan and the stageId in the real scheduler will be different. This is important for when we add it to the web UI and users can track progress on the DAG. # We are using https://github.com/mdr/ascii-graphs to draw the DAG - not sure if we want to depend on that project. *Next Steps* # It would be good to get a few people to take a look at the code, specifically at how the plan gets generated. Clone the package and give it a try with some of your jobs. # If the approach looks okay overall, I can put together a mini design doc and add some answers to the above limitations of this approach. # Feedback most welcome. *Example Code:* {code}
sc.explainOn
sc.disableExecution
val rdd = sc.parallelize(1 to 10, 4).map(key => (key.toString, key))
val rdd2 = sc.parallelize(1 to 5, 2).map(key => (key.toString, key))
rdd.join(rdd2)
  .count()
{code} *Example Output:* {noformat}
EXPLAIN PLAN:
+----------------+   +----------------+
|                |   |                |
| Stage: 0 @ map |   | Stage: 1 @ map |
|  Tasks: 4      |   |  Tasks: 2      |
|                |   |                |
+----------------+   +----------------+
         |                    |
         v                    v
       +-------------------+
       |                   |
       | Stage: 2 @ count  |
       | Tasks: 4          |
       |                   |
       +-------------------+
STAGE DETAILS:
-- Stage: 0
[jira] [Commented] (SPARK-5635) Allow users to run .scala files directly from spark-submit
[ https://issues.apache.org/jira/browse/SPARK-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311588#comment-14311588 ] Grant Henke commented on SPARK-5635: I thought the method I listed was a workaround and not necessarily intended functionality, especially because I need to add exit at the bottom of the script to be sure I break out of interactive mode. I suggest adding the functionality to spark-submit because spark-shell does not share/support all of spark-submit's features; instead it supports uses and features around interactive/client use. This functionality is very similar to passing a Python script to spark-submit, so it appeared to be the correct place to run a Scala script as well. Allow users to run .scala files directly from spark-submit -- Key: SPARK-5635 URL: https://issues.apache.org/jira/browse/SPARK-5635 Project: Spark Issue Type: New Feature Components: Spark Core, Spark Shell Reporter: Grant Henke Priority: Minor Similar to the Python functionality, allow users to submit .scala files. Currently the way I simulate this is to use spark-shell and run: `spark-shell -i myscript.scala` Note: the user needs to add exit to the bottom of the script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
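For concreteness, a minimal sketch of the workaround script described above; the file name and the toy job inside it are hypothetical, and spark-shell provides the sc binding.
{code}
// myscript.scala, run with: spark-shell -i myscript.scala
// Inside spark-shell the SparkContext is already available as `sc`.
val evens = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
println(s"even numbers counted: $evens")

// Without an explicit exit the shell drops into interactive mode after the
// script finishes; the comment above uses `exit`, and sys.exit(0) is equivalent.
sys.exit(0)
{code}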
[jira] [Commented] (SPARK-5625) Spark binaries do not incude Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311596#comment-14311596 ] DeepakVohra commented on SPARK-5625: Thanks Sean. Spark binaries do not incude Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
[ https://issues.apache.org/jira/browse/SPARK-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311597#comment-14311597 ] DeepakVohra commented on SPARK-5631: Thanks for the clarification. Server IPC version 7 cannot communicate with client version 4 -- Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
[ https://issues.apache.org/jira/browse/SPARK-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311597#comment-14311597 ] DeepakVohra edited comment on SPARK-5631 at 2/8/15 10:38 PM: - Thanks for the clarification. The error gets removed. was (Author: dvohra): Thanks for the clarification. Server IPC version 7 cannot communicate with client version 4 -- Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2958) FileClientHandler should not be shared in the pipeline
[ https://issues.apache.org/jira/browse/SPARK-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311602#comment-14311602 ] Reynold Xin commented on SPARK-2958: cc [~adav] this is no longer a problem in the new shuffle module, is it? FileClientHandler should not be shared in the pipeline -- Key: SPARK-2958 URL: https://issues.apache.org/jira/browse/SPARK-2958 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Netty module creates a single FileClientHandler and shares it in all threads. We should create a new one for each pipeline thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3991) Not Serializable , Nullpinter Exceptions in SQL server mode
[ https://issues.apache.org/jira/browse/SPARK-3991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3991: - Priority: Major (was: Blocker) Downgrading until it's clear what the issue is. There are several items here. 1. This sounds like the same issue raised in SPARK-4944. 2. You might need to provide more info, like what the nature of the join is. 3. This sounds related to SPARK-3914, and may be solved by it. I suggest tracking one issue per JIRA. If one of these is still relevant and not a duplicate, maybe this issue can change to track that one; if more than one is, track one here and create another JIRA for the others. Not Serializable , Nullpinter Exceptions in SQL server mode --- Key: SPARK-3991 URL: https://issues.apache.org/jira/browse/SPARK-3991 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: eblaas Attachments: not_serializable_exception.patch I'm working on connecting Mondrian with Spark SQL via JDBC. Good news: it works, but there are some bugs to fix. I customized the HiveThriftServer2 class to load, transform and register tables (ETL) with the HiveContext. Data tables are generated from Cassandra and from a relational database. * 1st problem: hiveContext.registerRDDAsTable(treeSchema, tree) does not register the table in the Hive metastore (show tables; via JDBC does not list the table, but I can query it, e.g. select * from tree). Dirty workaround: create a table with the same name and schema; this was necessary because Mondrian validates table existence: hiveContext.sql("CREATE TABLE tree (dp_id BIGINT, h1 STRING, h2 STRING, h3 STRING)") * 2nd problem: Mondrian creates complex joins, which results in serialization exceptions. Two classes in hiveUdfs.scala have to be made serializable - DeferredObjectAdapter and HiveGenericUdaf. * 3rd problem: NullPointerException in InMemoryRelation, line 42: override lazy val statistics = Statistics(sizeInBytes = child.sqlContext.defaultSizeInBytes) The sqlContext in child was null; quick fix: set a default value from SparkContext: override lazy val statistics = Statistics(sizeInBytes = 1) I'm not sure how to fix these bugs, but with the patch file it works at least. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3034) [HIve] java.sql.Date cannot be cast to java.sql.Timestamp
[ https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3034: - Component/s: (was: Spark Core) Priority: Major (was: Blocker) Can you provide steps to reproduce this, and/or check whether it's still an issue? downgrading until there is more info. [HIve] java.sql.Date cannot be cast to java.sql.Timestamp - Key: SPARK-3034 URL: https://issues.apache.org/jira/browse/SPARK-3034 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: pengyanhong run a simple HiveQL via yarn-cluster, got error as below: {quote} Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: java.sql.Date cannot be cast to java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33) org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
[jira] [Resolved] (SPARK-2998) scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet
[ https://issues.apache.org/jira/browse/SPARK-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2998. -- Resolution: Duplicate scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet -- Key: SPARK-2998 URL: https://issues.apache.org/jira/browse/SPARK-2998 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: pengyanhong Priority: Blocker run a HiveQL via yarn-cluster, got error as below: {quote} 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Serialized task 8.0:2 as 20849 bytes in 0 ms 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Finished TID 812 in 24 ms on A01-R06-I149-32.jd.local (progress: 2/200) 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Completed ResultTask(8, 1) 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Failed to run reduce at joins.scala:336 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): finishApplicationMaster with FAILED Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199) Caused by: org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:849) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1231) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.ClassCastException: scala.collection.mutable.HashSet cannot be cast to scala.collection.mutable.BitSet at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$7.apply(joins.scala:336) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:813) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:810) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:845) ... 10 more 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Invoking sc stop from shutdown hook 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): AppMaster received a signal. 
14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Starting task 8.0:3 as TID 814 on executor 1: A01-R06-I149-32.jd.local (PROCESS_LOCAL) 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Serialized task 8.0:3 as 20849 bytes in 0 ms 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Finished TID 813 in 25 ms on A01-R06-I149-32.jd.local (progress: 3/200) 14/08/13 11:10:01 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:58): Completed ResultTask(8, 2) .. {quote} It runs successfully if removing the configuration about Kryo -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4840) Incorrect documentation of master url on Running Spark on Mesos page
[ https://issues.apache.org/jira/browse/SPARK-4840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4840. -- Resolution: Not a Problem OK, if that doesn't prove to be the answer, reopen with more info. Incorrect documentation of master url on Running Spark on Mesos page Key: SPARK-4840 URL: https://issues.apache.org/jira/browse/SPARK-4840 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Sam Stoelinga Priority: Minor In the paragraph "Using a Mesos Master URL" there is currently the sentence: "or mesos://zk://host:2181 for a multi-master Mesos cluster using ZooKeeper." This should be: "or mesos://zk://host:2181/mesos for a multi-master Mesos cluster using ZooKeeper." If you don't add /mesos to the end of the URL, spark-shell would not start for me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
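For reference, a minimal sketch of where the corrected URL form would be used when setting the master programmatically rather than via spark-shell; the host and application name are placeholders, and this obviously needs a reachable Mesos/ZooKeeper cluster to actually start.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object MesosZkMasterUrl {
  def main(args: Array[String]): Unit = {
    // Note the trailing /mesos znode, which is the documentation fix requested here;
    // "host" and the app name are placeholders.
    val conf = new SparkConf()
      .setAppName("mesos-zk-example")
      .setMaster("mesos://zk://host:2181/mesos")
    val sc = new SparkContext(conf)
    println(sc.master)
    sc.stop()
  }
}
{code}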