sbt/sbt assembly fails with ssl certificate error
I am facing a weird failure where "sbt/sbt assembly" shows a lot of SSL certificate errors for repo.maven.apache.org. Is anyone else seeing the same problem? Any idea why this is happening? Yesterday I was able to run it successfully. Loading https://repo.maven.apache.org shows an invalid cert chain. — Thanks
Re: sbt/sbt assembly fails with ssl certificate error
I'm also seeing this. It was also working for me previously, AFAIK. The proximate cause is my well-intentioned change that uses HTTPS to access all artifact repos; the default for Maven Central before would have been HTTP. While using HTTPS is a good idea, it can run into complications. I see: https://issues.sonatype.org/browse/MVNCENTRAL-377 https://issues.apache.org/jira/browse/INFRA-7363 ... which suggest that HTTPS actually isn't supported and should return 404. Then I see: http://blog.sonatype.com/2012/10/now-available-ssl-connectivity-to-central/#.Uy7Yuq1_tj4 ... which suggests you have to pay for HTTPS access to Maven Central? But we both seem to have found it working (even after the change above), and it does not return 404 now. The Maven build still works, but if you look carefully, that's only because it eventually falls back to HTTP for Maven Central artifacts. So I think the thing to do is simply to back off to HTTP for Maven Central only. That's unfortunate, because there is a small but non-trivial security issue in downloading artifacts without encryption. Any brighter ideas? If so, I'll open a supplementary PR to adjust this.
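For anyone who wants to experiment locally before a fix lands, here is a minimal sketch of declaring Maven Central over plain HTTP in an sbt build. This merely adds an HTTP resolver; the resolver name is illustrative, and how this integrates with Spark's actual build definition is not shown:

  // Hypothetical build.sbt fragment: resolve artifacts from Maven
  // Central over plain HTTP instead of HTTPS.
  resolvers += "Maven Central (HTTP)" at "http://repo.maven.apache.org/maven2"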
Re: distinct on huge dataset
Andrew, this should be fixed in 0.9.1, assuming it is the same hash collision error we found there. Kane, is it possible your bigger data is corrupt, such that any operation on it fails? On Sat, Mar 22, 2014 at 10:39 PM, Andrew Ash and...@andrewash.com wrote: FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and have it disabled now. The specific error behavior was that a join would consistently return one count of rows with spill enabled and another count with it disabled. On Mar 22, 2014 1:52 PM, Kane kane.ist...@gmail.com wrote: But I was wrong - map also fails on the big file, and setting spark.shuffle.spill doesn't help. Map fails with the same error.
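For reference, a minimal sketch of how spark.shuffle.spill can be disabled on Spark 0.9 while investigating this kind of count discrepancy. The master URL and application name are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical repro setup: run the same join with spilling off and
  // compare row counts against a run with the default (spilling on).
  val conf = new SparkConf()
    .setMaster("local[4]")                 // placeholder master URL
    .setAppName("spill-repro")             // placeholder app name
    .set("spark.shuffle.spill", "false")
  val sc = new SparkContext(conf)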
error loading large files in PySpark 0.9.0
Hi all, Hitting a mysterious error loading large text files, specific to PySpark 0.9.0. In PySpark 0.8.1, this works:

  data = sc.textFile("path/to/myfile")
  data.count()

But in 0.9.0, it stalls. There are indications of completion up to:

  14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X (progress: 15/537)
  14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)

And then this repeats indefinitely:

  14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5, runningTasks: 144
  14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5, runningTasks: 144

It always stalls at the same place. There's nothing in stderr on the workers, but in stdout there are several of these messages:

  INFO PythonRDD: stdin writer to Python finished early

So perhaps the real error is being suppressed, as in https://spark-project.atlassian.net/browse/SPARK-1025 The data is just rows of space-separated numbers, ~20GB, with 300k rows and 50k characters per row. Running on a private cluster with 10 nodes, 100GB / 16 cores each, Python 2.7.6. I doubt the data is corrupted, as it works fine in Scala in 0.8.1 and 0.9.0, and in PySpark in 0.8.1. Happy to post the file, but it should repro for anything with these dimensions. It *might* be specific to long strings: I don't see it with fewer characters (10k) per row, but I also don't see it with many fewer rows and the same number of characters per row. Happy to try and provide more info / help debug! -- Jeremy
Re: combining operations elegantly
Hi Koert, Patrick, do you already have an elegant solution to combine multiple operations on a single RDD? Say, for example, that I want to do a sum over one column, and a count and an average over another column. Thanks in advance, Richard On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling rsiebel...@gmail.com wrote: Patrick, Koert, I'm also very interested in these examples, could you please post them if you find them? Thanks in advance, Richard On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers ko...@tresata.com wrote: Not that long ago there was a nice example on here about how to combine multiple operations on a single RDD - basically, if you want to do a count() and something else, how to roll them into a single job. I think Patrick Wendell gave the examples. I can't find them anymore. Patrick, can you please repost? Thanks!
Re: distinct on huge dataset
Yes, there was an error in the data; after fixing it, count fails with an Out of Memory error.
Re: sbt/sbt assembly fails with ssl certificate error
I am getting these weird errors which I have not seen before:

  [error] Server access Error: handshake alert: unrecognized_name url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit
  [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
  [error] Server access Error: handshake alert: unrecognized_name url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.transaction/1.1.1.v201105210645/javax.transaction-1.1.1.v201105210645.orbit
  [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
  [error] Server access Error: handshake alert: unrecognized_name url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.mail.glassfish/1.4.1.v201005082020/javax.mail.glassfish-1.4.1.v201005082020.orbit
  [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
  [error] Server access Error: handshake alert: unrecognized_name url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.activation/1.1.0.v201105071233/javax.activation-1.1.0.v201105071233.orbit

Is this also due to the SSL error? Thanks. Deb
Re: sbt/sbt assembly fails with ssl certificate error
These errors should be fixed on master with Sean's PR: https://github.com/apache/spark/pull/209 The orbit errors are quite possibly due to using https instead of http, whether or not the SSL cert was bad. Let us know if they go away after reverting to http.
No space left on device exception
Hello, I have a weird error showing up when I run a job on my Spark cluster. The version of Spark is 0.9 and I have 3+ GB free on the disk when this error shows up. Any ideas what I should be looking for?

  [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times (most recent failure: Exception failure: java.io.FileNotFoundException: /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left on device))
  org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times (most recent failure: Exception failure: java.io.FileNotFoundException: /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left on device))
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
      at scala.Option.foreach(Option.scala:236)
      at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)

Thanks! Ognen
Re: No space left on device exception
On some systems, /tmp is an in-memory tmpfs file system with its own size limit. It's possible that this limit has been exceeded. You might try running the df command to check the free space of /tmp, or of root if /tmp isn't listed separately. 3 GB also seems pretty low for the remaining free space of a disk. If your disk size is in the TB range, it's possible that the last couple of GB have issues when being allocated, due to fragmentation or reclamation policies.
Re: combining operations elegantly
Hey all, I think the old thread is here: https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J The method proposed in that thread is to create a utility class for doing single-pass aggregations. Using Algebird is a pretty good way to do this, and is a bit more flexible since you don't need to create a new utility each time. In Spark 1.0 and later you will be able to do this more elegantly with the schema support:

  myRDD.groupBy('user).select(Sum('clicks) as 'clicks, Average('duration) as 'duration)

and it will use a single pass automatically... but that's not quite released yet :) - Patrick On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers ko...@tresata.com wrote: I currently typically do something like this:

  scala> val rdd = sc.parallelize(1 to 10)
  scala> import com.twitter.algebird.Operators._
  scala> import com.twitter.algebird.{Max, Min}
  scala> rdd.map { x => (
       |   1L,
       |   Min(x),
       |   Max(x),
       |   x
       | )}.reduce(_ + _)
  res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int], Int) = (10,Min(1),Max(10),55)

However, for this you need the Twitter Algebird dependency. Without it, you have to code the reduce function on the tuples yourself. Another example with 2 columns, where I do a conditional count for the first column and a simple sum for the second:

  scala> sc.parallelize((1 to 10).zip(11 to 20)).map { case (x, y) => (
       |   if (x > 5) 1 else 0,
       |   y
       | )}.reduce(_ + _)
  res3: (Int, Int) = (5,155)
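To tie this back to Richard's question - a sum over one column plus a count and an average over another - here is a minimal sketch of the plain-tuple version with no Algebird dependency, assuming an existing SparkContext sc. The sample data is illustrative:

  // One pass over (col1, col2) pairs: accumulate (sum of col1, row
  // count, sum of col2), then derive the average of col2 at the end.
  val rows = sc.parallelize(Seq((1.0, 10.0), (2.0, 20.0), (3.0, 30.0)))
  val (sum1, count, sum2) = rows
    .map { case (a, b) => (a, 1L, b) }
    .reduce { case ((s1, c1, t1), (s2, c2, t2)) => (s1 + s2, c1 + c2, t1 + t2) }
  val avg2 = sum2 / count  // average of the second column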
Re: No space left on device exception
On 3/23/14, 5:49 PM, Matei Zaharia wrote: You can set spark.local.dir to put this data somewhere other than /tmp if /tmp is full. Actually it's recommended to have multiple local disks and set it to a comma-separated list of directories, one per disk. Matei, does the number of tasks/partitions in a transformation influence anything in terms of disk space consumption? Or inode consumption? Thanks, Ognen
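A minimal sketch of Matei's suggestion, assuming two local disks; the mount points, master URL, and application name are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical setup: spread shuffle and spill files across two
  // local disks instead of the default /tmp.
  val conf = new SparkConf()
    .setMaster("spark://master:7077")   // placeholder master URL
    .setAppName("shuffle-heavy-job")    // placeholder app name
    .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")
  val sc = new SparkContext(conf)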
Re: Problem with SparkR
Hi Thanks for reporting this. It'll be great if you can check a couple of things: 1. Are you trying to use this with Hadoop2 by any chance? There was an incompatible ASM version bug that we fixed for Hadoop2 https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it, but I just want to check if the same error is cropping up again. 2. Is there a stack trace that follows the IncompatibleClassChangeError? If so, could you attach it? The error message indicates there is some incompatibility between class versions, and having a more detailed stack trace might help us track this down. Thanks Shivaram On Sun, Mar 23, 2014 at 4:48 PM, Jacques Basaldúa jacq...@dybot.com wrote: I am really interested in using Spark from R and have tried to use SparkR, but always get the same error. This is how I installed: - I successfully installed Spark version 0.9.0 with Scala 2.10.3 (OpenJDK 64-Bit Server VM, Java 1.7.0_45). I can run examples from spark-shell and Python. - I installed the R package devtools and installed SparkR using:

  library(devtools)
  install_github("amplab-extras/SparkR-pkg", subdir="pkg")

This compiled the package successfully. When I try to run the package, e.g.:

  library(SparkR)
  sc <- sparkR.init(master="local")   # so far the program runs fine
  rdd <- parallelize(sc, 1:10)        # this returns the following error

  Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") :
    java.lang.IncompatibleClassChangeError: org/apache/spark/util/InnerClosureFinder

No matter how I try to use the sc (I have tried all the examples), I always get an error. Any ideas? Jacques.
is it possible to access the inputsplit in Spark directly?
Hello, In Spark we can use newAPIHadoopRDD to access different distributed systems like HDFS, HBase, and MongoDB via their InputFormats. Is it possible to access the InputSplit in Spark directly? Spark can cache data in local memory, so performing local computation/aggregation on the local InputSplit could speed up overall performance. Thanks a lot
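As far as I know, the public RDD API does not expose the underlying InputSplit, but per-partition local aggregation - which seems to be the goal here - can be expressed with mapPartitions, since each partition of a Hadoop-based RDD corresponds to one input split. A minimal sketch, assuming an existing SparkContext sc; the path and the aggregation are illustrative:

  // Sum each partition locally, then combine the per-partition results.
  val nums = sc.textFile("hdfs:///data/nums.txt").map(_.toLong)  // placeholder path
  val perPartition = nums.mapPartitions(iter => Iterator(iter.sum))
  val total = perPartition.reduce(_ + _)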
Re: error loading large files in PySpark 0.9.0
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your SparkContext? It tries to serialize that many objects together at a time, which might be too much. By default the batchSize is 1024. Matei
Re: sbt/sbt assembly fails with ssl certificate error
I don't see the errors anymore. Thanks, Aaron.
Re: No space left on device exception
Aaron, thanks for replying. I am very much puzzled as to what is going on. A job that used to run on the same cluster is failing with this mysterious message about not having enough disk space, when in fact I can see through "watch df -h" that the free space is always hovering around 3+ GB on the disk and the free inodes are at 50% (this is on the master). I went through each slave and the spark/work/app*/stderr and stdout and spark/logs/*out files, and there is no mention of "too many open files" failures on any of the slaves nor on the master :( Thanks Ognen On 3/23/14, 8:38 PM, Aaron Davidson wrote: By default, with P partitions (for both the pre-shuffle and post-shuffle stages), there are P^2 files created. With spark.shuffle.consolidateFiles turned on, we would instead create only P files. Disk space consumption is largely unaffected by the number of partitions, however, unless each partition is particularly small. You might look at the actual executors' logs, as it's possible that this error was caused by an earlier exception, such as too many open files.
Re: No space left on device exception
Bleh, strike that - one of my slaves was at 100% inode utilization on the file system. It was /tmp/spark* leftovers that apparently did not get cleaned up properly after failed or interrupted jobs. Mental note: run a cron job on all slaves and the master to clean up /tmp/spark* regularly. Thanks (and sorry for the noise)! Ognen
Re: No space left on device exception
Thanks for bringing this up - 100% inode utilization is an issue I haven't seen raised before, and it raises another issue which is not on our current roadmap for state cleanup (cleaning up data which was not fully cleaned up from a crashed process).
Re: No space left on device exception
I would love to work on this (and other) stuff, if I can bother someone with questions offline or on a dev mailing list. Ognen
How many partitions is my RDD split into?
Hey there fellow Dukes of Data, How can I tell how many partitions my RDD is split into? I'm interested in knowing because, from what I gather, having a good number of partitions is good for performance. If I'm looking to understand how my pipeline is performing, say for a parallelized write out to HDFS, knowing how many partitions an RDD has would be a good thing to check. Is that correct? I could not find an obvious method or property to see how my RDD is partitioned. Instead, I devised the following thingy:

  def f(idx, itr): yield idx

  rdd = sc.parallelize([1, 2, 3, 4], 4)
  rdd.mapPartitionsWithIndex(f).count()

Frankly, I'm not sure what I'm doing here, but this seems to give me the answer I'm looking for. Derp. :) So in summary, should I care about how finely my RDDs are partitioned? And how would I check on that? Nick
Re: How many partitions is my RDD split into?
It's much simpler: rdd.partitions.size
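For example, in the Scala shell (the partition count of 8 is chosen arbitrarily):

  scala> val rdd = sc.parallelize(1 to 100, 8)
  scala> rdd.partitions.size
  res0: Int = 8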
Re: How many partitions is my RDD split into?
As Mark said, you can actually access this easily. The main issue I've seen from a performance perspective is people having a bunch of really small partitions. This will still work, but performance will improve if you coalesce the partitions using rdd.coalesce(). This can happen, for example, if you do a highly selective filter on an RDD - for instance, filtering out one day of data from a dataset of a year. - Patrick
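A minimal sketch of that scenario, assuming an existing SparkContext sc; the path, date prefix, and target partition count are illustrative:

  // A highly selective filter leaves many near-empty partitions;
  // coalesce compacts them before further processing.
  val year = sc.textFile("hdfs:///logs/2014/*")          // placeholder path
  val oneDay = year.filter(_.startsWith("2014-03-23"))   // placeholder predicate
  println(oneDay.partitions.size)                        // still the original count
  val compacted = oneDay.coalesce(16)                    // illustrative target
  println(compacted.partitions.size)                     // now 16 (or fewer)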
Re: No space left on device exception
Ognen - just so I understand: the issue is that there weren't enough inodes, and this was causing a "No space left on device" error? Is that correct? If so, that's good to know, because it's definitely counterintuitive.