Re: how to publish spark in-house?
i just looked at my dependencies in sbt, and when using cdh4.5.0 dependencies i see that hadoop-client pulls in jboss netty (via zookeeper) and asm 3.x (via jersey-server). so somehow these exclusion rules are not working anymore? i will look into sbt-pom-reader a bit to try to understand what's happening On Mon, Jul 28, 2014 at 8:45 PM, Patrick Wendell pwend...@gmail.com wrote: All of the scripts we use to publish Spark releases are in the Spark repo itself, so you could follow these as a guideline. The publishing process in Maven is similar to in SBT: https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L65 On Mon, Jul 28, 2014 at 12:39 PM, Koert Kuipers ko...@tresata.com wrote: ah ok thanks. guess i am gonna read up on the maven-release-plugin then! On Mon, Jul 28, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote: This is not something you edit yourself. The Maven release plugin manages setting all this. I think virtually everything you're worried about is done for you by this plugin. Maven requires artifacts to set a version and it can't inherit one. I feel like I understood the reason this is necessary at one point. On Mon, Jul 28, 2014 at 8:33 PM, Koert Kuipers ko...@tresata.com wrote: and if i want to change the version, it seems i have to change it in all 23 pom files? mhhh. is it mandatory for these sub-project pom files to repeat that version info? useful? spark$ grep 1.1.0-SNAPSHOT * -r | wc -l 23 On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote: hey, we used to publish spark in-house by simply overriding the publishTo setting. but now that the sbt build is integrated with maven i cannot find it anymore. i tried looking into the pom file, but after reading 1144 lines of xml i 1) haven't found anything that looks like publishing, 2) feel somewhat sick, and 3) am considering alternative careers to developing... where am i supposed to look? thanks for your help!
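for illustration, a minimal sbt sketch of forcing those exclusions by hand on hadoop-client (the group ids for the offending artifacts are my assumption from reading the dependency tree, and the cdh version string is just indicative):

  libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.5.0" excludeAll(
    ExclusionRule(organization = "org.jboss.netty"),  // jboss netty, pulled in via zookeeper
    ExclusionRule(organization = "asm")               // asm 3.x, pulled in via jersey-server
  )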
how to publish spark in-house?
hey, we used to publish spark in-house by simply overriding the publishTo setting. but now that the sbt build is integrated with maven i cannot find it anymore. i tried looking into the pom file, but after reading 1144 lines of xml i 1) haven't found anything that looks like publishing, 2) feel somewhat sick, and 3) am considering alternative careers to developing... where am i supposed to look? thanks for your help!
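for reference, the kind of override i mean, as a minimal sbt sketch (repo name, url, and credentials path are placeholders for an internal setup):

  publishTo := Some("internal-releases" at "https://repo.example.com/releases")
  credentials += Credentials(Path.userHome / ".ivy2" / ".credentials")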
Re: how to publish spark in-house?
and if i want to change the version, it seems i have to change it in all 23 pom files? mhhh. is it mandatory for these sub-project pom files to repeat that version info? useful? spark$ grep 1.1.0-SNAPSHOT * -r | wc -l 23 On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote: hey, we used to publish spark in-house by simply overriding the publishTo setting. but now that the sbt build is integrated with maven i cannot find it anymore. i tried looking into the pom file, but after reading 1144 lines of xml i 1) haven't found anything that looks like publishing, 2) feel somewhat sick, and 3) am considering alternative careers to developing... where am i supposed to look? thanks for your help!
Re: how to publish spark in-house?
ah ok thanks. guess i am gonna read up on the maven-release-plugin then! On Mon, Jul 28, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote: This is not something you edit yourself. The Maven release plugin manages setting all this. I think virtually everything you're worried about is done for you by this plugin. Maven requires artifacts to set a version and it can't inherit one. I feel like I understood the reason this is necessary at one point. On Mon, Jul 28, 2014 at 8:33 PM, Koert Kuipers ko...@tresata.com wrote: and if i want to change the version, it seems i have to change it in all 23 pom files? mhhh. is it mandatory for these sub-project pom files to repeat that version info? useful? spark$ grep 1.1.0-SNAPSHOT * -r | wc -l 23 On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote: hey, we used to publish spark in-house by simply overriding the publishTo setting. but now that the sbt build is integrated with maven i cannot find it anymore. i tried looking into the pom file, but after reading 1144 lines of xml i 1) haven't found anything that looks like publishing, 2) feel somewhat sick, and 3) am considering alternative careers to developing... where am i supposed to look? thanks for your help!
graphx cached partitions won't go away
i have graphx queries running inside a service where i collect the results to the driver and do not hold any references to the rdds involved in the queries. my assumption was that with the references gone spark would go and remove the cached rdds from memory (note, i did not cache them, graphx did). yet they hang around... is my understanding of how the ContextCleaner works incorrect? or could it be that graphx holds some references internally to rdds, preventing garbage collection? maybe even circular references?
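as a workaround i am considering unpersisting by hand. a minimal sketch, assuming i still hold a handle on the graph when the query finishes (VertexRDD and EdgeRDD are just RDDs, so RDD.unpersist applies):

  graph.vertices.unpersist(blocking = false)  // drop the vertex partitions that graphx cached
  graph.edges.unpersist(blocking = false)     // drop the edge partitions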
using shapeless in spark to optimize data layout in memory
hello all, in case anyone is interested, i just wrote a short blog about using shapeless in spark to optimize data layout in memory. blog is here: http://tresata.com/tresata-open-sources-spark-columnar code is here: https://github.com/tresata/spark-columnar
Re: replacement for SPARK_LIBRARY_PATH ?
but be aware that spark-defaults.conf is only used if you use spark-submit On Jul 17, 2014 4:29 PM, Zongheng Yang zonghen...@gmail.com wrote: One way is to set this in your conf/spark-defaults.conf: spark.executor.extraLibraryPath /path/to/native/lib The key is documented here: http://spark.apache.org/docs/latest/configuration.html On Thu, Jul 17, 2014 at 1:25 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I used to use SPARK_LIBRARY_PATH to specify the location of native libs for lzo compression when using spark 0.9.0. The references to that environment variable have disappeared from the docs for spark 1.0.1 and it's not clear how to specify the location for lzo. Any guidance?
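if the app is not launched through spark-submit, the same key can presumably be set directly on the SparkConf before the context is created. a minimal sketch (app name and path are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("lzo-app")
    .set("spark.executor.extraLibraryPath", "/path/to/native/lib")  // must be set before executors launch
  val sc = new SparkContext(conf)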
Re: spark ui on yarn
my yarn environment does have less memory for the executors. i am checking if the RDDs are cached by calling sc.getRDDStorageInfo, which shows an RDD as fully cached in memory, yet it does not show up in the UI On Sun, Jul 13, 2014 at 1:49 AM, Matei Zaharia matei.zaha...@gmail.com wrote: The UI code is the same in both, but one possibility is that your executors were given less memory on YARN. Can you check that? Or otherwise, how do you know that some RDDs were cached? Matei On Jul 12, 2014, at 4:12 PM, Koert Kuipers ko...@tresata.com wrote: hey shuo, so far all stage links work fine for me. i did some more testing, and it seems kind of random what shows up in the gui and what does not. some partially cached RDDs make it to the GUI, while some fully cached ones do not. I have not been able to detect a pattern. is the codebase for the gui different in standalone than in yarn-client mode? On Sat, Jul 12, 2014 at 3:34 AM, Shuo Xiang shuoxiang...@gmail.com wrote: Hi Koert, Just curious, did you find any information like CANNOT FIND ADDRESS after clicking into some stage? I've seen similar problems due to loss of executors. Best, On Fri, Jul 11, 2014 at 4:42 PM, Koert Kuipers ko...@tresata.com wrote: I just tested a long-lived application (that we normally run in standalone mode) on yarn in client mode. it looks to me like cached rdds are missing in the storage tab of the ui. accessing the rdd storage information via the spark context shows rdds as fully cached but they are missing on the storage page. spark 1.0.0
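the check i am doing, roughly (just printing the storage info on the driver; RDDInfo.toString, if i remember right, includes cached partitions and memory size):

  sc.getRDDStorageInfo.foreach(println)  // one line per rdd known to the block manager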
Re: spark ui on yarn
hey shuo, so far all stage links work fine for me. i did some more testing, and it seems kind of random what shows up in the gui and what does not. some partially cached RDDs make it to the GUI, while some fully cached ones do not. I have not been able to detect a pattern. is the codebase for the gui different in standalone than in yarn-client mode? On Sat, Jul 12, 2014 at 3:34 AM, Shuo Xiang shuoxiang...@gmail.com wrote: Hi Koert, Just curious, did you find any information like CANNOT FIND ADDRESS after clicking into some stage? I've seen similar problems due to loss of executors. Best, On Fri, Jul 11, 2014 at 4:42 PM, Koert Kuipers ko...@tresata.com wrote: I just tested a long-lived application (that we normally run in standalone mode) on yarn in client mode. it looks to me like cached rdds are missing in the storage tab of the ui. accessing the rdd storage information via the spark context shows rdds as fully cached but they are missing on the storage page. spark 1.0.0
spark ui on yarn
I just tested a long-lived application (that we normally run in standalone mode) on yarn in client mode. it looks to me like cached rdds are missing in the storage tab of the ui. accessing the rdd storage information via the spark context shows rdds as fully cached but they are missing on the storage page. spark 1.0.0
Re: Purpose of spark-submit?
not sure I understand why unifying how you submit an app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for the classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-submit script must be doing a few extra things there that I am missing... which makes things more difficult because I am not sure it's realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ?
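the standalone-mode launch that works for me, as a minimal sketch (master url, app name, and jar path are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("spark://master:7077")              // standalone master
    .setAppName("my-app")
    .setJars(Seq("/path/to/my-app-assembly.jar"))  // ship the application jar to executors
  val sc = new SparkContext(conf)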
Re: RDD Cleanup
did you explicitly cache the rdd? we cache rdds and share them between jobs just fine within one context in spark 1.0.x. but we do not use the ooyala job server... On Wed, Jul 9, 2014 at 10:03 AM, premdass premdas...@yahoo.co.in wrote: Hi, I am using spark 1.0.0, using the Ooyala Job Server, for a low latency query system. Basically a long running context is created, which enables running multiple jobs under the same context, and hence sharing of the data. It was working fine in 0.9.1. However in the spark 1.0 release, the RDD's created and cached by Job-1 get cleaned up by the BlockManager (I can see log statements saying cleaning up RDD) and so the cached RDD's are not available for Job-2, though both Job-1 and Job-2 are running under the same spark context. I tried using the spark.cleaner.referenceTracking = false setting, however this causes the issue that unpersisted RDD's are not cleaned up properly, and occupy Spark's memory. Has anybody faced an issue like this before? If so, any advice would be greatly appreciated. Also is there any way to mark an RDD as being used under a context, even though the job using it has finished (so subsequent jobs can use that RDD)? Thanks, Prem
Re: RDD Cleanup
we simply hold on to the reference to the rdd after it has been cached. so we have a single Map[String, RDD[X]] for cached RDDs for the application On Wed, Jul 9, 2014 at 11:00 AM, premdass premdas...@yahoo.co.in wrote: Hi, Yes, I am caching the RDD's by calling the cache method. May I ask how you are sharing RDD's across jobs in the same context? By the RDD name? I tried printing the RDD's of the Spark context, and when referenceTracking is enabled, I get an empty list after the clean up. Thanks, Prem
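a minimal sketch of what that registry looks like for us (names and structure simplified; the string keys are whatever the application chooses):

  import scala.collection.concurrent.TrieMap
  import org.apache.spark.rdd.RDD

  object RddRegistry {
    private val cached = TrieMap.empty[String, RDD[_]]  // thread-safe map of cached rdds
    def put(name: String, rdd: RDD[_]): Unit = cached.put(name, rdd.cache())
    def get(name: String): Option[RDD[_]] = cached.get(name)
    def remove(name: String): Unit = cached.remove(name).foreach(_.unpersist())
  }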
Re: Purpose of spark-submit?
sandy, that makes sense. however i had trouble doing programmatic execution on yarn in client mode as well. the application-master in yarn came up but then bombed because it was looking for jars that don't exist (it was looking in the original file paths on the driver side, which are not available on the yarn node). my guess is that spark-submit is changing some settings (perhaps preparing the distributed cache and modifying settings accordingly), which makes it harder to run things programmatically. i could be wrong however. i gave up debugging and resorted to using spark-submit for now. On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the SparkContext is created on a remote node well after the application is launched. On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote: Another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as the main framework that defines what my application should look like. In my humble opinion, using Spark as an embeddable library rather than as the main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using a shell script. We also experienced issues submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop world, I rarely used hadoop jar to submit jobs in a shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourselves with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit an app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for the classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-submit script must be doing a few extra things there that I am missing... which makes things more difficult because I am not sure it's realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ?
Re: Disabling SparkContext WebUI on port 4040, accessing information programmatically?
do you control your cluster and spark deployment? if so, you can try to rebuild with jetty 9.x On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Digging a bit more I see that there is yet another jetty instance that is causing the problem, namely the BroadcastManager has one. I guess this one isn't very wise to disable... It might very well be that the WebUI is a problem as well, but I guess the code doesn't get far enough. Any ideas on how to solve this? Spark seems to use jetty 8.1.14, while dropwizard uses jetty 9.0.7, so that might be the source of the problem. Any ideas? On Tue, Jul 8, 2014 at 2:58 PM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Hi! I am building a web frontend for a Spark app, allowing users to input sql/hql and get results back. When starting a SparkContext from within my server code (using jetty/dropwizard) I get the error java.lang.NoSuchMethodError: org.eclipse.jetty.server.AbstractConnector: method init()V not found when Spark tries to fire up its own jetty server. This does not happen when running the same code without my web server. This is probably fixable somehow(?) but I'd like to disable the webUI as I don't need it, and ideally I would like to access that information programmatically instead, allowing me to embed it in my own web application. Is this possible? -- Best regards, Martin Gammelsæter -- Best regards, Martin Gammelsæter 92209139
Re: graphx Joining two VertexPartitions with different indexes is slow.
you could only do the deep check if the hashcodes are the same, and design hashcodes that do not take all elements into account. the alternative seems to be putting cache statements all over graphx, as is currently the case, which is trouble for any long-lived application where caching is carefully managed. I think? I am currently forced to do unpersists on vertices after almost every intermediate graph transformation, or accept my rdd cache getting polluted On Jul 7, 2014 12:03 AM, Ankur Dave ankurd...@gmail.com wrote: Well, the alternative is to do a deep equality check on the index arrays, which would be somewhat expensive since these are pretty large arrays (one element per vertex in the graph). But, in case the reference equality check fails, it actually might be a good idea to do the deep check before resorting to the slow code path. Ankur http://www.ankurdave.com/
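a minimal sketch of the hashcode-guard idea (assuming the hash is precomputed over a cheap subset of the index, so most unequal indexes are rejected without a full scan):

  // fast path: reference equality; then the hash guard; only then the full scan
  def sameIndex(a: Array[Long], aHash: Int, b: Array[Long], bHash: Int): Boolean =
    (a eq b) || (aHash == bHash && java.util.Arrays.equals(a, b))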
tiers of caching
i noticed that some algorithms such as graphx liberally cache RDDs for efficiency, which makes sense. however it can also leave a long trail of unused yet cached RDDs that might push other RDDs out of memory. in a long-lived spark context i would like to decide which RDDs stick around. would it make sense to create tiers of caching, to distinguish RDDs explicitly cached by the application from RDDs that are temporarily cached by algos, so as to make sure these temporary caches don't push application RDDs out of memory?
Re: spark-assembly libraries conflict with needed libraries
spark has a setting to put user jars in front of the classpath, which should do the trick. however i had no luck with this. see here: https://issues.apache.org/jira/browse/SPARK-1863 On Mon, Jul 7, 2014 at 1:31 PM, Robert James srobertja...@gmail.com wrote: spark-submit includes a spark-assembly uber jar, which has older versions of many common libraries. These conflict with some of the dependencies we need. I have been racking my brain trying to find a solution (including experimenting with ProGuard), but haven't been able to: when we use spark-submit, we get NoSuchMethodErrors, even though the code compiles fine, because the runtime classes are different from the compile time classes! Can someone recommend a solution? We are using scala, sbt, and sbt-assembly, but are happy to use another tool (please provide instructions on how to).
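for reference, the setting i mean, sketched on a SparkConf (the exact key is from my reading of the 1.0 docs, where it is marked experimental, so treat it as an assumption):

  import org.apache.spark.SparkConf

  val conf = new SparkConf().set("spark.files.userClassPathFirst", "true")  // give user jars precedence in executors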
acl for spark ui
i was testing using the acl for the spark ui in secure mode on yarn in client mode. it works great. my spark 1.0.0 configuration has:
spark.authenticate = true
spark.ui.acls.enable = true
spark.ui.view.acls = koert
spark.ui.filters = org.apache.hadoop.security.authentication.server.AuthenticationFilter
spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params = type=kerberos,kerberos.principal=HTTP/mybox@MYDOMAIN,kerberos.keytab=/some/keytab
i confirmed that i can access the ui from firefox after doing kinit. however i also saw this in the logs of my driver program:
2014-07-07 17:21:56 DEBUG server.Server: RESPONSE /broadcast_0 401 handled=true
and
2014-07-07 17:21:56 DEBUG server.Server: REQUEST /jars/somejar-assembly-0.1-SNAPSHOT.jar on BlockingHttpConnection@3d6396f5,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-5,l=10,c=0},r=1
2014-07-07 17:21:56 DEBUG server.Server: RESPONSE /jars/somejar-assembly-0.1-SNAPSHOT.jar 401 handled=true
what does this mean? is the webserver also responsible for handing out other stuff such as broadcast variables and jars, and is this now being rejected by my servlet filter? that's not good... the 401 response is exactly the same one i see when i try to access the website after kdestroy. for example:
2014-07-07 17:35:08 DEBUG server.AuthenticationFilter: Request [http://mybox:5001/] triggering authentication
2014-07-07 17:35:08 DEBUG server.Server: RESPONSE / 401 handled=true
Re: graphx Joining two VertexPartitions with different indexes is slow.
probably a dumb question, but why is reference equality used for the indexes? On Sun, Jul 6, 2014 at 12:43 AM, Ankur Dave ankurd...@gmail.com wrote: When joining two VertexRDDs with identical indexes, GraphX can use a fast code path (a zip join without any hash lookups). However, the check for identical indexes is performed using reference equality. Without caching, two copies of the index are created. Although the two indexes are structurally identical, they fail reference equality, and so GraphX mistakenly uses the slow path involving a hash lookup per joined element. I'm working on a patch https://github.com/apache/spark/pull/1297 that attempts an optimistic zip join with per-element fallback to hash lookups, which would improve this situation. Ankur http://www.ankurdave.com/
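a rough sketch of the optimistic-join idea, in plain scala over sorted long-id index arrays (this is my own illustration of the approach, not the code in the pull request):

  def optimisticZipJoin[V](aIds: Array[Long], aVals: Array[V],
                           bIds: Array[Long], bVals: Array[V])
                          (f: (V, V) => V): Iterator[(Long, V)] = {
    lazy val bPos: Map[Long, Int] = bIds.zipWithIndex.toMap  // fallback lookup, built only on the first mismatch
    Iterator.range(0, aIds.length).flatMap { i =>
      val id = aIds(i)
      val j =
        if (i < bIds.length && bIds(i) == id) i  // optimistic: same position in both indexes
        else bPos.getOrElse(id, -1)              // per-element fallback: hash lookup
      if (j >= 0) Some((id, f(aVals(i), bVals(j)))) else None
    }
  }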
Re: taking top k values of rdd
hey nick, you are right. i didn't explain myself well and my code example was wrong... i am keeping a priority-queue with k items per partition (using com.twitter.algebird.mutable.PriorityQueueMonoid.build to limit the sizes of the queues). but this still means i am sending k items per partition to my driver, so k x p, while i only need k. thanks! koert On Sat, Jul 5, 2014 at 1:21 PM, Nick Pentreath nick.pentre...@gmail.com wrote: To make it efficient in your case you may need to do a bit of custom code to emit the top k per partition and then only send those to the driver. On the driver you can just top k the combined top k from each partition (assuming you have (object, count) for each top k list). On Sat, Jul 5, 2014 at 10:17 AM, Koert Kuipers ko...@tresata.com wrote: my initial approach to taking the top k values of an rdd was using a priority-queue monoid, along these lines: rdd.mapPartitions({ items => Iterator.single(new PriorityQueue(...)) }, false).reduce(monoid.plus) this works fine, but looking at the code for reduce it first reduces within a partition (which doesn't help me) and then sends the results to the driver where these again get reduced. this means that for every partition the (potentially very bulky) priorityqueue gets shipped to the driver. my driver is client side, not inside the cluster, and i cannot change this, so this shipping to the driver of all these queues can be expensive. is there a better way to do this? should i try a shuffle first to reduce the partitions to the minimal amount (since the number of queues shipped is equal to the number of partitions)? is there a way to reduce to a single-item RDD, so the queues stay inside the cluster and i can retrieve the final result with RDD.first?
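nick's suggestion as a minimal sketch, with a plain sort standing in for the priority queue (still k x p traffic to the driver, but each shipped piece is at most k items):

  import org.apache.spark.rdd.RDD

  def topK[T](rdd: RDD[(T, Long)], k: Int): Seq[(T, Long)] = {
    rdd.mapPartitions { items =>
      Iterator.single(items.toSeq.sortBy(-_._2).take(k))  // at most k items leave each partition
    }.reduce { (a, b) =>
      (a ++ b).sortBy(-_._2).take(k)  // pairwise merges never hold more than 2k items
    }
  }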
Re: graphx Joining two VertexPartitions with different indexes is slow.
thanks for replying. why is joining two vertexrdds without caching slow? what is recomputed unnecessarily? i am not sure what is different here from joining 2 regular RDDs (where nobody seems to recommend caching before joining, i think...) On Thu, Jul 3, 2014 at 10:52 PM, Ankur Dave ankurd...@gmail.com wrote: Oh, I just read your message more carefully and noticed that you're joining a regular RDD with a VertexRDD. In that case I'm not sure why the warning is occurring, but it might be worth caching both operands (graph.vertices and the regular RDD) just to be sure. Ankur http://www.ankurdave.com/
Re: MLLib : Math on Vector and Matrix
i did the second option: re-implemented .toBreeze as .breeze using pimp classes On Wed, Jul 2, 2014 at 5:00 PM, Thunder Stumpges thunder.stump...@gmail.com wrote: I am upgrading from Spark 0.9.0 to 1.0 and I had a pretty good amount of code working with internals of MLLib. One of the big changes was the move from the old jblas.Matrix to the Vector/Matrix classes included in MLLib. However I don't see how we're supposed to use them for ANYTHING other than a container for passing data to the included APIs... how do we do any math on them? Looking at the internal code, there are quite a number of private[mllib] declarations including access to the Breeze representations of the classes. Was there a good reason this was not exposed? I could see maybe not wanting to expose the 'toBreeze' function which would tie it to the breeze implementation, however it would be nice to have the various mathematics wrapped at least. Right now I see no way to code any vector/matrix math without moving my code namespaces into org.apache.spark.mllib or duplicating the code in 'toBreeze' in my own util functions. Not very appealing. What are others doing? thanks, Thunder
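a minimal sketch of the pimp-class approach, mirroring what the private toBreeze does (my own reconstruction from the mllib vector classes, so treat the pattern match as an assumption):

  import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
  import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

  object VectorOps {
    implicit class RichVector(v: Vector) {
      def breeze: BV[Double] = v match {
        case d: DenseVector  => new BDV(d.values)                     // wraps the same backing array
        case s: SparseVector => new BSV(s.indices, s.values, s.size)
      }
    }
  }

with import VectorOps._ in scope, any mllib Vector then gets a .breeze method for doing math.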
why is toBreeze private everywhere in mllib?
it's kind of handy to be able to convert stuff to breeze... is there some other way i am supposed to access that functionality?
Re: Spark's Hadoop Dependency
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % versionSpark % "provided" exclude("org.apache.hadoop", "hadoop-client"),
  "org.apache.hadoop" % "hadoop-client" % versionHadoop % "provided"
)
On Wed, Jun 25, 2014 at 11:26 AM, Robert James srobertja...@gmail.com wrote: To add Spark to an SBT project, I do: libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided" How do I make sure that the spark version which will be downloaded will depend on, and use, Hadoop 2, and not Hadoop 1? Even with a line: libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0" I still see SBT downloading Hadoop 1: [debug] == resolving dependencies org.apache.spark#spark-core_2.10;1.0.0-org.apache.hadoop#hadoop-client;1.0.4 [compile-master(*)] [debug] dependency descriptor has been mediated: dependency: org.apache.hadoop#hadoop-client;2.4.0 {compile=[default(compile)]} => dependency: org.apache.hadoop#hadoop-client;1.0.4 {compile=[default(compile)]}
graphx Joining two VertexPartitions with different indexes is slow.
lately i am seeing a lot of this warning in graphx: org.apache.spark.graphx.impl.ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow. i am using Graph.outerJoinVertices to join in data from a regular RDD (that is co-partitioned). i would like this operation to be fast, since i use it frequently. should i be doing something different?
Re: Using Spark as web app backend
run your spark app in client mode together with a spray rest service that the front end can talk to On Tue, Jun 24, 2014 at 3:12 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, So far, I run my spark jobs with the spark-shell or spark-submit command. I'd like to go further and I wonder how to use spark as the backend of a web application. Specifically, I want a frontend application (built with nodejs) to communicate with spark on the backend, so that every query from the frontend is routed to spark. And the results from Spark are sent back to the frontend. Have some of you already experimented with this kind of architecture? Cheers, Jaonary
Re: trying to understand yarn-client mode
thanks! i will try that. i guess what i am most confused about is why the executors are trying to retrieve the jars directly using the info i provided to add jars to my spark context. i mean, that's bound to fail, no? i could be on a different machine (so my file:// path isn't going to work for them), or i could have the jars in a directory that is only readable by me. how come the jars are not just shipped to yarn as part of the job submission? i am worried i am supposed to put the jars in a central location and yarn is going to fetch them from there, leading to jars in yet another place such as on hdfs, which i find pretty messy. On Thu, Jun 19, 2014 at 2:54 PM, Marcelo Vanzin van...@cloudera.com wrote: Coincidentally, I just ran into the same exception. What's probably happening is that you're specifying some jar file in your job as an absolute local path (e.g. just /home/koert/test-assembly-0.1-SNAPSHOT.jar), but your Hadoop config has the default FS set to HDFS. So your driver does not know that it should tell executors to download that file from the driver. If you specify the jar with the file: scheme that should solve the problem. On Thu, Jun 19, 2014 at 10:22 AM, Koert Kuipers ko...@tresata.com wrote: i am trying to understand how yarn-client mode works. i am not using Application application_1403117970283_0014 failed 2 times due to AM Container for appattempt_1403117970283_0014_02 exited with exitCode: -1000 due to: File file:/home/koert/test-assembly-0.1-SNAPSHOT.jar does not exist .Failing this attempt.. Failing the application. -- Marcelo
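marcelo's suggestion, sketched in code (using the jar path from the error above; the file: scheme makes explicit that the jar lives on the driver's local filesystem):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setJars(Seq("file:///home/koert/test-assembly-0.1-SNAPSHOT.jar"))  // explicit scheme instead of a bare path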
spark on yarn is trying to use file:// instead of hdfs://
i noticed that when i submit a job to yarn it mistakenly tries to upload files to the local filesystem instead of hdfs. what could cause this? in spark-env.sh i have HADOOP_CONF_DIR set correctly (and spark-submit does find yarn), and my core-site.xml has a fs.defaultFS that is hdfs, not the local filesystem. thanks! koert
Re: spark on yarn is trying to use file:// instead of hdfs://
yeah sure, see below. i strongly suspect it's something i misconfigured causing yarn to try to use the local filesystem mistakenly.
[koert@cdh5-yarn ~]$ /usr/local/lib/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --executor-cores 1 hdfs://cdh5-yarn/lib/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar 10
14/06/20 12:54:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/06/20 12:54:40 INFO RMProxy: Connecting to ResourceManager at cdh5-yarn.tresata.com/192.168.1.85:8032
14/06/20 12:54:41 INFO Client: Got Cluster metric info from ApplicationsManager (ASM), number of NodeManagers: 1
14/06/20 12:54:41 INFO Client: Queue info ... queueName: root.default, queueCurrentCapacity: 0.0, queueMaxCapacity: -1.0, queueApplicationCount = 0, queueChildQueueCount = 0
14/06/20 12:54:41 INFO Client: Max mem capabililty of a single resource in this cluster 8192
14/06/20 12:54:41 INFO Client: Preparing Local resources
14/06/20 12:54:41 WARN BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
14/06/20 12:54:41 INFO Client: Uploading hdfs://cdh5-yarn/lib/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar to file:/home/koert/.sparkStaging/application_1403201750110_0060/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar
14/06/20 12:54:43 INFO Client: Setting up the launch environment
14/06/20 12:54:43 INFO Client: Setting up container launch context
14/06/20 12:54:43 INFO Client: Command for starting the Spark ApplicationMaster: List($JAVA_HOME/bin/java, -server, -Xmx512m, -Djava.io.tmpdir=$PWD/tmp, -Dspark.akka.retry.wait=\3\, -Dspark.storage.blockManagerTimeoutIntervalMs=\12\, -Dspark.storage.blockManagerHeartBeatMs=\12\, -Dspark.app.name=\org.apache.spark.examples.SparkPi\, -Dspark.akka.frameSize=\1\, -Dspark.akka.timeout=\3\, -Dspark.worker.timeout=\3\, -Dspark.akka.logLifecycleEvents=\true\, -Dlog4j.configuration=log4j-spark-container.properties, org.apache.spark.deploy.yarn.ApplicationMaster, --class, org.apache.spark.examples.SparkPi, --jar, hdfs://cdh5-yarn/lib/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar, --args '10', --executor-memory, 1024, --executor-cores, 1, --num-executors, 3, 1, LOG_DIR/stdout, 2, LOG_DIR/stderr)
14/06/20 12:54:43 INFO Client: Submitting application to ASM
14/06/20 12:54:43 INFO YarnClientImpl: Submitted application application_1403201750110_0060
14/06/20 12:54:44 INFO Client: Application report from ASM: application identifier: application_1403201750110_0060 appId: 60 clientToAMToken: null appDiagnostics: appMasterHost: N/A appQueue: root.koert appMasterRpcPort: -1 appStartTime: 1403283283505 yarnAppState: ACCEPTED distributedFinalState: UNDEFINED appTrackingUrl: http://cdh5-yarn.tresata.com:8088/proxy/application_1403201750110_0060/ appUser: koert
14/06/20 12:54:45 INFO Client: Application report from ASM: application identifier: application_1403201750110_0060 appId: 60 clientToAMToken: null appDiagnostics: appMasterHost: N/A appQueue: root.koert appMasterRpcPort: -1 appStartTime: 1403283283505 yarnAppState: ACCEPTED distributedFinalState: UNDEFINED appTrackingUrl: http://cdh5-yarn.tresata.com:8088/proxy/application_1403201750110_0060/ appUser: koert
14/06/20 12:54:46 INFO Client: Application report from ASM: application identifier: application_1403201750110_0060 appId: 60 clientToAMToken: null appDiagnostics: appMasterHost: N/A appQueue: root.koert appMasterRpcPort: -1 appStartTime: 1403283283505 yarnAppState: ACCEPTED distributedFinalState: UNDEFINED appTrackingUrl: http://cdh5-yarn.tresata.com:8088/proxy/application_1403201750110_0060/ appUser: koert
14/06/20 12:54:47 INFO Client: Application report from ASM: application identifier: application_1403201750110_0060 appId: 60 clientToAMToken: null appDiagnostics: Application application_1403201750110_0060 failed 2 times due to AM Container for appattempt_1403201750110_0060_02 exited with exitCode: -1000 due to: File file:/home/koert/.sparkStaging/application_1403201750110_0060/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar does not exist .Failing this attempt.. Failing the application. appMasterHost: N/A appQueue: root.koert appMasterRpcPort: -1 appStartTime: 1403283283505 yarnAppState: FAILED distributedFinalState: FAILED appTrackingUrl: cdh5-yarn.tresata.com:8088/cluster/app/application_1403201750110_0060 appUser: koert
On Fri, Jun 20, 2014 at 12:42 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Koert, Could you provide more details? Job arguments, log messages, errors, etc. On Fri, Jun 20, 2014 at 9:40 AM, Koert Kuipers ko...@tresata.com wrote: i noticed
Re: spark on yarn is trying to use file:// instead of hdfs://
ok solved it. as it happened, in spark/conf i also had a file called core-site.xml (with some tachyon related stuff in it), so that's why it ignored /etc/hadoop/conf/core-site.xml On Fri, Jun 20, 2014 at 3:24 PM, Koert Kuipers ko...@tresata.com wrote: i put some logging statements in yarn.Client and that confirms it's using the local filesystem: 14/06/20 15:20:33 INFO Client: fs.defaultFS is file:/// so somehow fs.defaultFS is not being picked up from /etc/hadoop/conf/core-site.xml, but spark does correctly pick up yarn.resourcemanager.hostname from /etc/hadoop/conf/yarn-site.xml strange! On Fri, Jun 20, 2014 at 1:26 PM, Koert Kuipers ko...@tresata.com wrote: in /etc/hadoop/conf/core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://cdh5-yarn.tresata.com:8020</value>
</property>
also hdfs seems the default:
[koert@cdh5-yarn ~]$ hadoop fs -ls /
Found 5 items
drwxr-xr-x - hdfs supergroup 0 2014-06-19 12:31 /data
drwxrwxrwt - hdfs supergroup 0 2014-06-20 12:17 /lib
drwxrwxrwt - hdfs supergroup 0 2014-06-18 14:58 /tmp
drwxr-xr-x - hdfs supergroup 0 2014-06-18 15:02 /user
drwxr-xr-x - hdfs supergroup 0 2014-06-18 14:59 /var
and in my spark-env.sh: export HADOOP_CONF_DIR=/etc/hadoop/conf On Fri, Jun 20, 2014 at 1:04 PM, bc Wong bcwal...@cloudera.com wrote: Koert, is there any chance that your fs.defaultFS isn't set up right? On Fri, Jun 20, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.com wrote: yeah sure, see below. i strongly suspect it's something i misconfigured causing yarn to try to use the local filesystem mistakenly.
Re: Running Spark alongside Hadoop
for development/testing i think it's fine to run them side by side as you suggested, using spark standalone. just be realistic about what size data you can load with limited RAM. On Fri, Jun 20, 2014 at 3:43 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: The ideal way to do that is to use a cluster manager like Yarn or Mesos. You can control how many resources to give to which node etc. You should be able to run both together in standalone mode, however you may experience varying latency and performance in the cluster as both MR and Spark demand resources from the same machines etc. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 20, 2014 at 3:41 PM, Sameer Tilak ssti...@live.com wrote: Dear Spark users, I have a small 4 node Hadoop cluster. Each node is a VM -- 4 virtual cores, 8GB memory and 500GB disk. I am currently running Hadoop on it. I would like to run Spark (in standalone mode) alongside Hadoop on the same nodes. Given the configuration of my nodes, will that work? Does anyone have any experience in terms of stability and performance of running Spark and Hadoop on somewhat resource-constrained nodes? I was looking at the Spark documentation and there is a way to configure memory and cores for the worker nodes and memory for the master node: SPARK_WORKER_CORES, SPARK_WORKER_MEMORY, SPARK_DAEMON_MEMORY. Any recommendations on how to share resources between Hadoop and Spark?
Re: little confused about SPARK_JAVA_OPTS alternatives
still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark standalone. for example, if i have an akka timeout setting that i would like to be applied to every piece of the spark framework (so spark master, spark workers, spark executor sub-processes, spark-shell, etc.), i used to do that with SPARK_JAVA_OPTS. now i am unsure. SPARK_DAEMON_JAVA_OPTS works for the master and workers, but not for the spark-shell i think? i tried using SPARK_DAEMON_JAVA_OPTS, and it does not seem that useful. for example, for a worker it does not apply the settings to the executor sub-processes, while SPARK_JAVA_OPTS does do that. so it seems like SPARK_JAVA_OPTS is my only way to change settings for the executors, yet it's deprecated? On Wed, Jun 11, 2014 at 10:59 PM, elyast lukasz.jastrzeb...@gmail.com wrote: Hi, I tried to use SPARK_JAVA_OPTS in spark-env.sh as well as the conf/java-opts file to set additional java system properties. In this case I could connect to tachyon without any problem. However when I tried setting executor and driver extraJavaOptions in spark-defaults.conf it doesn't work. I suspect the root cause may be the following: SparkSubmit doesn't fork an additional JVM to actually run either the driver or executor process, and additional system properties are set after the JVM is created and other classes are loaded. It may happen that the Tachyon CommonConf class is already loaded, and since it's a singleton it won't pick up changes to system properties. Please let me know what you think. Can I use conf/java-opts? since it's not really documented anywhere? Best regards, Lukasz
Re: trying to understand yarn-client mode
db tsai, if in yarn-cluster mode the driver runs inside yarn, how can you do an rdd.collect and bring the results back to your application? On Thu, Jun 19, 2014 at 2:33 PM, DB Tsai dbt...@stanford.edu wrote: We are submitting the spark job in our tomcat application using yarn-cluster mode with great success. As Kevin said, yarn-client mode runs the driver in your local JVM, and it will have really bad network overhead when one does a reduce action which will pull all the result from executors to your local JVM. Also, since you can only have one spark context object in one JVM, it will be tricky to run multiple spark jobs concurrently with yarn-client mode. This is how we submit a spark job with yarn-cluster mode. Please use the current master code, otherwise, after the job is finished, spark will kill the JVM and exit your app. We set up the configuration of spark in a scala map.
def getArgsFromConf(conf: Map[String, String]): Array[String] = {
  Array[String](
    "--jar", conf.get("app.jar").getOrElse(""),
    "--addJars", conf.get("spark.addJars").getOrElse(""),
    "--class", conf.get("spark.mainClass").getOrElse(""),
    "--num-executors", conf.get("spark.numWorkers").getOrElse("1"),
    "--driver-memory", conf.get("spark.masterMemory").getOrElse("1g"),
    "--executor-memory", conf.get("spark.workerMemory").getOrElse("1g"),
    "--executor-cores", conf.get("spark.workerCores").getOrElse("1"))
}
System.setProperty("SPARK_YARN_MODE", "true")
val sparkConf = new SparkConf
val args = getArgsFromConf(conf)
new Client(new ClientArguments(args, sparkConf), hadoopConfig, sparkConf).run
Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Jun 19, 2014 at 11:22 AM, Kevin Markey kevin.mar...@oracle.com wrote: Yarn client is much like Spark client mode, except that the executors are running in Yarn containers managed by the Yarn resource manager on the cluster instead of as Spark workers managed by the Spark master. The driver executes as a local client in your local JVM. It communicates with the workers on the cluster. Transformations are scheduled on the cluster by the driver's logic. Actions involve communication between local driver and remote cluster executors. So, there is some additional network overhead, especially if the driver is not co-located on the cluster. In yarn-cluster mode, in contrast, the driver is executed as a thread in a Yarn application master on the cluster. In either case, the assembly JAR must be available to the application on the cluster. Best to copy it to HDFS and specify its location by exporting it as SPARK_JAR. Kevin Markey On 06/19/2014 11:22 AM, Koert Kuipers wrote: i am trying to understand how yarn-client mode works. i am not using spark-submit, but instead launching a spark job from within my own application. i can see my application contacting yarn successfully, but then in yarn i get an immediate error: Application application_1403117970283_0014 failed 2 times due to AM Container for appattempt_1403117970283_0014_02 exited with exitCode: -1000 due to: File file:/home/koert/test-assembly-0.1-SNAPSHOT.jar does not exist .Failing this attempt.. Failing the application. why is yarn trying to fetch my jar, and why as a local file? i would expect the jar to be sent to yarn over the wire upon job submission?
Re: trying to understand yarn-client mode
okay. since for us the main purpose is to retrieve (small) data, i guess i will stick to yarn-client mode. thx On Thu, Jun 19, 2014 at 3:19 PM, DB Tsai dbt...@stanford.edu wrote: Currently, we save the result in HDFS, and read it back in our application. Since Client.run is a blocking call, it's easy to do it this way. We are now working on using akka to bring back the result to the app without going through HDFS, and we can use akka to track the log, and stack trace, etc. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Jun 19, 2014 at 12:08 PM, Koert Kuipers ko...@tresata.com wrote: db tsai, if in yarn-cluster mode the driver runs inside yarn, how can you do an rdd.collect and bring the results back to your application? On Thu, Jun 19, 2014 at 2:33 PM, DB Tsai dbt...@stanford.edu wrote: We are submitting the spark job in our tomcat application using yarn-cluster mode with great success. As Kevin said, yarn-client mode runs the driver in your local JVM, and it will have really bad network overhead when one does a reduce action which will pull all the result from executors to your local JVM. Also, since you can only have one spark context object in one JVM, it will be tricky to run multiple spark jobs concurrently with yarn-client mode. This is how we submit a spark job with yarn-cluster mode. Please use the current master code, otherwise, after the job is finished, spark will kill the JVM and exit your app. We set up the configuration of spark in a scala map.
def getArgsFromConf(conf: Map[String, String]): Array[String] = {
  Array[String](
    "--jar", conf.get("app.jar").getOrElse(""),
    "--addJars", conf.get("spark.addJars").getOrElse(""),
    "--class", conf.get("spark.mainClass").getOrElse(""),
    "--num-executors", conf.get("spark.numWorkers").getOrElse("1"),
    "--driver-memory", conf.get("spark.masterMemory").getOrElse("1g"),
    "--executor-memory", conf.get("spark.workerMemory").getOrElse("1g"),
    "--executor-cores", conf.get("spark.workerCores").getOrElse("1"))
}
System.setProperty("SPARK_YARN_MODE", "true")
val sparkConf = new SparkConf
val args = getArgsFromConf(conf)
new Client(new ClientArguments(args, sparkConf), hadoopConfig, sparkConf).run
Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Jun 19, 2014 at 11:22 AM, Kevin Markey kevin.mar...@oracle.com wrote: Yarn client is much like Spark client mode, except that the executors are running in Yarn containers managed by the Yarn resource manager on the cluster instead of as Spark workers managed by the Spark master. The driver executes as a local client in your local JVM. It communicates with the workers on the cluster. Transformations are scheduled on the cluster by the driver's logic. Actions involve communication between local driver and remote cluster executors. So, there is some additional network overhead, especially if the driver is not co-located on the cluster. In yarn-cluster mode, in contrast, the driver is executed as a thread in a Yarn application master on the cluster. In either case, the assembly JAR must be available to the application on the cluster. Best to copy it to HDFS and specify its location by exporting it as SPARK_JAR. Kevin Markey On 06/19/2014 11:22 AM, Koert Kuipers wrote: i am trying to understand how yarn-client mode works. i am not using spark-submit, but instead launching a spark job from within my own application. i can see my application contacting yarn successfully, but then in yarn i get an immediate error: Application application_1403117970283_0014 failed 2 times due to AM Container for appattempt_1403117970283_0014_02 exited with exitCode: -1000 due to: File file:/home/koert/test-assembly-0.1-SNAPSHOT.jar does not exist .Failing this attempt.. Failing the application. why is yarn trying to fetch my jar, and why as a local file? i would expect the jar to be sent to yarn over the wire upon job submission?
Re: little confused about SPARK_JAVA_OPTS alternatives
for a jvm application it's not very appealing to me to use spark-submit. my application uses hadoop, so i should use hadoop jar, and my application uses spark, so i should use spark-submit. if i add a piece of code that uses some other system there will be yet another suggested way to launch it. that's not very scalable, since i can only launch it one way in the end... On Thu, Jun 19, 2014 at 4:58 PM, Andrew Or and...@databricks.com wrote: Hi Koert and Lukasz, The recommended way of not hard-coding configurations in your application is through conf/spark-defaults.conf as documented here: http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties. However, this is only applicable to spark-submit, so this may not be useful to you. Depending on how you launch your Spark applications, you can work around this by manually specifying these configs as -Dspark.x=y in your java command to launch Spark. This is actually how SPARK_JAVA_OPTS used to work before 1.0. Note that spark-submit does essentially the same thing, but sets these properties programmatically by reading from the conf/spark-defaults.conf file and calling System.setProperty("spark.x", "y"). Note that spark.executor.extraJavaOpts is not intended for spark configuration (see http://spark.apache.org/docs/latest/configuration.html). SPARK_DAEMON_JAVA_OPTS, as you pointed out, is for Spark daemons like the standalone master, worker, and the history server; it is also not intended for spark configurations to be picked up by Spark executors and drivers. In general, any reference to java opts in any variable or config refers to java options, as the name implies, not Spark configuration. Unfortunately, it just so happened that we used to mix the two in the same environment variable before 1.0. Is there a reason you're not using spark-submit? Is it for legacy reasons? As of 1.0, most changes to launching Spark applications will be done through spark-submit, so you may miss out on relevant new features or bug fixes. Andrew 2014-06-19 7:41 GMT-07:00 Koert Kuipers ko...@tresata.com: still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark standalone. for example, if i have an akka timeout setting that i would like to be applied to every piece of the spark framework (so spark master, spark workers, spark executor sub-processes, spark-shell, etc.), i used to do that with SPARK_JAVA_OPTS. now i am unsure. SPARK_DAEMON_JAVA_OPTS works for the master and workers, but not for the spark-shell i think? i tried using SPARK_DAEMON_JAVA_OPTS, and it does not seem that useful. for example, for a worker it does not apply the settings to the executor sub-processes, while SPARK_JAVA_OPTS does do that. so it seems like SPARK_JAVA_OPTS is my only way to change settings for the executors, yet it's deprecated? On Wed, Jun 11, 2014 at 10:59 PM, elyast lukasz.jastrzeb...@gmail.com wrote: Hi, I tried to use SPARK_JAVA_OPTS in spark-env.sh as well as the conf/java-opts file to set additional java system properties. In this case I could connect to tachyon without any problem. However when I tried setting executor and driver extraJavaOptions in spark-defaults.conf it doesn't work. I suspect the root cause may be the following: SparkSubmit doesn't fork an additional JVM to actually run either the driver or executor process, and additional system properties are set after the JVM is created and other classes are loaded. It may happen that the Tachyon CommonConf class is already loaded, and since it's a singleton it won't pick up changes to system properties. Please let me know what you think. Can I use conf/java-opts? since it's not really documented anywhere? Best regards, Lukasz
Re: life if an executor
if they are tied to the spark context, then why can the subprocess not be started up with the extra jars (sc.addJars) already on the classpath? this way a switch like user-jars-first would be a simple rearranging of the classpath for the subprocess, and the messing with classloaders that is currently done in the executor (which forces people to use reflection in certain situations and is broken if you want user jars first) would be history On May 20, 2014 1:07 AM, Matei Zaharia matei.zaha...@gmail.com wrote: They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx
Re: life if an executor
just for my clarification: off-heap cannot be java objects, correct? so we are always talking about serialized off-heap storage? On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com wrote: That's one of the main motivations for using Tachyon ;) http://tachyon-project.org/ It gives off-heap in-memory caching. And starting Spark 0.9, you can cache any RDD in Tachyon just by specifying the appropriate StorageLevel. TD On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com wrote: I guess it needs to be this way to benefit from caching of RDDs in memory. It would be nice however if the RDD cache could be dissociated from the JVM heap, so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a fresh one. On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote: They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx
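the StorageLevel TD mentions, as a minimal sketch (OFF_HEAP is the tachyon-backed level, if i read the 1.0 docs right, and it is always serialized):

  import org.apache.spark.storage.StorageLevel

  rdd.persist(StorageLevel.OFF_HEAP)  // cache in tachyon, outside the jvm heap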
Re: life of an executor
interesting, so it sounds to me like spark is forced to choose between the ability to add jars during its lifetime and the ability to run tasks with user classpath first (which is important for the ability to run jobs on spark clusters not under your control, so for the viability of 3rd party spark apps) On Tue, May 20, 2014 at 1:06 PM, Aaron Davidson ilike...@gmail.com wrote: One issue is that new jars can be added during the lifetime of a SparkContext, which can mean after executors are already started. Off-heap storage is always serialized, correct. On Tue, May 20, 2014 at 6:48 AM, Koert Kuipers ko...@tresata.com wrote: just for my clarification: off heap cannot be java objects, correct? so we are always talking about serialized off-heap storage? On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com wrote: That's one of the main motivations for using Tachyon ;) http://tachyon-project.org/ It gives off-heap in-memory caching. And starting with Spark 0.9, you can cache any RDD in Tachyon just by specifying the appropriate StorageLevel. TD On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com wrote: I guess it needs to be this way to benefit from caching of RDDs in memory. It would be nice however if the RDD cache could be dissociated from the JVM heap, so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a fresh one. On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote: They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx
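For reference, the user-jars-first switch under discussion is just a conf setting; a minimal sketch, using the property name from the serialization thread further down:

    import org.apache.spark.SparkConf

    // load classes from user-supplied jars before the spark assembly;
    // as this thread notes, it interacts badly with jars added later via addJar
    val conf = new SparkConf().set("spark.files.userClassPathFirst", "true")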
Re: File present but file not found exception
why does it need to be a local file? why not do some filter ops on the hdfs file and save to hdfs, from where you can create an rdd? you can read a small file in on the driver program and use sc.parallelize to turn it into an RDD On May 16, 2014 7:01 PM, Sai Prasanna ansaiprasa...@gmail.com wrote: I found that if a file is present in all the nodes in the given path in localFS, then reading is possible. But is there a way to read if the file is present only in certain nodes ?? [There should be a way !!] NEED: wanted to do some filter ops on an HDFS file, create a local file of the result, create an RDD out of it and operate. Is there any way out ?? Thanks in advance ! On Fri, May 9, 2014 at 12:18 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi Everyone, I think all are pretty busy, the response time in this group has slightly increased. But anyways, this is a pretty silly problem, but i could not get over it. I have a file in my localFS, but when i try to create an RDD out of it, the task fails and a file-not-found exception is thrown in the log files. var file = sc.textFile("file:///home/sparkcluster/spark/input.txt"); file.top(1); input.txt exists in the above folder but still Spark couldn't find it. Do some parameters need to be set ?? Any help is really appreciated. Thanks !!
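A short sketch of the suggestion above: read the small file once, on the driver (so it only has to exist on that one machine), and distribute its lines with sc.parallelize. The path is reused from the question.

    import scala.io.Source

    // driver-side read: the file only needs to exist where the driver runs
    val lines = Source.fromFile("/home/sparkcluster/spark/input.txt").getLines().toList
    val rdd = sc.parallelize(lines) // ship the in-memory lines out as an RDD
    rdd.top(1)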
life of an executor
from looking at the source code i see executors run in their own jvm subprocesses. how long do they live for? as long as the worker/slave? or are they tied to the sparkcontext and live/die with it? thx
cant get tests to pass anymore on master
i used to be able to get all tests to pass. with java 6 and sbt i get PermGen errors (no matter how high i make the PermGen). so i have given up on that. with java 7 i see 1 error in a bagel test and a few in streaming tests. any ideas? see the error in BagelSuite below. [info] - large number of iterations *** FAILED *** (10 seconds, 105 milliseconds) [info] The code passed to failAfter did not complete within 10 seconds. (BagelSuite.scala:85) [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: [info] at org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250) [info] at org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250) [info] at org.scalatest.concurrent.Timeouts$class.timeoutAfter(Timeouts.scala:282) [info] at org.scalatest.concurrent.Timeouts$class.failAfter(Timeouts.scala:246) [info] at org.apache.spark.bagel.BagelSuite.failAfter(BagelSuite.scala:32) [info] at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply$mcV$sp(BagelSuite.scala:85) [info] at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85) [info] at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85) [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974) [info] at org.apache.spark.bagel.BagelSuite.withFixture(BagelSuite.scala:32)
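For what it's worth, a hedged guess at taming the java 6 PermGen failures: forked test JVMs take their options from sbt's javaOptions, so something like the following in build.sbt may help (the size is illustrative; sbt's own JVM needs the same flag via the SBT_OPTS environment variable instead):

    // build.sbt (sbt 0.13): run tests in a forked jvm with a larger permgen
    fork in Test := true
    javaOptions in Test += "-XX:MaxPermSize=512m"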
Re: cant get tests to pass anymore on master
i tried on a few different machines, including a server, all same ubuntu and same java, and got same errors. i also tried modifying the timeouts in the unit tests and it did not help. ok i will try blowing away local maven repo and do clean. On Thu, May 15, 2014 at 12:49 PM, Sean Owen so...@cloudera.com wrote: Since the error concerns a timeout -- is the machine slowish? What about blowing away everything in your local maven repo, do a clean, etc. to rule out environment issues? I'm on OS X here FWIW. On Thu, May 15, 2014 at 5:24 PM, Koert Kuipers ko...@tresata.com wrote: yeah sure. it is ubuntu 12.04 with jdk1.7.0_40 what else is relevant that i can provide? On Thu, May 15, 2014 at 12:17 PM, Sean Owen so...@cloudera.com wrote: FWIW I see no failures. Maybe you can say more about your environment, etc. On Wed, May 7, 2014 at 10:01 PM, Koert Kuipers ko...@tresata.com wrote: i used to be able to get all tests to pass. with java 6 and sbt i get PermGen errors (no matter how high i make the PermGen). so i have given up on that. with java 7 i see 1 error in a bagel test and a few in streaming tests. any ideas? see the error in BagelSuite below. [info] - large number of iterations *** FAILED *** (10 seconds, 105 milliseconds) [info] The code passed to failAfter did not complete within 10 seconds. (BagelSuite.scala:85) [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: [info] at org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250) [info] at org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250) [info] at org.scalatest.concurrent.Timeouts$class.timeoutAfter(Timeouts.scala:282) [info] at org.scalatest.concurrent.Timeouts$class.failAfter(Timeouts.scala:246) [info] at org.apache.spark.bagel.BagelSuite.failAfter(BagelSuite.scala:32) [info] at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply$mcV$sp(BagelSuite.scala:85) [info] at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85) [info] at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85) [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974) [info] at org.apache.spark.bagel.BagelSuite.withFixture(BagelSuite.scala:32)
Re: java serialization errors with spark.files.userClassPathFirst=true
after removing all class parameters of type Path from my code, i tried again. a different but related error when i set spark.files.userClassPathFirst=true. now i dont even use FileInputFormat directly. HadoopRDD does... 14/05/16 12:17:17 ERROR Executor: Exception in task ID 45 java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/FileInputFormat at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:792) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.executor.ChildExecutorURLClassLoader.findClass(ExecutorURLClassLoader.scala:51) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:57) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1610) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1481) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1331) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) On Thu, May 15, 2014 at 3:03 PM, Koert Kuipers ko...@tresata.com wrote: when i set spark.files.userClassPathFirst=true, i get java serialization errors in my tasks, see below. when i set userClassPathFirst back to its default of false, the serialization errors are gone. my spark.serializer is KryoSerializer.
the class org.apache.hadoop.fs.Path is in the spark assembly jar, but not in my task jars (the ones i added to the SparkConf). so looks like the ClosureSerializer is having trouble with this class once the ChildExecutorURLClassLoader is used? thats me just guessing. Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:5 failed 4 times, most recent failure: Exception failure in TID 31 on host node05.tresata.com: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/Path java.lang.Class.getDeclaredConstructors0(Native Method) java.lang.Class.privateGetDeclaredConstructors(Class.java:2398) java.lang.Class.getDeclaredConstructors(Class.java:1838) java.io.ObjectStreamClass.computeDefaultSUID(ObjectStreamClass.java:1697) java.io.ObjectStreamClass.access$100(ObjectStreamClass.java:50) java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:203) java.security.AccessController.doPrivileged(Native Method) java.io.ObjectStreamClass.getSerialVersionUID(ObjectStreamClass.java:200) java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:556) java.io.ObjectInputStream.readNonProxyDesc
writing my own RDD
in writing my own RDD i ran into a few issues with respect to stuff being private in spark. in compute i would like to return an iterator that respects task killing (as HadoopRDD does), but the mechanics for that are inside the private InterruptibleIterator. also the exception i am supposed to throw (TaskKilledException) is private to spark.
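For context, a rough sketch of the pattern in question; MyRDD and RangePartition are made-up names. Since InterruptibleIterator is private[spark], the sketch resorts to the usual ugly workaround of living inside spark's own package, which is exactly the kind of thing that marking these APIs public (or @DeveloperAPI, see below) would avoid:

    // the package declaration is the workaround: private[spark] members like
    // InterruptibleIterator are visible from inside org.apache.spark
    package org.apache.spark

    import org.apache.spark.rdd.RDD

    // hypothetical partition carrying a range of ints
    class RangePartition(val index: Int, val from: Int, val until: Int) extends Partition

    class MyRDD(sc: SparkContext, slices: Int) extends RDD[Int](sc, Nil) {
      override def getPartitions: Array[Partition] =
        (0 until slices).map(i => new RangePartition(i, i * 100, (i + 1) * 100)).toArray

      override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
        val p = split.asInstanceOf[RangePartition]
        // wrapping the raw iterator is what makes the task respond to kill
        // requests, the same way HadoopRDD does it
        new InterruptibleIterator(context, Iterator.range(p.from, p.until))
      }
    }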
Re: cant get tests to pass anymore on master
yeah sure. it is ubuntu 12.04 with jdk1.7.0_40 what else is relevant that i can provide? On Thu, May 15, 2014 at 12:17 PM, Sean Owen so...@cloudera.com wrote: FWIW I see no failures. Maybe you can say more about your environment, etc. On Wed, May 7, 2014 at 10:01 PM, Koert Kuipers ko...@tresata.com wrote: i used to be able to get all tests to pass. with java 6 and sbt i get PermGen errors (no matter how high i make the PermGen). so i have given up on that. with java 7 i see 1 error in a bagel test and a few in streaming tests. any ideas? see the error in BagelSuite below. [info] - large number of iterations *** FAILED *** (10 seconds, 105 milliseconds) [info] The code passed to failAfter did not complete within 10 seconds. (BagelSuite.scala:85) [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: [info] at org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250) [info] at org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250) [info] at org.scalatest.concurrent.Timeouts$class.timeoutAfter(Timeouts.scala:282) [info] at org.scalatest.concurrent.Timeouts$class.failAfter(Timeouts.scala:246) [info] at org.apache.spark.bagel.BagelSuite.failAfter(BagelSuite.scala:32) [info] at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply$mcV$sp(BagelSuite.scala:85) [info] at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85) [info] at org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85) [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974) [info] at org.apache.spark.bagel.BagelSuite.withFixture(BagelSuite.scala:32)
Re: java serialization errors with spark.files.userClassPathFirst=true
well, i modified ChildExecutorURLClassLoader to also delegate to parentClassloader if NoClassDefFoundError is thrown... now i get yet another error. i am clearly missing something with these classloaders. such nasty stuff... giving up for now. just going to have to not use spark.files.userClassPathFirst=true for now, until i have more time to look at this. 14/05/16 13:58:59 ERROR Executor: Exception in task ID 3 java.lang.ClassCastException: cannot assign instance of scala.None$ to field org.apache.spark.rdd.RDD.checkpointData of type scala.Option in instance of MyRDD at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083) at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1995) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:60) On Fri, May 16, 2014 at 1:46 PM, Koert Kuipers ko...@tresata.com wrote: after removing all class paramater of class Path from my code, i tried again. different but related eror when i set spark.files.userClassPathFirst=true now i dont even use FileInputFormat directly. HadoopRDD does... 
14/05/16 12:17:17 ERROR Executor: Exception in task ID 45 java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/FileInputFormat at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:792) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.executor.ChildExecutorURLClassLoader.findClass(ExecutorURLClassLoader.scala:51) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:57) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1610) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1481) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1331) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913
Re: java serialization errors with spark.files.userClassPathFirst=true
i do not think the current solution will work. i tried writing a version of ChildExecutorURLClassLoader that does have a proper parent and has a modified loadClass to reverse the order of parent and child in finding classes, and that seems to work, but now classes like SparkEnv are loaded by the child and somehow this means the companion objects are reset or something like that because i get NPEs. On Fri, May 16, 2014 at 3:54 PM, Koert Kuipers ko...@tresata.com wrote: ok i think the issue is visibility: a classloader can see all classes loaded by its parent classloader. but userClassLoader does not have a parent classloader, so its not able to see any classes that parentLoader is responsible for. in my case userClassLoader is trying to get AvroInputFormat which probably somewhere statically references FileInputFormat, which is invisible to userClassLoader. On Fri, May 16, 2014 at 3:32 PM, Koert Kuipers ko...@tresata.com wrote: ok i put lots of logging statements in the ChildExecutorURLClassLoader. this is what i see: * the urls for userClassLoader are correct and includes only my one jar. * for one class that only exists in my jar i see it gets loaded correctly using userClassLoader * for a class that exists in both my jar and spark kernel it tries to use userClassLoader and ends up with a NoClassDefFoundError. the class is org.apache.avro.mapred.AvroInputFormat and the NoClassDefFoundError is for org.apache.hadoop.mapred.FileInputFormat (which the parentClassLoader is responsible for since it is not in my jar). i currently catch this NoClassDefFoundError and call parentClassLoader.loadClass but thats clearly not a solution since it loads the wrong version. On Fri, May 16, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote: well, i modified ChildExecutorURLClassLoader to also delegate to parentClassloader if NoClassDefFoundError is thrown... now i get yet another error. i am clearly missing something with these classloaders. such nasty stuff... giving up for now. just going to have to not use spark.files.userClassPathFirst=true for now, until i have more time to look at this. 
14/05/16 13:58:59 ERROR Executor: Exception in task ID 3 java.lang.ClassCastException: cannot assign instance of scala.None$ to field org.apache.spark.rdd.RDD.checkpointData of type scala.Option in instance of MyRDD at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083) at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1995) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:60) On Fri, May 16, 2014 at 1:46 PM, Koert Kuipers ko...@tresata.comwrote: after removing all class paramater of class Path from my code, i tried again. different but related eror when i set spark.files.userClassPathFirst=true now i dont even use FileInputFormat directly. HadoopRDD does... 14/05/16 12:17:17 ERROR Executor: Exception in task ID 45 java.lang.NoClassDefFoundError: org/apache/hadoop/mapred
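To make the classloader experiment concrete, here is a toy parent-last loader along the lines Koert describes: try the user jars first and fall back to the parent, so parent classes like FileInputFormat stay visible. ParentLastClassLoader is a made-up name and this is a sketch only, not Spark's actual ChildExecutorURLClassLoader; and as the message above explains, it still goes wrong when framework classes like SparkEnv get loaded a second time by the child, duplicating their singletons.

    import java.net.{URL, URLClassLoader}

    class ParentLastClassLoader(urls: Array[URL], parent: ClassLoader)
        extends URLClassLoader(urls, parent) {

      override def loadClass(name: String, resolve: Boolean): Class[_] = {
        val c = Option(findLoadedClass(name)).getOrElse {
          try findClass(name)                // user jars first...
          catch {
            case _: ClassNotFoundException =>
              super.loadClass(name, resolve) // ...then fall back to the parent
          }
        }
        if (resolve) resolveClass(c)
        c
      }
    }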
Re: little confused about SPARK_JAVA_OPTS alternatives
hey patrick, i have a SparkConf i can add them to. i was looking for a way to do this where they are not hardwired within scala, which is what SPARK_JAVA_OPTS used to do. i guess if i just set -Dspark.akka.frameSize=1 on my java app launch then it will get picked up by the SparkConf too right? On Wed, May 14, 2014 at 2:54 PM, Patrick Wendell pwend...@gmail.com wrote: Just wondering - how are you launching your application? If you want to set values like this the right way is to add them to the SparkConf when you create a SparkContext. val conf = new SparkConf().set("spark.akka.frameSize", "1").setAppName(...).setMaster(...) val sc = new SparkContext(conf) - Patrick On Wed, May 14, 2014 at 9:09 AM, Koert Kuipers ko...@tresata.com wrote: i have some settings that i think are relevant for my application. they are spark.akka settings so i assume they are relevant for both executors and my driver program. i used to do: SPARK_JAVA_OPTS=-Dspark.akka.frameSize=1 now this is deprecated. the alternatives mentioned are: * some spark-submit settings which are not relevant to me since i do not use spark-submit (i launch spark jobs from an existing application) * spark.executor.extraJavaOptions to set -X options. i am not sure what -X options are, but it doesnt sound like what i need, since its only for executors * SPARK_DAEMON_OPTS to set java options for standalone daemons (i.e. master, worker), that sounds like i should not use it since i am trying to change settings for an app, not a daemon. am i missing the correct setting to use? should i do -Dspark.akka.frameSize=1 on my application launch directly, and then also set spark.executor.extraJavaOptions? so basically repeat it?
little confused about SPARK_JAVA_OPTS alternatives
i have some settings that i think are relevant for my application. they are spark.akka settings so i assume they are relevant for both executors and my driver program. i used to do: SPARK_JAVA_OPTS=-Dspark.akka.frameSize=1 now this is deprecated. the alternatives mentioned are: * some spark-submit settings which are not relevant to me since i do not use spark-submit (i launch spark jobs from an existing application) * spark.executor.extraJavaOptions to set -X options. i am not sure what -X options are, but it doesnt sound like what i need, since its only for executors * SPARK_DAEMON_OPTS to set java options for standalone daemons (i.e. master, worker), that sounds like i should not use it since i am trying to change settings for an app, not a daemon. am i missing the correct setting to use? should i do -Dspark.akka.frameSize=1 on my application launch directly, and then also set spark.executor.extraJavaOptions? so basically repeat it?
Re: writing my own RDD
resending... my email somehow never made it to the user list. On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers ko...@tresata.com wrote: in writing my own RDD i ran into a few issues with respect to stuff being private in spark. in compute i would like to return an iterator that respects task killing (as HadoopRDD does), but the mechanics for that are inside the private InterruptibleIterator. also the exception i am supposed to throw (TaskKilledException) is private to spark.
Re: writing my own RDD
will do On May 11, 2014 6:44 PM, Aaron Davidson ilike...@gmail.com wrote: You got a good point there, those APIs should probably be marked as @DeveloperAPI. Would you mind filing a JIRA for that ( https://issues.apache.org/jira/browse/SPARK)? On Sun, May 11, 2014 at 11:51 AM, Koert Kuipers ko...@tresata.com wrote: resending... my email somehow never made it to the user list. On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers ko...@tresata.com wrote: in writing my own RDD i ran into a few issues with respect to stuff being private in spark. in compute i would like to return an iterator that respects task killing (as HadoopRDD does), but the mechanics for that are inside the private InterruptibleIterator. also the exception i am supposed to throw (TaskKilledException) is private to spark.
Re: os buffer cache does not cache shuffle output file
yes it seems broken. i got only a few emails in the last few days On Fri, May 9, 2014 at 7:24 AM, wxhsdp wxh...@gmail.com wrote: is there something wrong with the mailing list? very few people see my thread
Re: performance improvement on second operation...without caching?
Hey Matei, Not sure i understand that. These are 2 separate jobs. So the second job takes advantage of the fact that there is map output left somewhere on disk from the first job, and re-uses that? On Sat, May 3, 2014 at 8:29 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Diana, Apart from these reasons, in a multi-stage job, Spark saves the map output files from map stages to the filesystem, so it only needs to rerun the last reduce stage. This is why you only saw one stage executing. These files are saved for fault recovery but they speed up subsequent runs. Matei On May 3, 2014, at 5:21 PM, Patrick Wendell pwend...@gmail.com wrote: Ethan, What you said is actually not true: Spark won't cache RDDs unless you ask it to. The observation here - that running the same job can speed up substantially even without caching - is common. This is because other components in the stack are performing caching and optimizations. Two that can make a huge difference are: 1. The OS buffer cache, which will keep recently read disk blocks in memory. 2. The Java just-in-time compiler (JIT), which will use runtime profiling to significantly speed up execution. These can make a huge difference if you are running the same job over and over. And there are other things like the OS network stack increasing TCP windows and so forth. These will all improve response time as a spark program executes. On Fri, May 2, 2014 at 9:27 AM, Ethan Jewett esjew...@gmail.com wrote: I believe Spark caches RDDs it has memory for regardless of whether you actually call the 'cache' method on the RDD. The 'cache' method just tips off Spark that the RDD should have higher priority. At least, that is my experience and it seems to correspond with your experience and with my recollection of other discussions on this topic on the list. However, going back and looking at the programming guide, this is not the way the cache/persist behavior is described. Does the guide need to be updated? On Fri, May 2, 2014 at 9:04 AM, Diana Carroll dcarr...@cloudera.com wrote: I'm just Posty McPostalot this week, sorry folks! :-) Anyway, another question today: I have a bit of code that is pretty time consuming (pasted at the end of the message): It reads in a bunch of XML files, parses them, extracts some data in a map, counts (using reduce), and then sorts. All stages are executed when I do a final operation (take). The first stage is the most expensive: on first run it takes 30s to a minute. I'm not caching anything. When I re-execute that take at the end, I expected it to re-execute all the same stages, and take approximately the same amount of time, but it didn't. The second take executes only a single stage, which runs very fast: the whole operation takes less than 1 second (down from 5 minutes!) While this is awesome (!) I don't understand it. If I'm not caching data, why would I see such a marked performance improvement on subsequent execution? (or is this related to the known .9.1 bug about sortByKey executing an action when it shouldn't?) Thanks, Diana

# load XML files containing device activation records.
# Find the most common device models activated
import xml.etree.ElementTree as ElementTree

# Given a partition containing multi-line XML, parse the contents.
# Return an iterator of activation Elements contained in the partition
def getactivations(fileiterator):
    s = ''
    for i in fileiterator: s = s + str(i)
    filetree = ElementTree.fromstring(s)
    return filetree.getiterator('activation')

# Get the model name from a device activation record
def getmodel(activation):
    return activation.find('model').text

filename = "hdfs://localhost/user/training/activations/*.xml"

# parse each partition as a file into an activation XML record
activations = sc.textFile(filename)
activationTrees = activations.mapPartitions(lambda xml: getactivations(xml))
models = activationTrees.map(lambda activation: getmodel(activation))

# count and sort activations by model
topmodels = models.map(lambda model: (model,1))\
    .reduceByKey(lambda v1,v2: v1+v2)\
    .map(lambda (model,count): (count,model))\
    .sortByKey(ascending=False)

# display the top 10 models
for (count,model) in topmodels.take(10):
    print "Model %s (%s)" % (model,count)

# repeat!
for (count,model) in topmodels.take(10):
    print "Model %s (%s)" % (model,count)
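A small Scala illustration of what Matei describes, with a made-up path: the map stage's shuffle output stays on local disk after the first run, so re-running the same lineage skips straight to the final stage.

    // word count with a shuffle; path and data are illustrative
    val counts = sc.textFile("hdfs:///data/words/*.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10) // run 1: map stage and reduce stage both execute
    counts.take(10) // run 2: existing map output files are reused, only the reduce stage reruns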
Re: Spark: issues with running a sbt fat jar due to akka dependencies
not sure why applying concat to reference.conf didn't work for you. since it simply concatenates the files, the key akka.version should be preserved. we had the same situation for a while without issues. On May 1, 2014 8:46 PM, Shivani Rao raoshiv...@gmail.com wrote: Hello Koert, That did not work. I specified it in my email already. But I figured out a way around it by excluding akka dependencies Shivani On Tue, Apr 29, 2014 at 12:37 PM, Koert Kuipers ko...@tresata.com wrote: you need to merge reference.conf files and its no longer an issue. see the Build file for spark itself: case "reference.conf" => MergeStrategy.concat On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao raoshiv...@gmail.com wrote: Hello folks, I was going to post this question to the spark user group as well. If you have any leads on how to solve this issue please let me know: I am trying to build a basic spark project (spark depends on akka) and I am trying to create a fatjar using sbt assembly. The goal is to run the fatjar via the command line as follows: java -cp <path to my spark fatjar> <mainclassname> I encountered deduplication errors in the following akka libraries during sbt assembly: akka-remote_2.10-2.2.3.jar with akka-remote_2.10-2.2.3-shaded-protobuf.jar akka-actor_2.10-2.2.3.jar with akka-actor_2.10-2.2.3-shaded-protobuf.jar I resolved them by using MergeStrategy.first and that helped with a successful compilation of the sbt assembly command. But some or the other configuration parameter in akka kept throwing up the following message: Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key I then used MergeStrategy.concat for reference.conf and I started getting this repeated error: Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'. I noticed that akka.version is only in the akka-actor jars and not in akka-remote. The resulting reference.conf (in my final fat jar) does not contain akka.version either. So the strategy is not working. There are several things I could try a) Use the following dependency https://github.com/sbt/sbt-proguard b) Write a build.scala to handle merging of reference.conf https://spark-project.atlassian.net/browse/SPARK-395 http://letitcrash.com/post/21025950392/howto-sbt-assembly-vs-reference-conf c) Create a reference.conf by merging all akka configurations and then passing it in my java -cp command as shown below java -cp jar-name -DConfig.file=config The main issue is that if I run the spark jar via sbt run there are no errors in accessing any of the akka configuration parameters. It is only when I run it via the command line (java -cp jar-name classname) that I encounter the error. Which of these is a long term fix to akka issues? For now, I removed the akka dependencies and that solved the problem, but I know that is not a long term solution Regards, Shivani -- Software Engineer Analytics Engineering Team@ Box Mountain View, CA
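For reference, a sketch of the fix Koert points at, in sbt-assembly's 0.11-era build.sbt syntax (adjust to your plugin version):

    import sbtassembly.Plugin._
    import AssemblyKeys._

    assemblySettings

    // concatenate reference.conf files instead of picking just one, so keys
    // contributed by different akka jars (such as akka.version) all survive
    mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
      {
        case "reference.conf" => MergeStrategy.concat
        case x                => old(x)
      }
    }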
Re: Storage information about an RDD from the API
SparkContext.getRDDStorageInfo On Tue, Apr 29, 2014 at 12:34 PM, Andras Nemeth andras.nem...@lynxanalytics.com wrote: Hi, Is it possible to know from code whether an RDD is cached, and more precisely, how many of its partitions are cached in memory and how many are cached on disk? I know I can get the storage level, but I also want to know the current actual caching status. Knowing memory consumption would also be awesome. :) Basically what I'm looking for is the information on the storage tab of the UI, but accessible from the API. Thanks, Andras
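A minimal usage sketch, with field names as in Spark 1.0's RDDInfo:

    // the storage tab's numbers, but from code
    sc.getRDDStorageInfo.foreach { info =>
      println(s"rdd ${info.id} '${info.name}': " +
        s"${info.numCachedPartitions} of ${info.numPartitions} partitions cached, " +
        s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }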
Re: Spark: issues with running a sbt fat jar due to akka dependencies
you need to merge reference.conf files and its no longer an issue. see the Build file for spark itself: case "reference.conf" => MergeStrategy.concat On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao raoshiv...@gmail.com wrote: Hello folks, I was going to post this question to the spark user group as well. If you have any leads on how to solve this issue please let me know: I am trying to build a basic spark project (spark depends on akka) and I am trying to create a fatjar using sbt assembly. The goal is to run the fatjar via the command line as follows: java -cp <path to my spark fatjar> <mainclassname> I encountered deduplication errors in the following akka libraries during sbt assembly: akka-remote_2.10-2.2.3.jar with akka-remote_2.10-2.2.3-shaded-protobuf.jar akka-actor_2.10-2.2.3.jar with akka-actor_2.10-2.2.3-shaded-protobuf.jar I resolved them by using MergeStrategy.first and that helped with a successful compilation of the sbt assembly command. But some or the other configuration parameter in akka kept throwing up the following message: Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key I then used MergeStrategy.concat for reference.conf and I started getting this repeated error: Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'. I noticed that akka.version is only in the akka-actor jars and not in akka-remote. The resulting reference.conf (in my final fat jar) does not contain akka.version either. So the strategy is not working. There are several things I could try a) Use the following dependency https://github.com/sbt/sbt-proguard b) Write a build.scala to handle merging of reference.conf https://spark-project.atlassian.net/browse/SPARK-395 http://letitcrash.com/post/21025950392/howto-sbt-assembly-vs-reference-conf c) Create a reference.conf by merging all akka configurations and then passing it in my java -cp command as shown below java -cp jar-name -DConfig.file=config The main issue is that if I run the spark jar via sbt run there are no errors in accessing any of the akka configuration parameters. It is only when I run it via the command line (java -cp jar-name classname) that I encounter the error. Which of these is a long term fix to akka issues? For now, I removed the akka dependencies and that solved the problem, but I know that is not a long term solution Regards, Shivani -- Software Engineer Analytics Engineering Team@ Box Mountain View, CA
Re: ui broken in latest 1.0.0
got it. makes sense. i am surprised it worked before... On Apr 18, 2014 9:12 PM, Andrew Or and...@databricks.com wrote: Hi Koert, I've tracked down what the bug is. The caveat is that each StageInfo only keeps around the RDDInfo of the last RDD associated with the Stage. More concretely, if you have something like sc.parallelize(1 to 1000).persist.map(i => (i, i)).count() This creates two RDDs within one Stage, and the persisted RDD doesn't show up on the UI because it is not the last RDD of this stage. I filed a JIRA for this here: https://issues.apache.org/jira/browse/SPARK-1538. Thanks again for reporting this. I will push out a fix shortly. Andrew On Tue, Apr 8, 2014 at 1:30 PM, Koert Kuipers ko...@tresata.com wrote: our one cached RDD in this run has id 3 *** onStageSubmitted ** rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B _rddInfoMap: Map(2 -> RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B) *** onTaskEnd ** _rddInfoMap: Map(2 -> RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B) storageStatusList: List(StorageStatus(BlockManagerId(driver, 192.168.3.169, 34330, 0),579325132,Map())) *** onStageCompleted ** _rddInfoMap: Map() *** onStageSubmitted ** rddInfo: RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B _rddInfoMap: Map(7 -> RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B) *** updateRDDInfo ** *** onTaskEnd ** _rddInfoMap: Map(7 -> RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B) storageStatusList: List(StorageStatus(BlockManagerId(driver, 192.168.3.169, 34330, 0),579325132,Map(rdd_3_0 -> BlockStatus(StorageLevel(false, true, false, true, 1),19944,0,0 *** onStageCompleted ** _rddInfoMap: Map() On Tue, Apr 8, 2014 at 4:20 PM, Koert Kuipers ko...@tresata.com wrote: 1) at the end of the callback 2) yes we simply expose sc.getRDDStorageInfo to the user via REST 3) yes exactly. we define the RDDs at startup, all of them are cached. from that point on we only do calculations on these cached RDDs. i will add some more println statements for storageStatusList On Tue, Apr 8, 2014 at 4:01 PM, Andrew Or and...@databricks.com wrote: Hi Koert, Thanks for pointing this out. However, I am unable to reproduce this locally. It seems that there is a discrepancy between what the BlockManagerUI and the SparkContext think is persisted. This is strange because both sources ultimately derive this information from the same place - by doing sc.getExecutorStorageStatus. I have a couple of questions for you: 1) In your print statements, do you print them in the beginning or at the end of each callback? It would be good to keep them at the end, since in the beginning the data structures have not been processed yet. 2) You mention that you get the RDD info through your own API. How do you get this information? Is it through sc.getRDDStorageInfo? 3) What did your application do to produce this behavior?
Did you make an RDD, persist it once, and then use it many times afterwards or something similar? It would be super helpful if you could also print out what StorageStatusListener's storageStatusList looks like by the end of each onTaskEnd. I will continue to look into this on my side, but do let me know once you have any updates. Andrew On Tue, Apr 8, 2014 at 11:26 AM, Koert Kuipers ko...@tresata.comwrote: yet at same time i can see via our own api: storageInfo: { diskSize: 0, memSize: 19944, numCachedPartitions: 1, numPartitions: 1 } On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.comwrote: i put some println statements in BlockManagerUI i have RDDs that are cached in memory. I see this: *** onStageSubmitted ** rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B _rddInfoMap: Map(2 - RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions
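A runnable form of Andrew's reproduction, plus, as a hedged inference from his explanation rather than anything stated in the thread, the shape that should remain visible (the persisted RDD being the last one in its stage):

    // persisted RDD is NOT the last RDD of its stage: it vanishes from the UI (SPARK-1538)
    val hidden = sc.parallelize(1 to 1000).persist()
    hidden.map(i => (i, i)).count()

    // persisted RDD IS the last RDD of its stage: it shows up on the storage tab
    val visible = sc.parallelize(1 to 1000).map(i => (i, i)).persist()
    visible.count()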
Re: ui broken in latest 1.0.0
i tried again with latest master, which includes commit below, but ui page still shows nothing on storage tab. koert commit ada310a9d3d5419e101b24d9b41398f609da1ad3 Author: Andrew Or andrewo...@gmail.com Date: Mon Mar 31 23:01:14 2014 -0700 [Hot Fix #42] Persisted RDD disappears on storage page if re-used If a previously persisted RDD is re-used, its information disappears from the Storage page. This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite any existing Author: Andrew Or andrewo...@gmail.com Closes #281 from andrewor14/ui-storage-fix and squashes the following commits: 408585a [Andrew Or] Fix storage UI bug On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.com wrote: got it thanks On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote: This is fixed in https://github.com/apache/spark/pull/281. Please try again with the latest master. -Xiangrui On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com wrote: i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days ago (apr 5) that the application detail ui no longer shows any RDDs on the storage tab, despite the fact that they are definitely cached. i am running spark in standalone mode.
assumption that lib_managed is present
when i start spark-shell i now see ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or directory we do not package a lib_managed with our spark build (never did). maybe the logic in compute-classpath.sh that searches for datanucleus should check for the existence of lib_managed before doing ls on it?
Re: ui broken in latest 1.0.0
note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the spark-shell in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.com wrote: i tried again with latest master, which includes commit below, but ui page still shows nothing on storage tab. koert commit ada310a9d3d5419e101b24d9b41398f609da1ad3 Author: Andrew Or andrewo...@gmail.com Date: Mon Mar 31 23:01:14 2014 -0700 [Hot Fix #42] Persisted RDD disappears on storage page if re-used If a previously persisted RDD is re-used, its information disappears from the Storage page. This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite any existing Author: Andrew Or andrewo...@gmail.com Closes #281 from andrewor14/ui-storage-fix and squashes the following commits: 408585a [Andrew Or] Fix storage UI bug On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.com wrote: got it thanks On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote: This is fixed in https://github.com/apache/spark/pull/281. Please try again with the latest master. -Xiangrui On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com wrote: i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days ago (apr 5) that the application detail ui no longer shows any RDDs on the storage tab, despite the fact that they are definitely cached. i am running spark in standalone mode.
Re: ui broken in latest 1.0.0
sorry, i meant to say: note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the SPARK-APPLICATION-UI in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote: note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the spark-shell in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.com wrote: i tried again with latest master, which includes commit below, but ui page still shows nothing on storage tab. koert commit ada310a9d3d5419e101b24d9b41398f609da1ad3 Author: Andrew Or andrewo...@gmail.com Date: Mon Mar 31 23:01:14 2014 -0700 [Hot Fix #42] Persisted RDD disappears on storage page if re-used If a previously persisted RDD is re-used, its information disappears from the Storage page. This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite any existing Author: Andrew Or andrewo...@gmail.com Closes #281 from andrewor14/ui-storage-fix and squashes the following commits: 408585a [Andrew Or] Fix storage UI bug On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.com wrote: got it thanks On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote: This is fixed in https://github.com/apache/spark/pull/281. Please try again with the latest master. -Xiangrui On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com wrote: i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days ago (apr 5) that the application detail ui no longer shows any RDDs on the storage tab, despite the fact that they are definitely cached. i am running spark in standalone mode.
Re: ui broken in latest 1.0.0
yes i call an action after cache, and i can see that the RDDs are fully cached using context.getRDDStorageInfo which we expose via our own api. i did not run make-distribution.sh, we have our own scripts to build a distribution. however if your question is if i correctly deployed the latest build, let me check to be sure. On Tue, Apr 8, 2014 at 12:43 PM, Xiangrui Meng men...@gmail.com wrote: That commit did work for me. Could you confirm the following: 1) After you called cache(), did you make any actions like count() or reduce()? If you don't materialize the RDD, it won't show up in the storage tab. 2) Did you run ./make-distribution.sh after you switched to the current master? Xiangrui On Tue, Apr 8, 2014 at 9:33 AM, Koert Kuipers ko...@tresata.com wrote: i tried again with latest master, which includes commit below, but ui page still shows nothing on storage tab. koert commit ada310a9d3d5419e101b24d9b41398f609da1ad3 Author: Andrew Or andrewo...@gmail.com Date: Mon Mar 31 23:01:14 2014 -0700 [Hot Fix #42] Persisted RDD disappears on storage page if re-used If a previously persisted RDD is re-used, its information disappears from the Storage page. This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite any existing Author: Andrew Or andrewo...@gmail.com Closes #281 from andrewor14/ui-storage-fix and squashes the following commits: 408585a [Andrew Or] Fix storage UI bug On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.com wrote: got it thanks On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote: This is fixed in https://github.com/apache/spark/pull/281. Please try again with the latest master. -Xiangrui On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com wrote: i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days ago (apr 5) that the application detail ui no longer shows any RDDs on the storage tab, despite the fact that they are definitely cached. i am running spark in standalone mode.
Re: ui broken in latest 1.0.0
yes i am definitely using latest On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng men...@gmail.com wrote: That commit fixed the exact problem you described. That is why I want to confirm that you switched to the master branch. bin/spark-shell doesn't detect code changes, so you need to run ./make-distribution.sh to re-compile Spark first. -Xiangrui On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.com wrote: sorry, i meant to say: note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the SPARK-APPLICATION-UI in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote: note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the spark-shell in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.comwrote: i tried again with latest master, which includes commit below, but ui page still shows nothing on storage tab. koert commit ada310a9d3d5419e101b24d9b41398f609da1ad3 Author: Andrew Or andrewo...@gmail.com Date: Mon Mar 31 23:01:14 2014 -0700 [Hot Fix #42] Persisted RDD disappears on storage page if re-used If a previously persisted RDD is re-used, its information disappears from the Storage page. This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite any existing Author: Andrew Or andrewo...@gmail.com Closes #281 from andrewor14/ui-storage-fix and squashes the following commits: 408585a [Andrew Or] Fix storage UI bug On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.comwrote: got it thanks On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.comwrote: This is fixed in https://github.com/apache/spark/pull/281. Please try again with the latest master. -Xiangrui On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com wrote: i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days ago (apr 5) that the application detail ui no longer shows any RDDs on the storage tab, despite the fact that they are definitely cached. i am running spark in standalone mode.
Re: ui broken in latest 1.0.0
i put some println statements in BlockManagerUI i have RDDs that are cached in memory. I see this: *** onStageSubmitted ** rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B _rddInfoMap: Map(2 - RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B) *** onTaskEnd ** Map(2 - RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B) *** onStageCompleted ** Map() *** onStageSubmitted ** rddInfo: RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B _rddInfoMap: Map(7 - RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B) *** onTaskEnd ** Map(7 - RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B) *** onStageCompleted ** Map() The storagelevels you see here are never the ones of my RDDs. and apparently updateRDDInfo never gets called (i had println in there too). On Tue, Apr 8, 2014 at 2:13 PM, Koert Kuipers ko...@tresata.com wrote: yes i am definitely using latest On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng men...@gmail.com wrote: That commit fixed the exact problem you described. That is why I want to confirm that you switched to the master branch. bin/spark-shell doesn't detect code changes, so you need to run ./make-distribution.sh to re-compile Spark first. -Xiangrui On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.com wrote: sorry, i meant to say: note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the SPARK-APPLICATION-UI in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.comwrote: note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the spark-shell in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.comwrote: i tried again with latest master, which includes commit below, but ui page still shows nothing on storage tab. koert commit ada310a9d3d5419e101b24d9b41398f609da1ad3 Author: Andrew Or andrewo...@gmail.com Date: Mon Mar 31 23:01:14 2014 -0700 [Hot Fix #42] Persisted RDD disappears on storage page if re-used If a previously persisted RDD is re-used, its information disappears from the Storage page. This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite any existing Author: Andrew Or andrewo...@gmail.com Closes #281 from andrewor14/ui-storage-fix and squashes the following commits: 408585a [Andrew Or] Fix storage UI bug On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.comwrote: got it thanks On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.comwrote: This is fixed in https://github.com/apache/spark/pull/281. Please try again with the latest master. 
-Xiangrui On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com wrote: i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days ago (apr 5) that the application detail ui no longer shows any RDDs on the storage tab, despite the fact that they are definitely cached. i am running spark in standalone mode.
Re: ui broken in latest 1.0.0
yet at the same time i can see via our own api:

storageInfo: { diskSize: 0, memSize: 19944, numCachedPartitions: 1, numPartitions: 1 }
Re: ui broken in latest 1.0.0
1) at the end of the callback
2) yes, we simply expose sc.getRDDStorageInfo to the user via REST
3) yes, exactly. we define the RDDs at startup, and all of them are cached. from that point on we only do calculations on these cached RDDs. i will add some more println statements for storageStatusList

On Tue, Apr 8, 2014 at 4:01 PM, Andrew Or and...@databricks.com wrote:

Hi Koert,

Thanks for pointing this out. However, I am unable to reproduce this locally. It seems that there is a discrepancy between what the BlockManagerUI and the SparkContext think is persisted. This is strange because both sources ultimately derive this information from the same place - by doing sc.getExecutorStorageStatus. I have a couple of questions for you:

1) In your print statements, do you print them at the beginning or at the end of each callback? It would be good to keep them at the end, since at the beginning the data structures have not been processed yet.
2) You mention that you get the RDD info through your own API. How do you get this information? Is it through sc.getRDDStorageInfo?
3) What did your application do to produce this behavior? Did you make an RDD, persist it once, and then use it many times afterwards, or something similar?

It would be super helpful if you could also print out what StorageStatusListener's storageStatusList looks like by the end of each onTaskEnd. I will continue to look into this on my side, but do let me know once you have any updates.

Andrew
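for reference, a minimal sketch of what exposing sc.getRDDStorageInfo can look like (the helper name and the JSON shape here are illustrative, not the actual in-house REST api from this thread):

import org.apache.spark.SparkContext

// serialize the per-RDD storage info into a simple JSON array
def storageJson(sc: SparkContext): String =
  sc.getRDDStorageInfo.map { info =>
    "{\"id\": " + info.id +
    ", \"memSize\": " + info.memSize +
    ", \"diskSize\": " + info.diskSize +
    ", \"numCachedPartitions\": " + info.numCachedPartitions +
    ", \"numPartitions\": " + info.numPartitions + "}"
  }.mkString("[", ",", "]")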
Re: ui broken in latest 1.0.0
our one cached RDD in this run has id 3

*** onStageSubmitted ***
rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
_rddInfoMap: Map(2 -> RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B)
*** onTaskEnd ***
_rddInfoMap: Map(2 -> RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B)
storageStatusList: List(StorageStatus(BlockManagerId(driver, 192.168.3.169, 34330, 0),579325132,Map()))
*** onStageCompleted ***
_rddInfoMap: Map()
*** onStageSubmitted ***
rddInfo: RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
_rddInfoMap: Map(7 -> RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B)
*** updateRDDInfo ***
*** onTaskEnd ***
_rddInfoMap: Map(7 -> RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B)
storageStatusList: List(StorageStatus(BlockManagerId(driver, 192.168.3.169, 34330, 0),579325132,Map(rdd_3_0 -> BlockStatus(StorageLevel(false, true, false, true, 1),19944,0,0))))
*** onStageCompleted ***
_rddInfoMap: Map()
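for anyone who wants to reproduce this kind of trace without patching BlockManagerUI, a rough sketch using a user-level listener (assuming the 1.0-era SparkListener API and the shell's predefined sc; the class name and output format are illustrative, not what this thread used):

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// illustrative debug listener: dump executor storage status after each task
class StorageDebugListener(sc: SparkContext) extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    println("*** onTaskEnd ***")
    sc.getExecutorStorageStatus.foreach(status => println(status.blocks))
  }
}

sc.addSparkListener(new StorageDebugListener(sc))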
RDDInfo visibility SPARK-1132
any reason why RDDInfo suddenly became private in SPARK-1132? we are using it to show users the status of their rdds
Re: RDDInfo visibility SPARK-1132
ok yeah, we are using StageInfo and TaskInfo too...

On Mon, Apr 7, 2014 at 8:51 PM, Andrew Or and...@databricks.com wrote: Hi Koert, Other users have expressed interest for us to expose similar classes too (i.e. StageInfo, TaskInfo). In the newest release, they will be available as part of the developer API. The particular PR that will change this is: https://github.com/apache/spark/pull/274. Cheers, Andrew
Re: Generic types and pair RDDs
the pair RDD functions are added by an implicit conversion (rddToPairRDDFunctions in the SparkContext object) that requires a ClassTag for the key type, so you need to import the implicits and add a context bound on K:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]): RDD[(K, Int)] =
  rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }

On Tue, Apr 1, 2014 at 4:55 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: When my tuple type includes a generic type parameter, the pair RDD functions aren't available. Take for example the following (a join on two RDDs, taking the sum of the values):

def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]): RDD[(String, Int)] =
  rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }

That works fine, but let's say I replace the type of the key with a generic type:

def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]): RDD[(K, Int)] =
  rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }

This latter function gets the compiler error "value join is not a member of org.apache.spark.rdd.RDD[(K, Int)]". The reason is probably obvious, but I don't have much Scala experience. Can anyone explain what I'm doing wrong?

-- Daniel Siegmann, Software Developer, Velos, Accelerating Machine Learning, 440 Ninth Avenue, 11th Floor, New York, NY 10001, E: daniel.siegm...@velos.io, W: www.velos.io
Re: unable to build spark - sbt/sbt: line 50: killed
i have found that i am unable to build/test spark with sbt and java 6, but using java 7 works (and it compiles with java target version 1.6, so the binaries are usable from java 6)

On Sat, Mar 22, 2014 at 3:11 PM, Bharath Bhushan manku.ti...@outlook.com wrote: Thanks for the reply. It turns out that my ubuntu-in-vagrant had 512MB of ram only. Increasing it to 1024MB allowed the assembly to finish successfully. Peak usage was around 780MB.

To: user@spark.apache.org
From: vboylin1...@gmail.com
Subject: Re: unable to build spark - sbt/sbt: line 50: killed
Date: Sat, 22 Mar 2014 20:03:28 +0800

A lot of memory is needed to build spark; I think you should make -Xmx larger, 2g for example.

From: Bharath Bhushan manku.ti...@outlook.com
Sent: 2014/3/22 12:50
To: user@spark.apache.org
Subject: unable to build spark - sbt/sbt: line 50: killed

I am getting the following error when trying to build spark. I tried various sizes for the -Xmx and other memory related arguments to the java command line, but the assembly command still fails.

$ sbt/sbt assembly
...
[info] Compiling 298 Scala sources and 17 Java sources to /vagrant/spark-0.9.0-incubating-bin-hadoop2/core/target/scala-2.10/classes...
sbt/sbt: line 50: 10202 Killed java -Xmx1900m -XX:MaxPermSize=1000m -XX:ReservedCodeCacheSize=256m -jar ${JAR} $@

Versions of software:
Spark: 0.9.0 (hadoop2 binary)
Scala: 2.10.3
Ubuntu: Ubuntu 12.04.4 LTS - Linux vagrant-ubuntu-precise-64 3.2.0-54-generic
Java: 1.6.0_45 (oracle java 6)

I can still use the binaries in bin/ but I was just trying to check if sbt/sbt assembly works fine.

Thanks
Re: Spark enables us to process Big Data on an ARM cluster !!
i dont know anything about arm clusters but it looks great. what are the specs? do the nodes have no local disk at all?

On Tue, Mar 18, 2014 at 10:36 PM, Chanwit Kaewkasi chan...@gmail.com wrote: Hi all, We are a small team doing research on low-power (and low-cost) ARM clusters. We built a 20-node ARM cluster that is able to run Hadoop. But as you all know, Hadoop performs on-disk operations, so it's not suitable for a constrained machine powered by ARM. We then switched to Spark and have to say wow!! Spark / HDFS enables us to crunch through the Wikipedia articles (of year 2012), 34GB in size, in 1h50m. We have identified the bottleneck, and it's our 100M network. Here's the cluster: https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/Mk-I_SSD.png And this is what we got from Spark's shell: https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/result_00.png I think it's the first ARM cluster that can process a non-trivial size of Big Data. (Please correct me if I'm wrong.) I really want to thank the Spark team that makes this possible !! Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit
Re: trying to understand job cancellation
on spark 1.0.0-SNAPSHOT this seems to work. so far i have seen no issues.

On Thu, Mar 6, 2014 at 8:44 AM, Koert Kuipers ko...@tresata.com wrote: its a 0.9 snapshot from january, running in standalone mode. have these fixes been merged into 0.9?

On Thu, Mar 6, 2014 at 12:45 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Which version of Spark is this in, Koert? There might have been some fixes more recently for it. Matei

On Mar 5, 2014, at 5:26 PM, Koert Kuipers ko...@tresata.com wrote: Sorry, I meant to say: the issue seems to be RDDs shared between a job that got cancelled and a later job. However, even disregarding that, I have the other issue that the active task of the cancelled job hangs around forever, not doing anything

On Mar 5, 2014 7:29 PM, Koert Kuipers ko...@tresata.com wrote: yes, jobs on RDDs that were not part of the cancelled job work fine. so it seems the issue is the cached RDDs that are shared between the cancelled job and the jobs after that.

On Wed, Mar 5, 2014 at 7:15 PM, Koert Kuipers ko...@tresata.com wrote: well, the new jobs use existing RDDs that were also used in the job that got killed. let me confirm that new jobs that use completely different RDDs do not get killed.

On Wed, Mar 5, 2014 at 7:00 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Quite unlikely, as job ids are given out in an incremental fashion, so your future job ids are not likely to be killed if your group id is not repeated. I guess the issue is something else. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Wed, Mar 5, 2014 at 3:50 PM, Koert Kuipers ko...@tresata.com wrote: i did that. my next job gets a random new job group id (a uuid). however that doesnt seem to stop the job from getting sucked into the cancellation, it seems

On Wed, Mar 5, 2014 at 6:47 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: You can randomize job groups as well, to secure yourself against termination.

On Wed, Mar 5, 2014 at 3:42 PM, Koert Kuipers ko...@tresata.com wrote: got it. seems like i better stay away from this feature for now..

On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: One issue is that job cancellation is posted on the event loop. So it's possible that subsequent jobs submitted to the job queue beat the job cancellation event, and hence the cancellation event may end up closing them too. So there's definitely a race condition you are risking, even if you are not running into it.

On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers ko...@tresata.com wrote: SparkContext.cancelJobGroup

On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: How do you cancel the job? Which API do you use?

On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers ko...@tresata.com wrote: i also noticed that jobs (with a new JobGroupId) which i run after this and which use the same RDDs get very confused. i see lots of cancelled stages and retries that go on forever.

On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers ko...@tresata.com wrote: i have a running job that i cancel while keeping the spark context alive. at the time of cancellation the active stage is 14. i see in logs:

2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15

so far it all looks good. then i get a lot of messages like this:

2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone

after this, stage 14 hangs around in active stages, without any sign of progress or cancellation. it just sits there forever, stuck. looking at the logs of the executors confirms this. the tasks seem to be still running, but nothing is happening.
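for reference, the API under discussion, as a minimal sketch (assuming the shell's predefined sc; the uuid group id mirrors the randomization described above):

// tag all jobs submitted from this thread with a fresh group id
val data = sc.parallelize(1 to 1000000)
val groupId = java.util.UUID.randomUUID.toString
sc.setJobGroup(groupId, "cancellable work")
// actions on data now run under groupId
data.map(_ * 2).count()
// from another thread, while those jobs are running:
// sc.cancelJobGroup(groupId)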
combining operations elegantly
not that long ago there was a nice example on here about how to combine multiple operations on a single RDD: basically, if you want to do a count() and something else, how to roll them into a single job. i think patrick wendell gave the examples, but i cant find them anymore. patrick, can you please repost? thanks!
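while waiting for the repost, here is a minimal sketch of the pattern in question (my example, not patrick's; assuming the shell's predefined sc): aggregate folds several results into one pass, so a count and a sum become a single job instead of two.

val rdd = sc.parallelize(Seq(1.0, 2.0, 3.0))
// zero value is (count, sum); the first function folds values within a
// partition, the second merges the per-partition pairs
val (count, sum) = rdd.aggregate((0L, 0.0))(
  (acc, x) => (acc._1 + 1L, acc._2 + x),
  (a, b) => (a._1 + b._1, a._2 + b._2))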
Re: Sbt Permgen
hey sandy, i think that pull request is not relevant to the 0.9 branch i am using. switching to java 7 for sbt/sbt test made it work. not sure why...

On Sun, Mar 9, 2014 at 11:44 PM, Sandy Ryza sandy.r...@cloudera.com wrote: There was an issue related to this fixed recently: https://github.com/apache/spark/pull/103

On Sun, Mar 9, 2014 at 8:40 PM, Koert Kuipers ko...@tresata.com wrote: i edit the last line of sbt/sbt, after which i run: sbt/sbt test

On Sun, Mar 9, 2014 at 10:24 PM, Sean Owen so...@cloudera.com wrote: How are you specifying these args?

On Mar 9, 2014 8:55 PM, Koert Kuipers ko...@tresata.com wrote: i just checked out the latest 0.9. no matter what java options i use in sbt/sbt (i tried -Xmx6G -XX:MaxPermSize=2000m -XX:ReservedCodeCacheSize=300m), i keep getting java.lang.OutOfMemoryError: PermGen space errors when running the tests. curiously, i managed to run the tests with the default dependencies, but with cdh4.5.0 mr1 dependencies i always hit the dreaded PermGen space issue. Any suggestions?
computation slows down 10x because of cached RDDs
hello all, i am observing a strange result. i have a computation that i run on a cached RDD in spark standalone mode. it typically takes about 4 seconds. but when other RDDs that are not relevant to the computation at hand are cached in memory (in the same spark context), the computation takes 40 seconds or more. the problem seems to be GC time, which goes from milliseconds to tens of seconds. note that my issue is not that memory is full: i have cached about 14G in RDDs, with 66G available across the workers for the application, and my computation did not push any cached RDD out of memory. any ideas?
Re: computation slows down 10x because of cached RDDs
hey matei, it happens repeatedly. we are currently running on java 6 with spark 0.9. i will add -XX:+PrintGCDetails and collect details, and also look into java 7's G1. thanks

On Mon, Mar 10, 2014 at 6:27 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Does this happen repeatedly if you keep running the computation, or just the first time? It may take time to move these Java objects to the old generation the first time you run queries, which could lead to a GC pause that also slows down the small queries. If you can run with -XX:+PrintGCDetails in your Java options, it would also be good to see what percent of each GC generation is used. The concurrent mark-and-sweep GC (-XX:+UseConcMarkSweepGC) or the G1 GC in Java 7 (-XX:+UseG1GC) might also avoid these pauses by GCing concurrently with your application threads. Matei
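for anyone wiring those flags in, a sketch, with the caveat that spark.executor.extraJavaOptions is the spark 1.0+ mechanism; on 0.9 the SPARK_JAVA_OPTS environment variable plays the same role, and the master URL below is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

// illustrative: surface GC details and switch collectors on the executors
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("gc-diagnosis")
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintGCDetails -XX:+UseConcMarkSweepGC") // or -XX:+UseG1GC on java 7
val sc = new SparkContext(conf)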
Re: trying to understand job cancellation
i also noticed that jobs (with a new JobGroupId) which i run after this and which use the same RDDs get very confused. i see lots of cancelled stages and retries that go on forever.

On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers ko...@tresata.com wrote: i have a running job that i cancel while keeping the spark context alive. at the time of cancellation the active stage is 14. i see in logs:

2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15

so far it all looks good. then i get a lot of messages like this:

2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone

after this, stage 14 hangs around in active stages, without any sign of progress or cancellation. it just sits there forever, stuck. looking at the logs of the executors confirms this. the tasks seem to be still running, but nothing is happening. for example (by the time i look at this its 4:58, so this task hasnt done anything in 15 mins):

14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
14/03/04 16:43:16 INFO Executor: Finished task ID 943
14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
14/03/04 16:43:16 INFO Executor: Finished task ID 945
14/03/04 16:43:19 INFO BlockManager: Removing RDD 66

not sure what to make of this. any suggestions? best, koert
Re: trying to understand job cancellation
got it. seems like i better stay away from this feature for now..

On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: One issue is that job cancellation is posted on the event loop. So it's possible that subsequent jobs submitted to the job queue beat the job cancellation event, and hence the cancellation event may end up closing them too. So there's definitely a race condition you are risking, even if you are not running into it. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi
Re: Job initialization performance of Spark standalone mode vs YARN
If you need a quick response, re-use your spark context between queries and cache rdds in memory.

On Mar 3, 2014 12:42 AM, polkosity polkos...@gmail.com wrote: Thanks for the advice Mayur. I thought I'd report back on the performance difference... Spark standalone mode has executors processing at capacity in under a second :)
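the pattern being recommended, as a small sketch (the master URL, path, and query shape are hypothetical):

import org.apache.spark.SparkContext

// one long-lived context; data cached once, then served to many queries
val sc = new SparkContext("spark://master:7077", "low-latency-app")
val data = sc.textFile("hdfs:///data/events").cache()
data.count() // first action materializes the cache

// later queries hit the in-memory blocks and skip the job-startup and load cost
def countMatches(term: String): Long = data.filter(_.contains(term)).count()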
Re: Job initialization performance of Spark standalone mode vs YARN
yes, tachyon is in-memory serialized, which is not as fast as cached in memory in spark (not serialized). the difference really depends on your job type.

On Mon, Mar 3, 2014 at 7:10 PM, polkosity polkos...@gmail.com wrote: That's exciting! Will be looking into that, thanks Andrew. Related topic: has anyone had any experience running Spark on the Tachyon in-memory filesystem, and could offer their views on using it?
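for reference, the two in-memory storage levels behind that distinction, as a sketch (assuming the shell's predefined sc; the tachyon-backed level, later exposed as OFF_HEAP, is not shown):

import org.apache.spark.storage.StorageLevel

// deserialized objects in memory: fastest reads, largest footprint
val fast = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)
// serialized bytes in memory: smaller, but pays deserialization on every read
val compact = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY_SER)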