Spark assembly for YARN/CDH5
Does anyone know if Spark assemblies built for CDH5 and YARN are created and available for download? Thanks, Philip
Re: creating a distributed index
After playing around with mapPartitions I think this does exactly what I want. I can pass in a function to mapPartitions that looks like this:

    def f1(iter: Iterator[String]): Iterator[MyIndex] = {
      val idx: MyIndex = new MyIndex()
      while (iter.hasNext) {
        val text: String = iter.next
        idx.add(text)
      }
      List(idx).iterator
    }

    val indexRDD: RDD[MyIndex] = rdd.mapPartitions(f1, true)

Now I can perform a match operation like so:

    val matchRdd: RDD[Match] = indexRDD.flatMap(index => index.`match`(myquery))

I'm sure it won't take much imagination to figure out how to do the matching in a batch way. If anyone has done anything along these lines I'd love to have some feedback. Thanks, Philip

On 08/04/2014 09:46 AM, Philip Ogren wrote: This looks like a really cool feature and it seems likely to be extremely useful for things we are doing. However, I'm not sure it is quite what I need here. With an inverted index you don't actually look items up by their keys; instead you try to match against some input string. So, if I created an inverted index on 1M strings/documents in a single JVM, I could subsequently submit a string-valued query to retrieve the n best matches. I'd like to do something like build ten 1M-string/document indexes across ten nodes, submit a string-valued query to each of the ten indexes, and aggregate the n best matches from the ten sets of results. Would this be possible with IndexedRDD or some other feature of Spark? Thanks, Philip

On 08/01/2014 04:26 PM, Ankur Dave wrote: At 2014-08-01 14:50:22 -0600, Philip Ogren philip.og...@oracle.com wrote: It seems that I could do this with mapPartitions so that each element in a partition gets added to an index for that partition. [...] Would it then be possible to take a string and query each partition's index with it? Or better yet, take a batch of strings and query each string in the batch against each partition's index? I proposed a key-value store based on RDDs called IndexedRDD that does exactly what you described. It uses mapPartitions to construct an index within each partition, then exposes get and multiget methods to allow looking up values associated with given keys. It will hopefully make it into Spark 1.2.0. Until then you can try it out by merging in the pull request locally: https://github.com/apache/spark/pull/1297. See the JIRA for details and slides on how it works: https://issues.apache.org/jira/browse/SPARK-2365. Ankur
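For the batch case, a minimal sketch of what this could look like, given the usual SparkContext sc and assuming MyIndex.`match` returns a Seq[Match] (MyIndex and Match are the hypothetical types from above):

    // Broadcast the whole batch of queries once, then run every query
    // against each partition-local index and tag each result with its query.
    val queries = sc.broadcast(Seq("query one", "query two"))
    val batchMatches: RDD[(String, Match)] = indexRDD.flatMap { index =>
      queries.value.flatMap(q => index.`match`(q).map(m => (q, m)))
    }

Broadcasting keeps each partition's index local while shipping only the (small) query batch to every node.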
relationship of RDD[Array[String]] to Array[Array[String]]
It is really nice that Spark RDDs provide functions that are often equivalent to functions found in the Scala collections. For example, I can call myArray.map(myFx) and, equivalently, myRdd.map(myFx). Awesome! My question is this: is it possible to write code that works on either an RDD or a local collection without having to have parallel implementations? I can't tell that RDD and Array share any supertypes or traits by looking at the respective scaladocs. Perhaps implicit conversions could be used here. What I would like is a single function whose body is like this: myData.map(myFx), where myData could be an RDD[Array[String]] (for example) or an Array[Array[String]]. Has anyone had success doing this? Thanks, Philip
Re: relationship of RDD[Array[String]] to Array[Array[String]]
Thanks Michael, that is one solution I had thought of. It seems like a bit of overkill for the few methods I want to do this for, but I will think about it. I guess I was hoping that I was missing something more obvious/easier. Philip

On 07/21/2014 11:20 AM, Michael Malak wrote: It's really more of a Scala question than a Spark question, but the standard OO (not Scala-specific) way is to create your own custom supertype (e.g. MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD and MyArray), each of which manually forwards method calls to the corresponding pre-existing library implementations. Writing all those forwarding methods is tedious, but Scala provides at least one bit of syntactic sugar that alleviates having to type each method's parameter list twice: http://stackoverflow.com/questions/8230831/is-method-parameter-forwarding-possible-in-scala I'm not seeing a way to utilize implicit conversions in this case. Since Scala is statically (albeit inferred) typed, I don't see a way around having a common supertype.
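A minimal sketch of Michael's suggestion, for the map case only (MyCollection, MyRDD, and MyArray are his hypothetical names, not Spark API):

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    trait MyCollection[T] {
      def map[U: ClassTag](f: T => U): MyCollection[U]
    }

    // Each concrete class just forwards to the underlying library call.
    class MyRDD[T](val rdd: RDD[T]) extends MyCollection[T] {
      def map[U: ClassTag](f: T => U) = new MyRDD(rdd.map(f))
    }

    class MyArray[T](val arr: Array[T]) extends MyCollection[T] {
      def map[U: ClassTag](f: T => U) = new MyArray(arr.map(f))
    }

A function written against MyCollection[T] then runs unchanged over either backing type; the ClassTag context bound is there because both RDD.map and Array.map require one to build the result.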
Re: Announcing Spark 1.0.1
Hi Patrick, This is great news, but I nearly missed the announcement because it had scrolled off the folder view that my Spark user list messages go to - 40+ new threads since you sent the email out on Friday evening. You might consider having someone on your team create a spark-announcement list so that it is easier to disseminate important information like this release announcement. Thanks again for all your hard work. I know you and the rest of the team are getting a million requests a day. Philip

On 07/11/2014 07:35 PM, Patrick Wendell wrote: I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.1 includes fixes across several areas of Spark, including the core API, PySpark, and MLlib. It also includes new features in Spark's (alpha) SQL library, including support for JSON data and performance and stability fixes. Visit the release notes [1] to read about this release or download [2] the release today. [1] http://spark.apache.org/releases/spark-release-1-0-1.html [2] http://spark.apache.org/downloads.html
Multiple SparkContexts with different configurations in same JVM
In various previous versions of Spark (and, I believe, the current version, 1.0.0, as well) we have noticed that it does not seem possible to have, in the same JVM, both a local SparkContext and a SparkContext connected to a cluster, whether a standalone Spark cluster (i.e. using the Spark resource manager) or a YARN cluster. Is this a known issue? If not, I would be happy to write up a bug report describing the bad/unexpected behavior and how to reproduce it. If so, are there plans to fix it? Perhaps there is a Jira issue I could be pointed to, or other related discussion. Thanks, Philip
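For concreteness, a minimal sketch of the pattern being described (the master URL is a placeholder):

    import org.apache.spark.SparkContext

    // Both contexts live in one JVM; in the versions we have tried, the
    // second context fails or behaves unexpectedly.
    val localSc = new SparkContext("local", "local-app")
    val clusterSc = new SparkContext("spark://master:7077", "cluster-app")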
Re: Unit test failure: Address already in use
In my unit tests I have a base class that all my tests extend, with setup and teardown methods that they inherit. They look something like this:

    var spark: SparkContext = _

    @Before
    def setUp() {
      Thread.sleep(100L) // this seems to give Spark more time to reset from the previous test's tearDown
      spark = new SparkContext("local", "test spark")
    }

    @After
    def tearDown() {
      spark.stop()
      spark = null // not sure why this helps but it does!
      System.clearProperty("spark.master.port")
    }

It's been since last fall (i.e. version 0.8.x) since I've examined this code, so I can't vouch that it is still accurate/necessary, but it still works for me.

On 06/18/2014 12:59 PM, Lisonbee, Todd wrote: Disabling parallelExecution has worked for me. Other alternatives I've tried that also work include:

1. Using a lock. This will let tests execute in parallel except for those using a SparkContext. If you have a large number of tests that could execute in parallel, this can shave off some time.

    object TestingSparkContext {
      val lock = new Lock()
    }

    // before you instantiate your local SparkContext
    TestingSparkContext.lock.acquire()

    // after you call sc.stop()
    TestingSparkContext.lock.release()

2. Sharing a local SparkContext between tests. This is nice because your tests will run faster; start-up and shutdown is time consuming (it can add a few seconds per test). The downside is that your tests are using the same SparkContext, so they are less independent of each other. I haven't seen issues with this yet, but there are likely some things that might crop up.

Best, Todd

From: Anselme Vignon [anselme.vig...@flaminem.com] Sent: Wednesday, June 18, 2014 12:33 AM To: user@spark.apache.org Subject: Re: Unit test failure: Address already in use

Hi, Could your problem come from the fact that you run your tests in parallel? If you run Spark in local mode, you cannot have concurrent Spark instances running; this means that your tests instantiating a SparkContext cannot be run in parallel. The easiest fix is to tell sbt not to run tests in parallel. This can be done by adding the following line to your build.sbt:

    parallelExecution in Test := false

Cheers, Anselme

2014-06-17 23:01 GMT+02:00 SK skrishna...@gmail.com: Hi, I have 3 unit tests (independent of each other) in the /src/test/scala folder. When I run each of them individually using sbt test-only, all 3 pass. But when I run them all using sbt test, they fail with the warning below. I am wondering if the binding exception results in a failure to run the job, thereby causing the test failures. If so, what can I do to address this binding exception? I am running these tests locally on a standalone machine (i.e. SparkContext("local", "test")).

    14/06/17 13:42:48 WARN component.AbstractLifeCycle: FAILED
    org.eclipse.jetty.server.Server@3487b78d: java.net.BindException: Address already in use
    java.net.BindException: Address already in use
      at sun.nio.ch.Net.bind0(Native Method)
      at sun.nio.ch.Net.bind(Net.java:174)
      at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:139)
      at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:77)

thanks
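For what it's worth, a minimal sketch of Todd's second alternative (the object name is hypothetical):

    import org.apache.spark.SparkContext

    // A single local context, created lazily on first use and shared by
    // every suite; pair it with Todd's lock if suites still run in parallel.
    object SharedSparkContext {
      lazy val sc = new SparkContext("local", "shared-test-context")
    }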
Re: Processing audio/video/images
I asked a question related to Marcelo's answer a few months ago. The discussion there may be useful: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html

On 06/02/2014 06:09 PM, Marcelo Vanzin wrote: Hi Jamal, If what you want is to process lots of files in parallel, the best approach is probably to load all the file names into an array and parallelize that. Each task then takes a path as input and can process it however it wants. Or you could write the file list to a file and then use sc.textFile() to open it (assuming one path per line); the rest is pretty much the same as above. It will probably be hard to process each individual file in parallel, unless mp3 and jpg files can be split into multiple blocks that can be processed separately. In that case, you'd need a custom (Hadoop) input format that is able to calculate the splits, but it doesn't sound like that's what you want.

On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha jamalsha...@gmail.com wrote: Hi, How does one process data sources other than text? Let's say I have millions of mp3 (or jpeg) files and I want to use Spark to process them. How does one go about it? I have never been able to figure this out. Let's say I have a library in Python which works like the following:

    import audio
    song = audio.read_mp3(filename)

Then most of the methods are attached to song, or maybe there is another function which takes the song type as an input. Maybe the above is just rambling, but how do I use Spark to process (say) audio files? Thanks
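A minimal sketch of Marcelo's first suggestion, given an existing SparkContext sc (listMp3Paths and processSong are hypothetical stand-ins for the user's own code):

    // Parallelize the file names, not the file contents; each task then
    // reads and processes one file on its own.
    val paths: Seq[String] = listMp3Paths()       // hypothetical helper
    val features = sc.parallelize(paths)
      .map(path => processSong(path))             // hypothetical per-file work
    features.collect()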
Re: Use SparkListener to get overall progress of an action
Hi Pierre, I asked a similar question on this list about 6 weeks ago. Here is one answer I got that is of particular note (http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccamjob8n3foaxd-dc5j57-n1oocwxefcg5chljwnut7qnreq...@mail.gmail.com%3E):

"In the upcoming release of Spark 1.0 there will be a feature that provides for exactly what you describe: capturing the information displayed on the UI in JSON. More details will be provided in the documentation, but for now, anything before 0.9.1 can only go through JobLogger.scala, which outputs information in a somewhat arbitrary format and will be deprecated soon. If you find this feature useful, you can test it out by building the master branch of Spark yourself, following the instructions in https://github.com/apache/spark/pull/42."

On 05/22/2014 08:51 AM, Pierre B wrote: Is there a simple way to monitor the overall progress of an action using SparkListener or anything else? I see that one can name an RDD... Could that be used to determine which action triggered a stage...? Thanks, Pierre
Re: Spark unit testing best practices
Have you actually found this to be true? I have found Spark local mode to be quite good about blowing up if there is something non-serializable, and so my unit tests have been great for detecting this. I have never seen something that worked in local mode fail on the cluster because of different serialization requirements between the two. Perhaps it is different when using Kryo.

On 05/14/2014 04:34 AM, Andras Nemeth wrote: E.g. if I accidentally use a closure which has something non-serializable in it, then my test will happily succeed in local mode but go down in flames on a real cluster.
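A quick sketch of the kind of test that, in my experience, blows up even with a local master, because Spark serializes the closure up front when cleaning it:

    // Deliberately bad capture: helper is not Serializable, so the map
    // below should throw org.apache.spark.SparkException: Task not serializable.
    class NotSerializable { def inc(x: Int): Int = x + 1 }
    val helper = new NotSerializable
    sc.parallelize(1 to 10).map(helper.inc).collect()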
Re: Opinions stratosphere
Great reference! I just skimmed through the results without reading much of the methodology, but it looks like Spark outperforms Stratosphere fairly consistently in the experiments. It's too bad the data sources only range from 2 GB to 8 GB. Who knows whether the apparent pattern would extend to 64 GB, 128 GB, 1 TB, and so on...

On 05/01/2014 06:02 PM, Christopher Nguyen wrote: Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted such a comparative study as a Masters thesis: http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf According to this snapshot (c. 2013), Stratosphere differs from Spark in not having an explicit concept of an in-memory dataset (e.g., an RDD). In principle this could be argued to be an implementation detail; the operators and execution plan/data flow are of primary concern in the API, and the data representations/materializations are otherwise unspecified. But in practice, for long-running interactive applications, I consider RDDs to be of fundamental, first-class importance, and the key distinguishing feature of Spark's model versus other in-memory approaches that treat memory merely as an implicit cache. -- Christopher T. Nguyen, co-founder and CEO, Adatao (http://adatao.com, linkedin.com/in/ctnguyen)

On Tue, Nov 26, 2013 at 1:26 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I don't know a lot about it except from the research side, where the team has done interesting optimization work for these types of applications. In terms of the engine, one thing I'm not sure of is whether Stratosphere allows explicit caching of datasets (similar to RDD.cache()) and interactive queries (similar to spark-shell). But it's definitely an interesting project to watch. Matei

On Nov 22, 2013, at 4:17 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, That's what I thought, but as per the slides on http://www.stratosphere.eu they seem to know about Spark, and the Scala API does look similar. I found the PACT model interesting. I would like to know if Matei or other core committers have something to weigh in on. -- Ankur

On 22 Nov 2013, at 16:05, Patrick Wendell pwend...@gmail.com wrote: I've never seen that project before; it would be interesting to get a comparison. It seems to offer a much lower-level API. For instance, this is a wordcount program: https://github.com/stratosphere/stratosphere/blob/master/pact/pact-examples/src/main/java/eu/stratosphere/pact/example/wordcount/WordCount.java

On Thu, Nov 21, 2013 at 3:15 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, I was just curious about https://github.com/stratosphere/stratosphere and how Spark compares to it. Does anyone have any experience with it to make any comments? -- Ankur
RDD.tail()
Has there been any thought to adding a tail() method to RDD? It would be really handy to skip over the first item in an RDD when it contains header information. Even better would be a drop(n: Int) function that would allow you to skip over several lines of header information. Our attempts to do something equivalent with a filter() call seem a bit contorted. Any thoughts? Thanks, Philip
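For what it's worth, a less contorted sketch of drop(n) built on mapPartitionsWithIndex, assuming the header lines all land in the first partition of the file:

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Only partition 0 holds the leading lines of the file, so drop n
    // elements there and pass every other partition through untouched.
    def drop[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] =
      rdd.mapPartitionsWithIndex(
        (i, iter) => if (i == 0) iter.drop(n) else iter,
        preservesPartitioning = true)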
Re: Is there a way to get the current progress of the job?
This is great news, thanks for the update! I will either wait for the 1.0 release or go and test it ahead of time from git, rather than trying to pull it out of JobLogger or creating my own SparkListener.

On 04/02/2014 06:48 PM, Andrew Or wrote: Hi Philip, In the upcoming release of Spark 1.0 there will be a feature that provides for exactly what you describe: capturing the information displayed on the UI in JSON. More details will be provided in the documentation, but for now, anything before 0.9.1 can only go through JobLogger.scala, which outputs information in a somewhat arbitrary format and will be deprecated soon. If you find this feature useful, you can test it out by building the master branch of Spark yourself, following the instructions in https://github.com/apache/spark/pull/42. Andrew

On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com wrote: What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). ...
Re: Is there a way to get the current progress of the job?
I can appreciate the reluctance to expose something like the JobProgressListener as a public interface. It's exactly the sort of thing that you want to deprecate as soon as something better comes along, and it can be a real pain when trying to maintain the level of backwards compatibility that we all expect from commercial-grade software. Instead of simply marking it private, and therefore unavailable to Spark developers, it might be worth incorporating something like a @Beta annotation (http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/annotations/Beta.html) that you could sprinkle liberally throughout Spark, and that communicates "hey, use this if you want, because it's here now, but don't come crying if we rip it out or change it later." This might be better than simply marking so many useful functions/classes as private. I bet such an annotation could generate a compile warning/error for those who don't want to risk using them.

On 04/02/2014 06:40 PM, Patrick Wendell wrote: Hey Philip, Right now there is no mechanism for this. You have to go in through the low-level listener interface. We could consider exposing the JobProgressListener directly - I think it's been factored nicely, so it's fairly decoupled from the UI. The concern is that this is a semi-internal piece of functionality and something we might, e.g., want to change the API of over time. - Patrick

On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com wrote: What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). ...
Re: Is there a way to get the current progress of the job?
What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). Looking through the Spark code, it doesn't seem possible to directly query for specific facts such as how many tasks have succeeded or how many total tasks there are for a given active stage. Instead, it looks like all the data for the page is generated at once using information from the JobProgressListener. It doesn't seem like I have any way to programmatically access this information myself; I can't even instantiate my own JobProgressListener because it is spark package private.

I could implement my own SparkListener and gather up the information myself. It feels a bit awkward, since classes like Task and TaskInfo are also spark package private. It does seem possible to gather up what I need, but it seems like this sort of information should just be available without implementing a custom SparkListener (or, worse, screen-scraping the HTML generated by StageTable!).

I was hoping that I would find the answer in MetricsServlet, which is turned on by default. It seems that when I visit http://cluster:4040/metrics/json/ I should be able to get everything I want, but I don't see the basic stage/task progress information I would expect. Are there special metrics properties that I should set to get this info? I think this would be the best solution - just give it the right URL and parse the resulting JSON - but I can't seem to figure out how to do this or whether it is possible. Any advice is appreciated. Thanks, Philip

On 04/01/2014 09:43 AM, Philip Ogren wrote: Hi DB, Just wondering if you ever got an answer to your question about monitoring progress - either offline or through your own investigation. Any findings would be appreciated. Thanks, Philip

On 01/30/2014 10:32 PM, DB Tsai wrote: Hi guys, When we're running a very long job, we would like to show users the current progress of the map and reduce jobs. After looking at the API documentation, I don't find anything for this. However, in the Spark UI, I can see the progress of the tasks. Is there anything I missed? Thanks. Sincerely, DB Tsai, Machine Learning Engineer, Alpine Data Labs -- Web: http://alpinenow.com/
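A hedged sketch of the custom-SparkListener route described above, written against the 1.x listener API (the stageInfo.numTasks field is an assumption to verify against your Spark version):

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

    // Tracks overall task progress across all stages using only public
    // listener events, without touching the package-private JobProgressListener.
    class ProgressListener extends SparkListener {
      private val finished = new AtomicInteger(0)
      @volatile private var total = 0

      override def onStageSubmitted(stage: SparkListenerStageSubmitted): Unit =
        total += stage.stageInfo.numTasks

      override def onTaskEnd(task: SparkListenerTaskEnd): Unit =
        println(s"progress: ${finished.incrementAndGet()} / $total tasks")
    }

    sc.addSparkListener(new ProgressListener)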