Re: Why collect() has a stage but first() not?

2014-02-19 Thread Mark Hamstra
Not really, since it's considered mostly an implementation detail that shouldn't be important to users and that could conceivably change in the future. There are some useful comments in the code if you really want to figure out how it is implemented. And I will take a somewhat pedantic exception

Re: How to achieve this in Spark

2014-02-19 Thread Mark Hamstra
Your problem is more basic than that. You can't reference one RDD (subsetids) from within an operation on another RDD (allobjects.filter). On Wed, Feb 19, 2014 at 2:23 PM, Soumya Simanta soumya.sima...@gmail.com wrote: I've an RDD that contains ids (Long). subsetids res22:
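A minimal spark-shell sketch of two common workarounds for this restriction. The names subsetids and allobjects come from the thread; the element types, the toy data, and the choice between broadcast and join are assumptions for illustration, not what the thread necessarily settled on.

    // Assumed shapes: subsetids: RDD[Long], allobjects: RDD[(Long, String)]
    val subsetids = sc.parallelize(Seq(1L, 3L, 5L))
    val allobjects = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (5L, "e")))

    // Option 1: collect the (small) id set to the driver and ship it in the closure.
    val idSet = sc.broadcast(subsetids.collect().toSet)
    val viaBroadcast = allobjects.filter { case (id, _) => idSet.value.contains(id) }

    // Option 2: keep everything distributed and use a join instead.
    val viaJoin = allobjects.join(subsetids.map(id => (id, ()))).mapValues { case (obj, _) => obj }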

Re: Interleaving stages

2014-02-17 Thread Mark Hamstra
With so little information about what your code is actually doing, what you have shared looks likely to be an anti-pattern to me. Doing many collect actions is something to be avoided if at all possible, since this forces a lot of network communication to materialize the results back within the

Re: [PoC] ZPark-Ztream : driving spark stream with scalaz-stream

2014-02-14 Thread Mark Hamstra
Yup, that's pretty cool! Nice work. One niggle, though: where you have "Ok folks, now, we have a discretized stream of Int..." I believe you mean "...stream of Long". On Fri, Feb 14, 2014 at 4:20 PM, Luis Ángel Vicente Sánchez langel.gro...@gmail.com wrote: This is awesome! I was wondering

Re: Spark Release 0.9.0 missing org.apache.spark.streaming package + misleading documentation on http://spark.incubator.apache.org/releases/spark-release-0-9-0.html

2014-02-12 Thread Mark Hamstra
*These packages don't exist in spark anymore.* They never did exist in spark-core. If your project depends on spark-streaming, then it should pull in the spark-core dependency as well. If you only depend on spark-core, then you won't get streaming if you don't also ask for it. Similarly for

Re: Failed to find Spark examples assembly... error message

2014-02-10 Thread Mark Hamstra
Is the assembly jar actually missing, or is the script somehow looking in the wrong place? Check /home/centrifuge/spark-0.9.0-incubating-bin-hadoop1/examples/target/scala-2.10. If the examples-assembly jar is not there, then either `./sbt/sbt assembly` or `./sbt/sbt examples/assembly` should

Re: Failed to find Spark examples assembly... error message

2014-02-10 Thread Mark Hamstra
classes -rw-rw-r-- 1 centrifuge centrifuge 125932602 Feb 9 14:51 spark-examples-assembly-0.9.0-incubating.jar -rw-rw-r-- 1 centrifuge centrifuge 78485738 Feb 2 19:05 spark-examples_2.10-assembly-0.9.0-incubating.jar Thanks again. On Mon Feb 10 12:31:31 2014, Mark Hamstra wrote

Re: graphx missing from spark-shell

2014-02-10 Thread Mark Hamstra
https://github.com/apache/incubator-spark/pull/527 On Mon, Feb 10, 2014 at 4:11 PM, Eric Kimbrel eric.kimb...@soteradefense.com wrote: I also did a maven clean package with the same options. -- View this message in context:

Re: What I am missing from configuration?

2014-02-05 Thread Mark Hamstra
What do you mean by the last version of spark-0.9.0? To be precise, there isn't anything known as spark-0.9.0. What was released recently is spark-0.9.0-incubating, and there is and only ever will be one version of that. If you're talking about a 0.9.0-incubating-SNAPSHOT built locally, then

Re: spark 0.9.0 on top of Mesos error Akka Actor not found

2014-02-04 Thread Mark Hamstra
export MASTER=mesos://zk://10.10.0.141:2181/mesos On Tue, Feb 4, 2014 at 2:20 AM, Francesco Bongiovanni bongiova...@gmail.com wrote: Hi everyone, I installed the latest Spark release (0.9.0), on top of Mesos, linked to my HDFS 1.2.1 (sbt assembly success, make-distribution success), and

Re: sbt dependencies for running Standalone app on Spark v0.9.0-incubating-SNAPSHOT

2014-02-04 Thread Mark Hamstra
Just use `mvn install` or `sbt publish-local` (depending on which build system you prefer to use) to put your locally-built artifacts into your .m2 or .ivy2 cache, respectively. A typical maven or sbt configuration will resolve them from those caches without any special modifications or further
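A minimal build.sbt sketch of the sbt side of this, assuming the snapshot version string from the thread's subject line; the project name and Scala version are illustrative, and the version should match whatever `sbt publish-local` actually produced.

    // build.sbt for a standalone app against a locally published Spark snapshot
    name := "my-spark-app"                 // illustrative project name

    scalaVersion := "2.10.3"               // assumed; must match the Scala version Spark was built with

    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating-SNAPSHOT"

    // Artifacts from `sbt publish-local` are resolved from ~/.ivy2 automatically; for
    // `mvn install`ed artifacts, adding the local Maven repository as a resolver may be needed.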

Re: RDD and Partition

2014-01-28 Thread Mark Hamstra
in the range 'A' - 'M' be applied toUpperCase and not touch the remaining. Is that possible without running an 'if' condition on all the partitions in the cluster? On Tue, Jan 28, 2014 at 1:20 PM, Mark Hamstra m...@clearstorydata.com wrote: scala> import org.apache.spark.RangePartitioner scala>

Re: RDD and Partition

2014-01-28 Thread Mark Hamstra
Doesn't avoid an 'if' on every partition, but does avoid it on every element of every partition. On Tue, Jan 28, 2014 at 1:29 PM, Mark Hamstra m...@clearstorydata.com wrote: If I'm understanding you correctly, there's lots of ways you could do that. Here's one, continuing from the previous
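A spark-shell sketch along the lines of the approach discussed in this thread (range-partition by key, then branch once per partition); the data, partition count, and key range are made up for illustration.

    import org.apache.spark.RangePartitioner

    // Toy data keyed by first letter, range-partitioned so each partition covers a key range.
    val pairs = sc.parallelize(('A' to 'Z').map(c => (c, s"value-$c")))
    val part = new RangePartitioner(4, pairs)
    val ranged = pairs.partitionBy(part)

    // Which partitions can contain keys 'A'..'M'? Ask the partitioner once, on the driver.
    val targetParts = ('A' to 'M').map(part.getPartition(_)).toSet

    // The 'if' now runs once per partition, not once per element.
    val result = ranged.mapPartitionsWithIndex({ (idx, iter) =>
      if (targetParts(idx)) iter.map { case (k, v) => (k, v.toUpperCase) } else iter
    }, preservesPartitioning = true)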

Re: Can I share the RDD between multiprocess

2014-01-25 Thread Mark Hamstra
It's a basic strategy that several organizations using Spark have followed, but there isn't yet a canonical implementation or example of such a server in the Spark source code. That is likely to change before the 1.0 release, and the included job server is likely to be based on an

Re: Does foreach operation increase rdd lineage?

2014-01-25 Thread Mark Hamstra
Or just checkpoint() it. On Sat, Jan 25, 2014 at 2:40 PM, Jason Lenderman jslender...@gmail.com wrote: RDDs are supposed to be immutable. Changing values using foreach seems like a bad thing to do, and is going to mess up the probability in some very difficult to understand fashion if you
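A tiny spark-shell sketch of the checkpoint() suggestion; the checkpoint directory and the RDD are illustrative.

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // any reliable storage path
    val rdd = sc.parallelize(1 to 100).map(_ * 2)
    rdd.checkpoint()   // marks the RDD; the data is written on the next action
    rdd.count()        // materializes the RDD and truncates its lineage at the checkpoint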

Re: How to create RDD over hashmap?

2014-01-24 Thread Mark Hamstra
You usually don't need to do so explicitly since the implicit conversions in Spark will take care of that for you. Any RDD[(K, V)] is a PairRDD; so, e.g., sc.parallelize(1 to 10).map(i => (i, i.toString)) is just one of many ways to generate a PairRDD. On Fri, Jan 24, 2014 at 2:23 PM, Manoj
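A short spark-shell illustration of the point above: the tuple-valued RDD picks up pair operations through the implicit conversion without any explicit wrapping.

    val pairs = sc.parallelize(1 to 10).map(i => (i, i.toString))   // RDD[(Int, String)]
    pairs.reduceByKey(_ + _)      // pair-RDD method, available via the implicit conversion
    pairs.groupByKey().collect()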

Re: .intersection() method on RDDs?

2014-01-23 Thread Mark Hamstra
...or `keys` instead of `map(_._1)`. On Thu, Jan 23, 2014 at 5:36 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Yup (well, with _._1 at the end!) On Thu, Jan 23, 2014 at 5:28 PM, Andrew Ash and...@andrewash.com wrote: You're thinking like this? A.map(v => (v,None)).join(B.map(v =>
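A spark-shell sketch of the join-based intersection being pieced together above, with toy inputs; the distinct() at the end is an assumption in case the inputs contain duplicate elements.

    val A = sc.parallelize(Seq(1, 2, 3, 4))
    val B = sc.parallelize(Seq(3, 4, 5))
    val intersection = A.map(v => (v, None)).join(B.map(v => (v, None))).keys.distinct()
    intersection.collect()   // Array(3, 4), in some order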

Re: why is it so slow to run sbt/sbt assembly in my machine?

2014-01-22 Thread Mark Hamstra
It's a bug in sbt and/or the sbt-assembly plugin. AFAIK, nobody has isolated yet exactly what triggers the problem (only a minority of users run into it), but there are workarounds and an eventual fix. The fix appears to be PR #266 https://github.com/apache/incubator-spark/pull/266. Until that

Re: why is it so slow to run sbt/sbt assembly in my machine?

2014-01-22 Thread Mark Hamstra
, Mark Hamstra m...@clearstorydata.com wrote: It's a bug in sbt and/or the sbt-assembly plugin. AFAIK, nobody has isolated yet exactly what triggers the problem (only a minority of users run into it), but there are workarounds and an eventual fix. The fix appears to be PR #266 https://github.com

Re: FileNotFoundException on distinct()?

2014-01-19 Thread Mark Hamstra
distinct() in 0.7.3 for the same computation. What do people usually have in /proc/sys/fs/file-max? I'm real surprised that 13M isn't enough. On Sat, Jan 18, 2014 at 11:47 PM, Mark Hamstra m...@clearstorydata.com wrote: distinct() needs to do a shuffle -- which is resulting in the need

Re: FileNotFoundException on distinct()?

2014-01-18 Thread Mark Hamstra
distinct() needs to do a shuffle -- which is resulting in the need to materialize the map outputs as files. count() doesn't. On Sat, Jan 18, 2014 at 10:33 PM, Ryan Compton compton.r...@gmail.com wrote: I'm able to open ~13M files. I expect the output of .distinct().count() to be under 100M,

Re: Exception in thread DAGScheduler scala.MatchError: None (of class scala.None$)

2014-01-16 Thread Mark Hamstra
and every is identical from what I can tell. (defsparkfn) uses (sparkop) under the hood as well, so that code is essentially identical, which has me scratching my head. On Wed, Jan 15, 2014 at 2:56 PM, Mark Hamstra m...@clearstorydata.com wrote: Okay, that fits with what I was expecting. What

Re: How does shuffle work in spark ?

2014-01-16 Thread Mark Hamstra
And if you are willing to take on yourself the responsibility of guaranteeing that your mapping preserves partitioning, then you can look to use mapPartitions, mapPartitionsWithIndex, mapPartitionsWithContext or mapWith, each of which has a preservesPartitioning parameter that you can set to true
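A spark-shell sketch of taking on that responsibility with mapPartitions: the function only touches values, so it is safe to declare that the partitioning is preserved. The data and partition count are illustrative.

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i)).partitionBy(new HashPartitioner(10))
    val doubled = pairs.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * 2) },   // keys untouched, so the claim holds
      preservesPartitioning = true)
    doubled.partitioner == pairs.partitioner   // true: a later reduceByKey/join can skip a shuffle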

Re: Exception in thread DAGScheduler scala.MatchError: None (of class scala.None$)

2014-01-15 Thread Mark Hamstra
Okay, that fits with what I was expecting. What does your reduce function look like? On Wed, Jan 15, 2014 at 2:33 PM, Soren Macbeth so...@yieldbot.com wrote: 0.8.1-incubating running locally. On January 15, 2014 at 2:28:00 PM, Mark Hamstra (m...@clearstorydata.com

Re: Getting java.netUnknownHostException

2014-01-10 Thread Mark Hamstra
Which Spark version? The 'run' script no longer exists in current Spark. On Jan 10, 2014, at 4:57 AM, Rishi ri...@knoldus.com wrote: I am trying to run standalone mode of spark. I have 2 machines, A and B. I run master on machine A by running this command : ./run

Re: graphx merge for scala 2.9

2014-01-09 Thread Mark Hamstra
https://github.com/apache/incubator-spark/pull/367 On Thu, Jan 9, 2014 at 9:27 PM, Denny Lee denny.g@gmail.com wrote: Quick follow up question here - what is the rough timeline for the GraphX merge whether it is 0.8.x (hopefully) or 0.9.x? On Fri, Dec 27, 2013 at 11:11 AM, Koert

Re: How to time transformations and provide more detailed progress report?

2014-01-07 Thread Mark Hamstra
"When we time an action it includes all the transformations timings too, and it is not clear which transformation takes how long. Is there a way of timing each transformation separately?" Not really, because even though you may logically specify several different transformations within your

Re: mapWith and array index as key

2013-12-24 Thread Mark Hamstra
No, the index referred to in mapWith (as well as in mapPartitionsWithIndex and several other RDD methods) is the index of the RDD's partitions. So, for example, in a typical case of an RDD read in from a distributed filesystem where the input file occupies n blocks, the index values in mapWith
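A small spark-shell sketch making the same point: the index handed to the function is the partition's index, not an element's position.

    val rdd = sc.parallelize(1 to 10, 3)     // 3 partitions
    rdd.mapPartitionsWithIndex { (partIdx, iter) =>
      iter.map(x => (partIdx, x))            // tag each element with its partition index
    }.collect()
    // Array((0,1), (0,2), (0,3), (1,4), (1,5), (1,6), (2,7), (2,8), (2,9), (2,10))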

Re: Noob Spark questions

2013-12-23 Thread Mark Hamstra
"Though there is a saying that Java in Spark is slower than Scala shell" That shouldn't be said. The Java API is mostly a thin wrapper of the Scala implementation, and the performance of the Java API is intended to be equivalent to that of the Scala API. If you're finding that not to be true,

Re: how to kill a stage in spark

2013-12-19 Thread Mark Hamstra
That's not really something that should be done by a Spark user since there isn't a public API to directly launch, cancel, and clean up after stages -- and doing so internally within Spark requires some knowledge and concern for how stages are created, tracked, controlled, and coordinated between

Re: exposing spark through a web service

2013-12-13 Thread Mark Hamstra
https://github.com/apache/incubator-spark/pull/222 On Fri, Dec 13, 2013 at 8:36 AM, Philip Ogren philip.og...@oracle.com wrote: Hi Spark Community, I would like to expose my spark application/libraries via a web service in order to launch jobs, interact with users, etc. I'm sure there are

Re: reading a specific key-value

2013-12-13 Thread Mark Hamstra
Right, if your RDD has a Partitioner, then lookup() will use that to determine which partition contains the key that you want to lookup and only run a task on that partition. That still doesn't efficiently solve the lookup-a-set-of-keys problem, but extending lookup() to efficiently handle a

Re: Re: reading a specific key-value

2013-12-13 Thread Mark Hamstra
It means that the partitioner (Option[Partitioner]) field of the RDD is Some(p), not None. Which, in turn, means that for a key k, the RDD knows how to find which partition contains that k. In order for that to be true, the RDD has to have been partitioned by key, and after that only
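A spark-shell sketch of that condition and its payoff; the data and partition count are illustrative.

    import org.apache.spark.HashPartitioner

    val byKey = sc.parallelize(1 to 1000).map(i => (i, i.toString)).partitionBy(new HashPartitioner(8))
    byKey.partitioner    // Some(HashPartitioner) -- the RDD has been partitioned by key
    byKey.lookup(42)     // Seq("42"); only the one partition that can hold key 42 is scanned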

Re: Spark map performance question

2013-12-10 Thread Mark Hamstra
You're not marking rdd1 as cached (actually, to-be-cached-after-next-evaluation) until after rdd1.count; so when you hit rdd2.count, rdd1 is not yet cached (no action has been performed on it since it was marked as cached) and has to be completely re-evaluated. On the other hand, by the time you
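A sketch of the ordering being described, with a hypothetical input path and trivial functions standing in for the original code.

    val rdd1 = sc.textFile("hdfs:///some/input").map(_.length)   // stand-in for the real rdd1
    rdd1.cache()     // mark as to-be-cached BEFORE the first action on it...
    rdd1.count()     // ...so this action both computes rdd1 and populates the cache
    val rdd2 = rdd1.map(_ * 2)
    rdd2.count()     // reuses the cached rdd1 instead of re-reading and re-parsing the input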

Re: JavaPairRDD mapPartitions side effects

2013-12-09 Thread Mark Hamstra
Neither map nor mapPartitions mutates an RDD -- if you establish an immutable reference to an rdd (e.g., in Scala, val rdd = ...), the data contained within that RDD will be the same regardless of any map or mapPartition transformations. However, if you re-assign the reference to point to the
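A minimal spark-shell illustration of the immutability point: transformations return new RDDs and leave the original untouched.

    val rdd = sc.parallelize(1 to 5)
    val doubled = rdd.mapPartitions(_.map(_ * 2))   // returns a NEW RDD; rdd itself is unchanged
    rdd.collect()       // Array(1, 2, 3, 4, 5)
    doubled.collect()   // Array(2, 4, 6, 8, 10)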

Re: RDD cache question

2013-11-30 Thread Mark Hamstra
Your question doesn't really make any sense without specifying where any RDD actions take place (i.e. where Spark jobs are actually run.) Without any actions, all you've outlined so far are different ways to specify the chain of transformations that should be evaluated when an action is

Re: RDD cache question

2013-11-30 Thread Mark Hamstra
() (in step 2.) would be performing calculations and writing information to a DB. Is this the information that was missing ? Thanks, Yadid On 11/30/13 9:24 PM, Mark Hamstra wrote: Your question doesn't really make any sense without specifying where any RDD actions take place (i.e. where

Re: Whitelisting Spark ports in iptables

2013-11-24 Thread Mark Hamstra
Problem is, though, that if a given port is found not to be available, then Spark will try another; so your list is more a list of the first ports Spark will try than it is of the ports Spark will use. On Sun, Nov 24, 2013 at 5:06 PM, Andrew Ash and...@andrewash.com wrote: Hi Spark list, I'm

Re: Reuse the Buffer Array in the map function?

2013-11-19 Thread Mark Hamstra
mapWith can make this use case even simpler. On Nov 19, 2013, at 1:29 AM, Sebastian Schelter ssc.o...@googlemail.com wrote: You can use mapPartition, which allows you to apply the map function elementwise to all elements of a partition. Here you can place custom code around your
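A spark-shell sketch of the mapPartitions pattern quoted above: one buffer allocated per partition and reused across elements. The buffer size, element type, and per-element work are all illustrative. Emitting a value derived from the buffer, rather than the buffer itself, matters here because the iterator is consumed lazily and the buffer is shared across elements.

    val out = sc.parallelize(1 to 1000, 4).mapPartitions { iter =>
      val buffer = new Array[Double](16)     // allocated once per partition, reused below
      iter.map { x =>
        java.util.Arrays.fill(buffer, 0.0)   // reset the reused buffer for this element
        buffer(x % 16) = x.toDouble          // ...per-element work using the buffer...
        buffer.sum                           // emit a value derived from it, not the buffer itself
      }
    }
    out.count()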

Re: multiple concurrent jobs

2013-11-19 Thread Mark Hamstra
On Tue, Nov 19, 2013 at 6:50 AM, Yadid Ayzenberg ya...@media.mit.edu wrote: Hi all, According to the documentation, spark standalone currently only supports a FIFO scheduling system. I understand it's possible to limit the number of cores a job uses by setting spark.cores.max. When running

Re: multiple concurrent jobs

2013-11-19 Thread Mark Hamstra
"According to the documentation, spark standalone currently only supports a FIFO scheduling system." That's not true: http://spark.incubator.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools [sorry for the prior misfire] On Tue, Nov 19, 2013 at 7:30 AM, Mark Hamstra m
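A sketch of what the linked fair-scheduler setup looks like, using the 0.8/0.9-era system-property style of configuration; the property names are from the linked docs, and the pool name is illustrative.

    // Set before the SparkContext is created:
    System.setProperty("spark.scheduler.mode", "FAIR")

    // Optionally assign the jobs submitted from this thread to a named pool:
    sc.setLocalProperty("spark.scheduler.pool", "production")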

Re: multiple concurrent jobs

2013-11-19 Thread Mark Hamstra
, Mark Hamstra m...@clearstorydata.com wrote: According to the documentation, spark standalone currently only supports a FIFO scheduling system. That's not true. [sorry for the prior misfire] On Tue, Nov 19, 2013 at 7:30 AM, Mark Hamstra m...@clearstorydata.com wrote

Re: multiple concurrent jobs

2013-11-19 Thread Mark Hamstra
scheduling. On Tue, Nov 19, 2013 at 8:03 AM, Yadid Ayzenberg ya...@media.mit.edu wrote: My bad - I should have stated that up front. I guess it was kind of implicit within my question. Thanks for your help, Yadid On 11/19/13 10:59 AM, Mark Hamstra wrote: Ah, sorry -- misunderstood

Re: How to filter a sorted RDD

2013-11-04 Thread Mark Hamstra
. Xiang 2013/11/4 Mark Hamstra m...@clearstorydata.com You could short-circuit the filtering within the iterator function supplied to mapPartitions. On Sunday, November 3, 2013, Xiang Huo wrote: Hi all, I am trying to filter a smaller RDD data set from a large RDD data set

Re: How to filter a sorted RDD

2013-11-03 Thread Mark Hamstra
You could short-circuit the filtering within the iterator function supplied to mapPartitions. On Sunday, November 3, 2013, Xiang Huo wrote: Hi all, I am trying to filter a smaller RDD data set from a large RDD data set. And the large one is sorted. So my question is that is there any way
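A spark-shell sketch of that short-circuiting on a sorted, keyed RDD; the data and bounds are illustrative.

    val sorted = sc.parallelize(1 to 1000000).map(i => (i, s"v$i")).sortByKey()
    val (lo, hi) = (100, 200)
    val subset = sorted.mapPartitions { iter =>
      iter.dropWhile(_._1 < lo).takeWhile(_._1 <= hi)   // stops scanning once past hi
    }
    subset.count()   // 101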

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Mark Hamstra
1) when you say Cascading is relatively agnostic about the distributed topology underneath it I take that as a hedge that suggests that while it could be possible to run Spark underneath Cascading this is not something commonly done or would necessarily be straightforward. Is this an unfair

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Mark Hamstra
And I didn't mean to skip over you, Koert. I'm just more familiar with what Oscar said on the subject than with your opinion. On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra m...@clearstorydata.com wrote: Hmmm... I was unaware of this concept that Spark is for medium to large datasets

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Mark Hamstra
/realtime queries to giant batch jobs. as far as i am concerned map-red would be done. our clusters of the future would be hdfs + spark. On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra m...@clearstorydata.com wrote: And I didn't mean to skip over you, Koert. I'm just more familiar with what Oscar

Re: Help with Initial Cluster Configuration / Tuning

2013-10-22 Thread Mark Hamstra
at 8:14 AM, Mark Hamstra m...@clearstorydata.com wrote: There's no need to guess at that. The docs tell you directly: def countByValue(): Map[T, Long] Return the count of each unique value in this RDD as a map of (value, count) pairs. The final combine step happens locally on the master
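A tiny spark-shell illustration of the quoted signature, and of why the locally-performed combine matters for large result sets.

    sc.parallelize(Seq("a", "b", "a", "c", "a")).countByValue()
    // Map(a -> 3, b -> 1, c -> 1) -- assembled on the driver, so the number of DISTINCT
    // values (not the number of elements) is what has to fit in driver memory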

Re: Help with Initial Cluster Configuration / Tuning

2013-10-22 Thread Mark Hamstra
AM, Mark Hamstra m...@clearstorydata.com wrote: Yes, but that also illustrates the problem faced by anyone trying to write a little white paper or guidelines to make newbies' experience painless. Distributed computing clusters are necessarily complex things, and problems can crop up

Re: unable to serialize analytics pipeline

2013-10-22 Thread Mark Hamstra
If you distribute the needed jar(s) to your Workers, you may well be able to instantiate what you need using mapPartitions, mapPartitionsWithIndex, mapWith, flatMapWith, etc. Be careful, though, about teardown of any resource allocation that you may need to do within each partition. On Tue,

Re: Help with Initial Cluster Configuration / Tuning

2013-10-22 Thread Mark Hamstra
, if anyone knows of such an example, I would be very appreciative!). This certainly won't make everything completely painless, but would be invaluable and certainly seems feasible. Thanks again everyone for you help and advice. Tim On Tue, Oct 22, 2013 at 12:01 PM, Mark Hamstra m

Re: Visitor function to RDD elements

2013-10-22 Thread Mark Hamstra
mapPartitions mapPartitionsWithIndex With care, you can use these and maintain the iteration order within partitions. Beware, though, that any reduce functions need to be associative and commutative. On Tue, Oct 22, 2013 at 12:28 PM, Matt Cheah mch...@palantir.com wrote: Hi everyone, I

Re: Visitor function to RDD elements

2013-10-22 Thread Mark Hamstra
Correct; that's the completely degenerate case where you can't do anything in parallel. Often you'll also want your iterator function to send back some information to an accumulator (perhaps just the result calculated with the last element of the partition) which is then fed back into the
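A spark-shell sketch of the accumulator idea, using the 0.8/0.9-era accumulator API; what gets sent back (here, the last element seen in each partition) is just an example.

    val acc = sc.accumulator(0)
    sc.parallelize(1 to 100, 4).foreachPartition { iter =>
      var last = 0
      iter.foreach(x => last = x)   // visit the partition's elements in order
      acc += last                   // send back the last element seen in this partition
    }
    acc.value   // 25 + 50 + 75 + 100 = 250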

Re: Spark map function question

2013-10-21 Thread Mark Hamstra
How can you possibly expect anyone to give you meaningful advice when you have told us nothing about what 'func' or 'proj' does? All I can tell from your post is that you are using unusual syntax; but that on its own isn't sufficient to cause a problem: scala> val func: Int => Int = i => i + 1

Re: spark 0.8

2013-10-17 Thread Mark Hamstra
Of course, you mean 0.9.0-SNAPSHOT. There is no Spark 0.9.0, and won't be for several months. On Thu, Oct 17, 2013 at 3:11 PM, dachuan hdc1...@gmail.com wrote: I'm sorry if this doesn't answer your question directly, but I have tried spark 0.9.0 and hdfs 1.0.4 just now, it works.. On

Re: Spark REPL produces error on a piece of scala code that works in pure Scala REPL

2013-10-12 Thread Mark Hamstra
That's a TODO that is either now possible in the 2.10 branch or pretty close to possible -- which isn't the same thing as easy. On Sat, Oct 12, 2013 at 2:20 PM, Aaron Davidson ilike...@gmail.com wrote: Out of curiosity, does the Scala 2.10 Spark interpreter patch fix this using macros as

Re: Output to a single directory with multiple files rather multiple directories ?

2013-10-11 Thread Mark Hamstra
That won't work. First, parallelize is a SparkContext method called on collections present in your driver process, not an RDD method. An RDD is already a parallel collection, so there is no need to parallelize it. Second, where do your input files reside? It makes a big difference whether they

Re: Standalone Cluster check your cluster UI to ensure that workers are registered error Only For Shark

2013-10-11 Thread Mark Hamstra
If you are using Spark 0.8.0, then you need to be using Shark built from branch-0.8 (or master) of https://github.com/amplab/shark On Fri, Oct 11, 2013 at 12:14 AM, vinayak navale vinayak...@gmail.com wrote: Hi, I have spark up and running, spark-shell works, and issues jobs to the spark

Re: Output to a single directory with multiple files rather multiple directories ?

2013-10-11 Thread Mark Hamstra
If your input files are already in HDFS, then parallelizing the parsing or other transformation of their contents in Spark should be easy -- that's just the way the system works. So you should end up with something like: val inputFilenames: List[String] = ...whatever you need to do to generate a
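One hedged way the outline above might continue (not necessarily how the original message finished): textFile accepts a comma-separated list of paths, so many input files can be parsed in parallel and the results written to a single output directory.

    val inputFilenames: List[String] = List(        // illustrative paths
      "hdfs:///data/input/file-0001.txt",
      "hdfs:///data/input/file-0002.txt")
    val parsed = sc.textFile(inputFilenames.mkString(","))
      .map(line => line.split('\t'))                // stand-in for the real parsing logic
    parsed.map(_.mkString(","))
      .saveAsTextFile("hdfs:///data/output/run-1")  // one directory, one part-file per partition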

Re: Some questions about task distribution and execution in Spark

2013-10-03 Thread Mark Hamstra
No, that is not what allowLocal means. For a very few actions, the DAGScheduler will run the job locally (in a separate thread on the master node) if the RDD in the action has a single partition and no dependencies in its lineage. If allowLocal is false, that doesn't mean that

Re: Scala 2.10?

2013-09-19 Thread Mark Hamstra
Spark using 2.10 (or maybe 2.11) and newer Akka will come together -- but not until after 0.8.0, which is in release candidates now. On Thu, Sep 19, 2013 at 10:20 AM, David Greco gr...@acm.org wrote: Hi Spark team, I totally agree with Paul. Spark it's an amazing tool and there are not real

Re: Scala 2.10?

2013-09-19 Thread Mark Hamstra
Can you motivate yourself enough, Paul, to enter into JIRA your feature requests for Streaming (https://spark-project.atlassian.net/browse/STREAMING) and Spark (https://spark-project.atlassian.net/browse/SPARK)? That way we can avoid a bad combination of lazy and forgetful. On Thu, Sep 19, 2013 at