Not really, since it's considered mostly an implementation detail that
shouldn't be important to users and that could conceivably change in the
future. There are some useful comments in the code if you really want to
figure out how it is implemented.
And I will take a somewhat pedantic exception
Your problem is more basic than that. You can't reference one RDD
(subsetids) from within an operation on another RDD (allobjects.filter).
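The usual workaround is to pull the small RDD of ids down to the driver and reference that local value (or a broadcast of it) inside the filter. A rough, untested sketch, assuming allobjects is keyed by a Long id:

    // subsetids and allobjects are the RDDs from this thread; sc is the SparkContext.
    val idSet = sc.broadcast(subsetids.collect().toSet)   // small enough to collect
    val matching = allobjects.filter { case (id, _) => idSet.value.contains(id) }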
On Wed, Feb 19, 2014 at 2:23 PM, Soumya Simanta soumya.sima...@gmail.com wrote:
I've an RDD that contains ids (Long).
subsetids
res22:
With so little information about what your code is actually doing, what you
have shared looks likely to be an anti-pattern to me. Doing many collect
actions is something to be avoided if at all possible, since this forces a
lot of network communication to materialize the results back within the
Yup, that's pretty cool! Nice work. One niggle, though: Where you have
Ok folks, now, we have a discretized stream of Int... I believe you mean
...stream of Long
On Fri, Feb 14, 2014 at 4:20 PM, Luis Ángel Vicente Sánchez
langel.gro...@gmail.com wrote:
This is awesome! I was wondering
*These packages don't exist in spark anymore.*
They never did exist in spark-core. If your project depends on
spark-streaming, then it should pull in the spark-core dependency as well.
If you only depend on spark-core, then you won't get streaming if you
don't also ask for it. Similarly for
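For example, a build.sbt along these lines (a sketch; adjust the version to whatever release you are on, here the 0.9.0-incubating discussed in this thread) pulls in both modules, and spark-streaming transitively brings in spark-core:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.9.0-incubating",
      "org.apache.spark" %% "spark-streaming" % "0.9.0-incubating"
    )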
Is the assembly jar actually missing, or is the script somehow looking in
the wrong place? Check
/home/centrifuge/spark-0.9.0-incubating-bin-hadoop1/examples/target/scala-2.10.
If the examples-assembly jar is not there, then either `./sbt/sbt
assembly` or `./sbt/sbt examples/assembly` should
classes
-rw-rw-r-- 1 centrifuge centrifuge 125932602 Feb 9 14:51
spark-examples-assembly-0.9.0-incubating.jar
-rw-rw-r-- 1 centrifuge centrifuge 78485738 Feb 2 19:05
spark-examples_2.10-assembly-0.9.0-incubating.jar
Thanks again.
On Mon Feb 10 12:31:31 2014, Mark Hamstra wrote:
https://github.com/apache/incubator-spark/pull/527
On Mon, Feb 10, 2014 at 4:11 PM, Eric Kimbrel
eric.kimb...@soteradefense.com wrote:
I also did a maven clean package with the same options.
What do you mean by the last version of spark-0.9.0? To be precise,
there isn't anything known as spark-0.9.0. What was released recently is
spark-0.9.0-incubating, and there is and only ever will be one version of
that. If you're talking about a 0.9.0-incubating-SNAPSHOT built locally,
then
export MASTER=mesos://zk://10.10.0.141:2181/mesos
On Tue, Feb 4, 2014 at 2:20 AM, Francesco Bongiovanni bongiova...@gmail.com
wrote:
Hi everyone,
I installed the latest Spark release (0.9.0), on top of Mesos, linked to
my
HDFS 1.2.1 (sbt assembly success, make-distribution success), and
Just use `mvn install` or `sbt publish-local` (depending on which build
system you prefer to use) to put your locally-built artifacts into your .m2
or .ivy2 cache, respectively. A typical maven or sbt configuration will
resolve them from those caches without any special modifications or further
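For example, after `sbt publish-local` a locally built snapshot can be referenced like any other dependency (a sketch; the version string is whatever your local build produced):

    // sbt resolves this from ~/.ivy2/local once it has been published locally
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating-SNAPSHOT"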
in the range 'A' - 'M'
be applied toUpperCase and not touch the remaining. Is that possible
without running an 'if' condition on all the partitions in the cluster?
On Tue, Jan 28, 2014 at 1:20 PM, Mark Hamstra m...@clearstorydata.com wrote:
scala> import org.apache.spark.RangePartitioner
scala>
Doesn't avoid an 'if' on every partition, but does avoid it on every
element of every partition.
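A rough, untested sketch of that idea (the data and key choice here are made up; where the range boundary actually falls depends on RangePartitioner's sampling of the keys):

    import org.apache.spark.RangePartitioner
    import org.apache.spark.SparkContext._

    val words = sc.parallelize(Seq("apple", "mango", "zebra", "banana"))
    val pairs = words.map(w => (w.head.toUpper, w))                 // key by first letter
    val ranged = pairs.partitionBy(new RangePartitioner(2, pairs))
    val result = ranged.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.map { case (k, v) => (k, v.toUpperCase) }  // one test per partition
      else iter
    }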
On Tue, Jan 28, 2014 at 1:29 PM, Mark Hamstra m...@clearstorydata.com wrote:
If I'm understanding you correctly, there's lots of ways you could do
that. Here's one, continuing from the previous
It's a basic strategy that several organizations using Spark have followed,
but there isn't yet a canonical implementation or example of such a server
in the Spark source code. That is likely to change before the 1.0 release,
and the included job server is likely to be based on an
Or just checkpoint() it.
On Sat, Jan 25, 2014 at 2:40 PM, Jason Lenderman jslender...@gmail.com wrote:
RDDs are supposed to be immutable. Changing values using foreach seems
like a bad thing to do, and is going to mess up the probability in some
very difficult to understand fashion if you
You usually don't need to do so explicitly since the implicit conversions
in Spark will take care of that for you. Any RDD[(K, V)] is a PairRDD; so,
e.g., sc.parallelize(1 to 10).map(i => (i, i.toString)) is just one of many
ways to generate a PairRDD.
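For instance (a small sketch; sc is the SparkContext):

    import org.apache.spark.SparkContext._   // brings the PairRDDFunctions implicits into scope

    val pairs = sc.parallelize(1 to 10).map(i => (i % 2, i))
    val sums = pairs.reduceByKey(_ + _)      // available because pairs is an RDD[(Int, Int)]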
On Fri, Jan 24, 2014 at 2:23 PM, Manoj
...or `keys` instead of `map(_._1)`.
On Thu, Jan 23, 2014 at 5:36 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
Yup (well, with _._1 at the end!)
On Thu, Jan 23, 2014 at 5:28 PM, Andrew Ash and...@andrewash.com wrote:
You're thinking like this?
A.map(v => (v, None)).join(B.map(v =>
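A sketch completing that idea (assuming A and B are both RDDs of the same element type):

    import org.apache.spark.SparkContext._

    val common = A.map(v => (v, None))
      .join(B.map(v => (v, None)))
      .keys            // same as .map(_._1)
      .distinct()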
It's a bug in sbt and/or the sbt-assembly plugin. AFAIK, nobody has
isolated yet exactly what triggers the problem (only a minority of users
run into it), but there are workarounds and an eventual fix. The fix
appears to be PR #266 https://github.com/apache/incubator-spark/pull/266.
Until that
distinct() in 0.7.3 for the same computation.
What do people usually have in /proc/sys/fs/file-max? I'm real
surprised that 13M isn't enough.
On Sat, Jan 18, 2014 at 11:47 PM, Mark Hamstra m...@clearstorydata.com
wrote:
distinct() needs to do a shuffle -- which is resulting in the need to
materialize the map outputs as files. count() doesn't.
On Sat, Jan 18, 2014 at 10:33 PM, Ryan Compton compton.r...@gmail.com wrote:
I'm able to open ~13M files. I expect the output of
.distinct().count() to be under 100M,
and everything is identical from what I can tell. (defsparkfn) uses (sparkop) under the hood
as well, so that code is essentially identical, which has me scratching my
head.
On Wed, Jan 15, 2014 at 2:56 PM, Mark Hamstra m...@clearstorydata.com wrote:
Okay, that fits with what I was expecting.
What
And if you are willing to take on yourself the responsibility of
guaranteeing that your mapping preserves partitioning, then you can look to
use mapPartitions, mapPartitionsWithIndex, mapPartitionsWithContext or
mapWith, each of which has a preservesPartitioning parameter that you can
set to true
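For example (a sketch; the transformation leaves the keys untouched, so asserting preservesPartitioning is safe here):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._

    val partitioned = sc.parallelize(Seq((1, 10), (2, 20), (3, 30)))
      .partitionBy(new HashPartitioner(4))
    val doubled = partitioned.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * 2) },   // values change, keys don't
      preservesPartitioning = true)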
Okay, that fits with what I was expecting.
What does your reduce function look like?
On Wed, Jan 15, 2014 at 2:33 PM, Soren Macbeth so...@yieldbot.com wrote:
0.8.1-incubating running locally.
On January 15, 2014 at 2:28:00 PM, Mark Hamstra (m...@clearstorydata.com) wrote:
Which Spark version? The 'run' script no longer exists in current Spark.
On Jan 10, 2014, at 4:57 AM, Rishi ri...@knoldus.com wrote:
I am trying to run standalone mode of spark.
I have 2 machines, A and B.
I run master on machine A by running this command : ./run
https://github.com/apache/incubator-spark/pull/367
On Thu, Jan 9, 2014 at 9:27 PM, Denny Lee denny.g@gmail.com wrote:
Quick follow up question here - what is the rough timeline for the GraphX
merge whether it is 0.8.x (hopefully) or 0.9.x?
On Fri, Dec 27, 2013 at 11:11 AM, Koert
When we time an action it includes all the transformations timings too,
and it is not clear which transformation takes how long. Is there a way of
timing each transformation separately?
Not really, because even though you may logically specify several different
transformations within your
No, the index referred to in mapWith (as well as in mapPartitionsWithIndex
and several other RDD methods) is the index of the RDD's partitions. So,
for example, in a typical case of an RDD read in from a distributed
filesystem where the input file occupies n blocks, the index values in
mapWith
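A small sketch of what that means in practice:

    val rdd = sc.parallelize(1 to 10, 4)   // 4 partitions
    val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
      iter.map(elem => (partitionIndex, elem))   // the index is the partition's index, 0 to 3
    }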
Though there is a saying that Java in Spark is slower than Scala shell
That shouldn't be said. The Java API is mostly a thin wrapper of the Scala
implementation, and the performance of the Java API is intended to be
equivalent to that of the Scala API. If you're finding that not to be
true,
That's not really something that should be done by a Spark user since there
isn't a public API to directly launch, cancel, and clean up after stages --
and doing so internally within Spark requires some knowledge and concern
for how stages are created, tracked, controlled, and coordinated between
https://github.com/apache/incubator-spark/pull/222
On Fri, Dec 13, 2013 at 8:36 AM, Philip Ogren philip.og...@oracle.com wrote:
Hi Spark Community,
I would like to expose my spark application/libraries via a web service in
order to launch jobs, interact with users, etc. I'm sure there are
Right, if your RDD has a Partitioner, then lookup() will use that to
determine which partition contains the key that you want to lookup and only
run a task on that partition.
That still doesn't efficiently solve the lookup-a-set-of-keys problem, but
extending lookup() to efficiently handle a
It means that the partitioner (Option[Partitioner]) field of the RDD is
Some(p), not None. Which, in turn, means that for a key k, the RDD knows
how to find which partition contains that k.
In order for that to be true, the RDD has to have been partitioned by key,
and after that only
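A small sketch of the difference (untested; sc is the SparkContext):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._

    val pairs = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val byKey = pairs.partitionBy(new HashPartitioner(8)).cache()
    byKey.partitioner   // Some(HashPartitioner) rather than None
    byKey.lookup(2L)    // only the partition that the partitioner assigns key 2L to is scanned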
You're not marking rdd1 as cached (actually,
to-be-cached-after-next-evaluation) until after rdd1.count; so when you hit
rdd2.count, rdd1 is not yet cached (no action has been performed on it
since it was marked as cached) and has to be completely re-evaluated. On
the other hand, by the time you
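A sketch of the ordering that avoids the re-evaluation (the path and functions here are placeholders):

    val rdd1 = sc.textFile("hdfs:///some/input")   // placeholder path
      .map(_.split(","))
      .cache()                                     // mark as cached before any action
    rdd1.count()                                   // first action materializes and caches rdd1
    val rdd2 = rdd1.filter(_.length > 1)
    rdd2.count()                                   // reuses rdd1's cached partitions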
Neither map nor mapPartitions mutates an RDD -- if you establish an
immutable reference to an rdd (e.g., in Scala, val rdd = ...), the data
contained within that RDD will be the same regardless of any map or
mapPartition transformations. However, if you re-assign the reference to
point to the
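For example:

    val rdd = sc.parallelize(1 to 5)
    val doubled = rdd.map(_ * 2)   // a new RDD; rdd itself is unchanged
    rdd.collect()                  // Array(1, 2, 3, 4, 5)
    doubled.collect()              // Array(2, 4, 6, 8, 10)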
Your question doesn't really make any sense without specifying where any
RDD actions take place (i.e. where Spark jobs are actually run.) Without
any actions, all you've outlined so far are different ways to specify the
chain of transformations that should be evaluated when an action is
() (in step 2.) would be
performing calculations and writing information to a DB.
Is this the information that was missing ?
Thanks,
Yadid
On 11/30/13 9:24 PM, Mark Hamstra wrote:
Your question doesn't really make any sense without specifying where any
RDD actions take place (i.e. where
Problem is, though, that if a given port is found not to be available, then
Spark will try another; so your list is more a list of the first ports
Spark will try than it is of the ports Spark will use.
On Sun, Nov 24, 2013 at 5:06 PM, Andrew Ash and...@andrewash.com wrote:
Hi Spark list,
I'm
mapWith can make this use case even simpler.
On Nov 19, 2013, at 1:29 AM, Sebastian Schelter ssc.o...@googlemail.com
wrote:
You can use mapPartitions, which allows you to apply the map function
elementwise to all elements of a partition. Here you can place custom code
around your
On Tue, Nov 19, 2013 at 6:50 AM, Yadid Ayzenberg ya...@media.mit.edu wrote:
Hi all,
According to the documentation, spark standalone currently only supports a
FIFO scheduling system.
I understand its possible to limit the number of cores a job uses by
setting spark.cores.max.
When running
According to the documentation, spark standalone currently only supports a
FIFO scheduling system.
That's not true.
http://spark.incubator.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools
[sorry for the prior misfire]
On Tue, Nov 19, 2013 at 7:30 AM, Mark Hamstra m...@clearstorydata.com wrote:
According to the documentation, spark standalone currently only supports a
FIFO scheduling system.
That's not true.
[sorry for the prior misfire]
On Tue, Nov 19, 2013 at 7:30 AM, Mark Hamstra m...@clearstorydata.com wrote:
scheduling.
On Tue, Nov 19, 2013 at 8:03 AM, Yadid Ayzenberg ya...@media.mit.edu wrote:
My bad - I should have stated that up front. I guess it was kind of
implicit within my question.
Thanks for your help,
Yadid
On 11/19/13 10:59 AM, Mark Hamstra wrote:
Ah, sorry -- misunderstood
.
Xiang
2013/11/4 Mark Hamstra m...@clearstorydata.com
You could short-circuit the filtering within the iterator function
supplied to mapPartitions.
On Sunday, November 3, 2013, Xiang Huo wrote:
Hi all,
I am trying to filter a smaller RDD data set from a large RDD data set.
And the large one is sorted. So my question is: is there any way
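A rough sketch of the short-circuiting suggested above (the bounds are placeholders; because the large RDD is sorted, each partition's iterator is sorted too, so the scan of a partition can stop as soon as the elements pass the upper bound):

    val lowerBound = 100
    val upperBound = 200
    val subset = largeSorted.mapPartitions { iter =>
      iter.dropWhile(_ < lowerBound).takeWhile(_ <= upperBound)
    }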
1) when you say Cascading is relatively agnostic about the distributed
topology underneath it I take that as a hedge that suggests that while it
could be possible to run Spark underneath Cascading this is not something
commonly done or would necessarily be straightforward. Is this an unfair
And I didn't mean to skip over you, Koert. I'm just more familiar with
what Oscar said on the subject than with your opinion.
On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra m...@clearstorydata.com wrote:
Hmmm... I was unaware of this concept that Spark is for medium to large
datasets
/realtime queries to giant batch jobs. as far as i am
concerned map-red would be done. our clusters of the future would be hdfs +
spark.
On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra m...@clearstorydata.com wrote:
And I didn't mean to skip over you, Koert. I'm just more familiar with
what Oscar
, at 8:14 AM, Mark Hamstra m...@clearstorydata.com wrote:
There's no need to guess at that. The docs tell you directly:
def countByValue(): Map[T, Long]
Return the count of each unique value in this RDD as a map of (value,
count) pairs. The final combine step happens locally on the master
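For example:

    val rdd = sc.parallelize(Seq("a", "b", "a", "c", "a"))
    val counts = rdd.countByValue()   // Map(a -> 3, b -> 1, c -> 1), combined on the driver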
AM, Mark Hamstra m...@clearstorydata.com wrote:
Yes, but that also illustrates the problem faced by anyone trying to
write a little white paper or guidelines to make newbies' experience
painless. Distributed computing clusters are necessarily complex things,
and problems can crop up
If you distribute the needed jar(s) to your Workers, you may well be able
to instantiate what you need using mapPartitions, mapPartitionsWithIndex,
mapWith, flatMapWith, etc. Be careful, though, about teardown of any
resource allocation that you may need to do within each partition.
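A sketch of the instantiation side (the parser class is a placeholder; note that any cleanup is awkward because the returned iterator is consumed lazily):

    // ExpensiveParser stands in for some costly or non-serializable helper.
    class ExpensiveParser { def parse(line: String): Array[String] = line.split(",") }

    val lines = sc.textFile("hdfs:///some/logs")   // placeholder path
    val parsed = lines.mapPartitions { iter =>
      val parser = new ExpensiveParser()           // constructed on the worker, once per partition
      iter.map(line => parser.parse(line))
    }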
On Tue,
, if anyone knows of such an example, I would be
very appreciative!). This certainly won't make everything completely
painless, but would be invaluable and certainly seems feasible.
Thanks again everyone for your help and advice.
Tim
On Tue, Oct 22, 2013 at 12:01 PM, Mark Hamstra m
mapPartitions
mapPartitionsWithIndex
With care, you can use these and maintain the iteration order within
partitions. Beware, though, that any reduce functions need to be
associative and commutative.
On Tue, Oct 22, 2013 at 12:28 PM, Matt Cheah mch...@palantir.com wrote:
Hi everyone,
I
Correct; that's the completely degenerate case where you can't do anything
in parallel. Often you'll also want your iterator function to send back
some information to an accumulator (perhaps just the result calculated with
the last element of the partition) which is then fed back into the
How can you possibly expect anyone to give you meaningful advice when you
have told us nothing about what 'func' or 'proj' does? All I can tell from
your post is that you are using unusual syntax; but that on its own isn't
sufficient to cause a problem:
scala> val func: Int => Int = i => i + 1
Of course, you mean 0.9.0-SNAPSHOT. There is no Spark 0.9.0, and won't be
for several months.
On Thu, Oct 17, 2013 at 3:11 PM, dachuan hdc1...@gmail.com wrote:
I'm sorry if this doesn't answer your question directly, but I have tried
spark 0.9.0 and hdfs 1.0.4 just now, it works..
On
That's a TODO that is either now possible in the 2.10 branch or pretty
close to possible -- which isn't the same thing as easy.
On Sat, Oct 12, 2013 at 2:20 PM, Aaron Davidson ilike...@gmail.com wrote:
Out of curiosity, does the Scala 2.10 Spark interpreter patch
fix this using macros as
That won't work. First, parallelize is a SparkContext method called on
collections present in your driver process, not an RDD method. An RDD is
already a parallel collection, so there is no need to parallelize it.
Second, where do your input files reside? It makes a big difference
whether they
If you are using Spark 0.8.0, then you need to be using Shark built from
branch-0.8 (or master) of https://github.com/amplab/shark
On Fri, Oct 11, 2013 at 12:14 AM, vinayak navale vinayak...@gmail.com wrote:
Hi,
I have spark up and running, spark-shell works, and issues jobs to the
spark
If your input files are already in HDFS, then parallelizing the parsing or
other transformation of their contents in Spark should be easy -- that's
just the way the system works. So you should end up with something like:
val inputFilenames: List[String] = ...whatever you need to do to generate a
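A sketch in the same spirit (the paths and the split-based parsing are placeholders, not from the original message):

    val inputFilenames: List[String] = List("hdfs:///data/part1", "hdfs:///data/part2")
    val lines = sc.textFile(inputFilenames.mkString(","))   // one RDD spanning all the listed files
    val records = lines.map(_.split('\t'))                  // the parsing runs in parallel on the cluster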
No, that is not what allowLocal means. For a very few actions, the
DAGScheduler will run the job locally (in a separate thread on the master
node) if the RDD in the action has a single partition and no dependencies
in its lineage. If allowLocal is false, that doesn't mean that
Spark using 2.10 (or maybe 2.11) and newer Akka will come together -- but
not until after 0.8.0, which is in release candidates now.
On Thu, Sep 19, 2013 at 10:20 AM, David Greco gr...@acm.org wrote:
Hi Spark team,
I totally agree with Paul. Spark is an amazing tool and there are not
real
Can you motivate yourself enough, Paul, to enter into JIRA your feature
requests for Streaming (https://spark-project.atlassian.net/browse/STREAMING)
and Spark (https://spark-project.atlassian.net/browse/SPARK)? That way we can
avoid a bad combination of lazy and forgetful.
On Thu, Sep 19, 2013 at