In addition, you can
easily verify there are no collisions with Spark before running anything
through GraphX -- create the mapping and then groupByKey to find any
keys with multiple mappings.
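A rough sketch of that check (names are illustrative: mapping is assumed to be an RDD of (truncatedId, originalUuid) pairs):

  // Group by the truncated key and keep any key that maps to more than
  // one distinct original UUID -- those are the collisions.
  val collisions = mapping
    .groupByKey()
    .filter { case (_, originals) => originals.toSet.size > 1 }
  println(collisions.count() + " colliding keys")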
Ewen Cheslack-Postava
February 24, 2014
at 10:58 AM
You can almost certainly
take half of the UUID safely, assuming you're using random UUIDs. You
could work out the math if you're really concerned, but the probability
of a collision in 64 bits is probably pretty low even with a very large
data set. If your UUIDs aren't version 4, you probably just want to check
which parts of the UUID are actually random before truncating.
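For a rough sense of the numbers, a back-of-the-envelope birthday-bound sketch:

  // Approximate probability of any collision among n random 64-bit keys:
  // p ~= 1 - exp(-n * (n - 1) / 2^65)
  def collisionProbability(n: Double): Double =
    1.0 - math.exp(-n * (n - 1) / math.pow(2.0, 65))

  collisionProbability(1e8)   // ~2.7e-4 for 100 million keys
  collisionProbability(1e9)   // ~2.7e-2 for 1 billion keys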
On Wed, Feb
19, 2014 at 11:05 PM, Ewen Cheslack-Postava <m...@ewencp.org>
wrote:
Only originalRDD is cached. You need to call cache/persist for every RDD
you want cached.
David Thomas
February 19, 2014
at 10:03 PM
When I
persist/cache an RDD, are all the derived RDDs cached as well or do I
need to call cache individually on each RDD if I need them?
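Concretely (a sketch; the path and transformations are illustrative, and sc is an existing SparkContext):

  // Caching originalRDD does not implicitly cache RDDs derived from it;
  // each RDD you want kept in memory needs its own cache()/persist().
  val originalRDD = sc.textFile("hdfs://host/path/input")
  originalRDD.cache()

  val derivedRDD = originalRDD.map(_.toUpperCase)
  derivedRDD.cache()   // must be requested explicitly as well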
You need to tell it to use
more cores by specifying MASTER=local[N] where N is the number of cores
you want to use. See the "Initializing Spark" and "Master URLs"
sections of the Scala programming guide:
http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
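The equivalent when constructing the context yourself is a sketch like this (master string and app name are illustrative):

  import org.apache.spark.SparkContext

  // "local[4]" runs with 4 local cores; plain "local" uses only one.
  val sc = new SparkContext("local[4]", "MyApp")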
Ewen
...Scala programs to work on my cluster yet.
thanks-Soumya
Ewen Cheslack-Postava
February 7, 2014
at 4:39 PM
No need for a custom
reduceByKey. You can just use the bucket as the key and then group by
key. For example, if you have an RDD[DataElement] with your data (not
keyed by anything yet), do something like
data.keyBy(elem => getBucket(elem.timestamp)).groupByKey()
where getBucket maps a timestamp to a bucket ID.
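Filled out a little (a sketch; DataElement and the hour-sized buckets are illustrative stand-ins):

  case class DataElement(timestamp: Long, value: Double)

  // Map a timestamp (ms) to an hour-sized bucket ID.
  def getBucket(timestamp: Long): Long = timestamp / (60 * 60 * 1000)

  // data: RDD[DataElement], not keyed by anything yet.
  val byBucket = data.keyBy(elem => getBucket(elem.timestamp)).groupByKey()
  // byBucket: RDD[(Long, Seq[DataElement])], one entry per bucket.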
You might also try forcing
allocation of new strings when you store them, i.e. do new
String(orig_result) as the final step. Depending on the operations you
do, your strings may be backed by much larger strings. For example,
substring() may hold onto the underlying bytes and just present a substring view of them.
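For example, something like this in the final step (a sketch; records and the field positions are illustrative):

  val fields = records.map { line =>
    val field = line.substring(10, 20)
    // Force a copy so the result no longer shares the parent string's
    // backing array and only the 10 characters are retained.
    new String(field)
  }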
...getting the
error message. It seems that foreachPartition doesn't exist as part of
the DStream class :-\ I will check the API docs to find other alternatives.
2014-02-02
Ewen Cheslack-Postava <m...@ewencp.org>:
If you use anything
created on the driver program within functions run on workers, it needs
to be serializable, but your pool of Redis connections is not. Normally,
the simple way to fix this is to use the *With methods of RDD (mapWith,
flatMapWith, filterWith, and in this case, foreachWith), which take a
constructor function that runs once per partition on the worker, so the
pool is created there rather than shipped from the driver.
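Roughly like this (a sketch using the old foreachWith form; RedisPool, redisHost and redisPort are stand-ins for whatever is actually in use):

  // The constructor function runs on the worker, once per partition, so
  // the non-serializable pool never has to leave the driver.
  rdd.foreachWith(partitionIndex => new RedisPool(redisHost, redisPort)) {
    (record, pool) => pool.set(record.key, record.value)
  }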
I think Mayur pointed to
that code because it includes the relevant initialization code you were
asking about. Running on a cluster doesn't require much change: pass the
spark:// address of the master instead of "local" and add any jars
containing your code. You could set the jars manually, but
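In code the change looks roughly like this (a sketch using the old multi-argument SparkContext constructor; the master URL, SPARK_HOME path and jar name are placeholders):

  import org.apache.spark.SparkContext

  val sc = new SparkContext(
    "spark://master-host:7077",                    // instead of "local"
    "MyApp",
    "/path/to/spark",                              // SPARK_HOME on the cluster
    Seq("target/scala-2.10/myapp-assembly.jar"))   // jars containing your code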
January 16, 2014
1:22 PM
Thanks Patrick and Ewen, great answers. So a shuffle dependency that can
cause a shuffle will store the data in memory + disk. More often in
memory. Is my understanding correct?
Regards, SB
Ewen Cheslack-Postava
January 16, 2014
1:08 PM
The difference between a
shuffle dependency and a transformation that can cause a shuffle is
probably worth pointing out.
The mentioned transformations (groupByKey, join, etc) *might* generate a
shuffle dependency on input RDDs, but they won't necessarily. For
example, if you join() two RDDs that are already partitioned with the
same partitioner, the join can reuse that partitioning and doesn't need
to shuffle either input.
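A small sketch of that case (leftPairs and rightPairs are illustrative key-value RDDs):

  import org.apache.spark.HashPartitioner

  // Partitioning both sides with the same partitioner means join() sees
  // co-partitioned inputs and can build the result without a new shuffle
  // (the partitionBy calls themselves may shuffle once, of course).
  val partitioner = new HashPartitioner(16)
  val left  = leftPairs.partitionBy(partitioner).cache()
  val right = rightPairs.partitionBy(partitioner).cache()
  val joined = left.join(right)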
From the stack
trace, it
looks like the driver program is dying trying to serialize data out to
the workers. My guess is that whatever machine you're running from has a
relatively small default maximum heap size and trying to broadcast the
49MB file is causing it to run out of memory. I don't
Obviously it depends on
what is missing, but if I were you, I'd try monkey patching pyspark with
the functionality you need first (along with submitting a pull request,
of course). The pyspark code is very readable, and a lot of
functionality just builds on top of a few primitives, as in the Scala API.
I build an uber jar with sbt. I used this
https://github.com/ianoc/ExampleSparkProject as a starting point. This
generates an uber jar with sbt-assembly and I just need to ship that
one file to the master. Then I use Spark to ship it to the executors.
I use the EC2 standalone script. Once I've got t
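For reference, the sbt-assembly setup is small (a sketch; the plugin version shown is illustrative, use whatever matches your sbt version):

  // project/plugins.sbt
  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

  // Then `sbt assembly` produces a single jar under target/ bundling your
  // code and its dependencies -- the one file shipped to the master.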
You can use the SPARK_MEM environment variable instead of setting the
system property.
If you need to set other properties that can't be controlled by
environment variables (which is why I wrote that patch), you can just
apply that patch directly to your binary package -- it only patches a
Python file.
This line
> 13/11/19 23:17:20 INFO MemoryStore: MemoryStore started with capacity 323.9 MB
looks like what you'd get if you haven't set spark.executor.memory (or
SPARK_MEM). Without setting it you'll get the default to 512m per
executor and .66 of that for the cache.
-Ewen
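For reference, roughly 0.66 of a 512m heap (after JVM overhead) is about the 324 MB shown in that log line. Setting the property the old system-property way looks like this (a sketch; the value and master URL are illustrative, and SPARK_MEM is the environment-variable alternative mentioned above):

  // Must be set before the SparkContext is constructed.
  System.setProperty("spark.executor.memory", "4g")
  val sc = new SparkContext("spark://master-host:7077", "MyApp")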
Well, he did mention that not everything was staying in the cache, so even
with an ongoing job they're probably re-reading from Cassandra. It
sounds to me like the first issue to address is why things are being
evicted.
-Ewen
Ewen Cheslack-Postava
StraightUp | http://readstraightu