Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Ewen Cheslack-Postava
In addition, you can easily verify there are no collisions with Spark before running anything through GraphX -- create the mapping and then groupByKey to find any keys with multiple mappings.
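A minimal sketch of that check, assuming a SparkContext named sc and java.util.UUID values; the sample data and variable names are made up, not from the thread:

```scala
import java.util.UUID
import org.apache.spark.SparkContext._  // pair-RDD functions on pre-1.3 Spark APIs

// Stand-in for the real set of UUID vertex identifiers.
val uuids = sc.parallelize(Seq.fill(10000)(UUID.randomUUID()))

// Candidate 64-bit vertex ID -> original UUID.
val idMapping = uuids.map(u => (u.getMostSignificantBits, u))

// Any Long key that maps back to more than one distinct UUID is a collision.
val collisions = idMapping.groupByKey().filter { case (_, us) => us.toSet.size > 1 }
require(collisions.count() == 0, "64-bit vertex IDs collide; choose a different mapping")
```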

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Ewen Cheslack-Postava
You can almost certainly take half of the UUID safely, assuming you're using random UUIDs. You could work out the math if you're really concerned, but the probability of a collision in 64 bits is probably pretty low even with a very large data set. If your UUIDs aren't version 4, you probably j
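For reference, the "take half of the UUID" idea in code; this is just a sketch, and note that a version-4 UUID's upper half still carries the fixed 4-bit version field, so it holds 60 random bits rather than 64:

```scala
import java.util.UUID

// Use the upper 64 bits of a random UUID as a GraphX-style Long vertex ID.
def toVertexId(u: UUID): Long = u.getMostSignificantBits
```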

Re: Basic question on RDD caching

2014-02-20 Thread Ewen Cheslack-Postava
f memory? On Wed, Feb 19, 2014 at 11:05 PM, Ewen Cheslack-Postava <m...@ewencp.org> wrote: Only originalRDD is cached. You need to call cache/persist for every RDD you want cached.

Re: Basic question on RDD caching

2014-02-19 Thread Ewen Cheslack-Postava
Only originalRDD is cached. You need to call cache/persist for every RDD you want cached. David Thomas February 19, 2014 at 10:03 PM When I persist/cache an RDD, are all the derived RDDs cached as well or do I need to call cache individually on each RDD if I need them
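A small sketch of that point, assuming a SparkContext named sc (the path and transformations are made up): marking the parent RDD as cached does nothing for RDDs derived from it.

```scala
import org.apache.spark.storage.StorageLevel

val originalRDD = sc.textFile("hdfs:///data/input")  // hypothetical input path
originalRDD.cache()                                   // only originalRDD will be cached

val derivedRDD = originalRDD.map(_.toUpperCase)
derivedRDD.persist(StorageLevel.MEMORY_ONLY)          // must be requested explicitly as well

derivedRDD.count()  // first action computes both RDDs and caches each one that was marked
```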

Re: Fwd: Why is Spark not using all cores on a single machine?

2014-02-17 Thread Ewen Cheslack-Postava
You need to tell it to use more cores by specifying MASTER=local[N] where N is the number of cores you want to use. See the "Initializing Spark" and "Master URLs" sections of the Scala programming guide: http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html Ewen
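Concretely, that looks something like the following (the app name and thread count are arbitrary); a sketch against the SparkContext API of that era:

```scala
import org.apache.spark.SparkContext

// "local[4]" runs Spark in-process with 4 worker threads; plain "local" uses only one.
// For the shell, the equivalent is e.g. MASTER=local[4] ./bin/spark-shell
val sc = new SparkContext("local[4]", "MyLocalApp")
```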

Re: Best way to implement this in Spark

2014-02-07 Thread Ewen Cheslack-Postava
Scala programs to work on my cluster yet. Thanks -Soumya

Re: Best way to implement this in Spark

2014-02-07 Thread Ewen Cheslack-Postava
No need for a custom reduceByKey. You can just use the bucket as the key and then group by key. For example, if you have an RDD[DataElement] with your data (not keyed by anything yet), do something like data.keyBy(elem => getBucket(elem.timestamp)).groupByKey() where getBucket gets a bucket ID
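Filling in the surrounding pieces as a sketch, assuming a SparkContext named sc (the fields of DataElement and the hourly bucket width are assumptions, not from the thread):

```scala
import org.apache.spark.SparkContext._  // pair-RDD functions on pre-1.3 Spark APIs

// Hypothetical record type; only the timestamp matters for bucketing.
case class DataElement(timestamp: Long, value: Double)

// Map a millisecond timestamp to an hourly bucket ID (assumed granularity).
def getBucket(timestamp: Long): Long = timestamp / (60L * 60 * 1000)

val data = sc.parallelize(Seq(
  DataElement(1391760000000L, 1.0),
  DataElement(1391760005000L, 2.0),
  DataElement(1391763600000L, 3.0)))

// Key each element by its bucket, then collect all elements per bucket.
val byBucket = data.keyBy(elem => getBucket(elem.timestamp)).groupByKey()
```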

Re: In Memory Caching blowing up the size

2014-02-07 Thread Ewen Cheslack-Postava
You might also try forcing allocation of new strings when you store them, i.e. do new String(orig_result) as the final step. Depending on the operations you do, your strings may be backed by much larger strings. For example, substring() may hold onto the underlying bytes and just present a subs
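A sketch of that idea (the strings here are stand-ins): before Java 7u6, String.substring shared the parent's backing char array, so caching a small substring could pin a much larger buffer.

```scala
val line  = "x" * (1 << 20)        // stand-in for a very large input string
val field = line.substring(0, 16)  // on pre-7u6 JVMs this still references line's char array

// Force a copy so only the 16 characters are retained when the value is cached.
val compactField = new String(field)
```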

Re: Persisting RDD to Redis

2014-02-02 Thread Ewen Cheslack-Postava
getting the error message. It seems that foreachPartition doesn't exist as part of the DStream class :-\ I will check the API docs to find other alternatives.

Re: Persisting RDD to Redis

2014-02-02 Thread Ewen Cheslack-Postava
If you use anything created on the driver program within functions run on workers, it needs to be serializable, but your pool of Redis connections is not. Normally, the simple way to fix this is to use the *With methods of RDD (mapWith, flatMapWith, filterWith, and in this case, foreachWith) to
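The same per-partition idea can also be written with RDD.foreachPartition; a sketch assuming a Jedis-style Redis client and a SparkContext named sc (the host, port, and sample data are made up):

```scala
import redis.clients.jedis.Jedis

val records = sc.parallelize(Seq(("user:1", "alice"), ("user:2", "bob")))

records.foreachPartition { partition =>
  // The connection is created inside the task on the worker, so nothing
  // unserializable (like a driver-side connection pool) gets captured.
  val jedis = new Jedis("redis-host", 6379)
  try {
    partition.foreach { case (key, value) => jedis.set(key, value) }
  } finally {
    jedis.close()
  }
}
```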

Re: Running K-Means on a cluster setup

2014-01-22 Thread Ewen Cheslack-Postava
I think Mayur pointed to that code because it includes the relevant initialization code you were asking about. Running on a cluster doesn't require much change: pass the spark:// address of the master instead of "local" and add any jars containing your code. You could set the jars manually, but
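A sketch of that initialization against the Spark 0.8/0.9-era constructor (the master URL, Spark home, and jar path are placeholders):

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext(
  "spark://master-host:7077",           // standalone master URL instead of "local"
  "KMeansOnCluster",                    // application name
  "/path/to/spark",                     // SPARK_HOME on the cluster nodes
  Seq("target/my-kmeans-assembly.jar")  // jar(s) containing your application code
)
```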

Re: How does shuffle work in spark ?

2014-01-16 Thread Ewen Cheslack-Postava
Thanks Patrick and Ewen, great answers. So a shuffle dependency that can cause a shuffle will store the data in memory + disk, more often in memory. Is my understanding correct? Regards, SB

Re: How does shuffle work in spark ?

2014-01-16 Thread Ewen Cheslack-Postava
The difference between a shuffle dependency and a transformation that can cause a shuffle is probably worth pointing out. The mentioned transformations (groupByKey, join, etc) *might* generate a shuffle dependency on input RDDs, but they won't necessarily. For example, if you join() two RDDs t
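A sketch of the co-partitioned case, assuming a SparkContext named sc (the sample data is made up): when both inputs already share a partitioner, join() can use narrow dependencies instead of shuffling them.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD functions on pre-1.3 Spark APIs

val part = new HashPartitioner(8)

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(part).cache()
val scores = sc.parallelize(Seq((1, 10.0), (2, 7.5))).partitionBy(part).cache()

// Matching keys already live in the same partitions on both sides,
// so this join does not need to shuffle either input.
val joined = users.join(scores)
```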

Re: newbie : java.lang.OutOfMemoryError: Java heap space

2014-01-08 Thread Ewen Cheslack-Postava
From the stack trace, it looks like the driver program is dying trying to serialize data out to the workers. My guess is that whatever machine you're running from has a relatively small default maximum heap size and trying to broadcast the 49MB file is causing it to run out of memory. I don't

Re: Scala driver, Python workers?

2013-12-12 Thread Ewen Cheslack-Postava
Obviously it depends on what is missing, but if I were you, I'd try monkey patching pyspark with the functionality you need first (along with submitting a pull request, of course). The pyspark code is very readable, and a lot of functionality just builds on top of a few primitives, as in the Sc

Re: how do you deploy your spark application?

2013-11-22 Thread Ewen Cheslack-Postava
I build an uber jar with sbt. I used this https://github.com/ianoc/ExampleSparkProject as a starting point. This generates an uber jar with sbt-assembly and I just need to ship that one file to the master. Then I use Spark to ship it to the executors. I use the EC2 standalone script. Once I've got t
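Roughly what that setup looks like with sbt-assembly; the versions and coordinates below are illustrative (the 2013-era plugin used slightly different keys) and are not taken from the linked project:

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// build.sbt
name := "my-spark-job"
scalaVersion := "2.12.18"
// Mark Spark "provided" so the uber jar bundles only your code and its other
// dependencies; then build the single deployable jar with `sbt assembly`.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.1" % "provided"
```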

Re: Re: Spark Configuration with Python

2013-11-20 Thread Ewen Cheslack-Postava
You can use the SPARK_MEM environment variable instead of setting the system property. If you need to set other properties that can't be controlled by environment variables (which is why I wrote that patch), you can just apply that patch directly to your binary package -- it only patches a Python

Re: RAM question from the Shell

2013-11-19 Thread Ewen Cheslack-Postava
This line > 13/11/19 23:17:20 INFO MemoryStore: MemoryStore started with capacity 323.9 MB looks like what you'd get if you haven't set spark.executor.memory (or SPARK_MEM). Without setting it you'll get the default of 512m per executor and .66 of that for the cache. -Ewen
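In the pre-1.0 configuration style, that usually meant exporting SPARK_MEM or setting the property as a Java system property before the SparkContext is created; a sketch (the master URL and size are placeholders):

```scala
import org.apache.spark.SparkContext

// Must be set before the SparkContext is constructed (Spark 0.8/0.9-era config style).
System.setProperty("spark.executor.memory", "4g")
// Shell alternative: export SPARK_MEM=4g before launching spark-shell.

val sc = new SparkContext("spark://master-host:7077", "MemoryExample")
```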

Re: Job duration

2013-10-28 Thread Ewen Cheslack-Postava
Well, he did mention that not everything was staying in the cache, so even with an ongoing job they're probably re-reading from Cassandra. It sounds to me like the first issue to address is why things are being evicted. -Ewen - Ewen Cheslack-Postava StraightUp | http://readstraightu