Re: Using Spark on Data size larger than Memory size

2014-06-11 Thread Allen Chang
Thanks. We've run into timeout issues at scale as well. We were able to
work around them by setting the following JVM options:

-Dspark.akka.askTimeout=300
-Dspark.akka.timeout=300
-Dspark.worker.timeout=300

NOTE: these JVM options *must* be set on the worker nodes (not just the
driver/master) for the settings to take effect.
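
In case it helps others: one way to apply these cluster-wide on a standalone
deployment is via SPARK_JAVA_OPTS in conf/spark-env.sh on every machine (a
sketch, not the only way; adjust the values to your workload):

  # conf/spark-env.sh, deployed to the master and every worker
  SPARK_JAVA_OPTS="-Dspark.akka.askTimeout=300 -Dspark.akka.timeout=300 -Dspark.worker.timeout=300"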

Allen


Re: Using Spark on Data size larger than Memory size

2014-06-10 Thread Allen Chang
Thanks for the clarification.

What is the proper way to configure RDDs when your aggregate data size
exceeds your available working memory? In particular, in addition to
typical operations, I'm performing cogroups, joins, and coalesces/shuffles.

I see that the default storage level for RDDs is MEMORY_ONLY. Do I just need
to set the storage level for all of my RDDs to something like
MEMORY_AND_DISK? Do I need to do anything else to get graceful behavior in
the presence of coalesces/shuffles, cogroups, and joins?
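
Concretely, is something like the following the right idea? (A sketch only;
the RDDs, data, and app name are placeholders, and local[2] is just so it
runs standalone.)

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._
  import org.apache.spark.storage.StorageLevel

  val sc = new SparkContext(
    new SparkConf().setAppName("spill-to-disk").setMaster("local[2]"))
  val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
  val right = sc.parallelize(Seq((1, "x"), (2, "y")))

  // Persist the shuffled result so partitions spill to disk instead of
  // failing when the working set no longer fits in memory.
  val joined = left.join(right).persist(StorageLevel.MEMORY_AND_DISK)
  joined.count()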

Thanks,
Allen


Monitoring Spark disassociated workers

2014-06-10 Thread Allen Chang
We're running into an issue where the master periodically loses connectivity
with workers in the Spark cluster. We believe the issue tends to manifest
when the cluster is under heavy load, but we're not entirely sure when it
happens. I've seen one or two other messages to this list about this issue,
but no one seems to have pinned down the actual bug.

So, to work around the issue, we'd like to programmatically monitor the
number of workers connected to the master and restart the cluster when the
master loses track of some of them. Any ideas on how to write such a health
check?
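
One idea we're considering: poll the standalone master's web UI, which (as
far as I can tell) serves cluster state as JSON at /json, and count the
workers reported ALIVE. A rough sketch of what I have in mind; the host,
port, and worker count are placeholders for our setup:

  import scala.io.Source

  object WorkerHealthCheck {
    val masterJsonUrl   = "http://master-host:8080/json"  // standalone master web UI
    val expectedWorkers = 8                               // placeholder cluster size

    // Crudely count workers the master reports as ALIVE; a real check
    // should parse the JSON rather than pattern-match on it.
    def aliveWorkers(): Int = {
      val json = Source.fromURL(masterJsonUrl).mkString
      "\"state\"\\s*:\\s*\"ALIVE\"".r.findAllIn(json).length
    }

    def main(args: Array[String]): Unit =
      if (aliveWorkers() < expectedWorkers)
        sys.error("lost workers -- restart the cluster")
  }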

Thanks,
Allen


RDD values and defensive copying

2014-05-23 Thread Allen Chang
Hi all,

I have a question: if I have an RDD containing mutable values, and I run a
function over the RDD that mutates those values in place, what happens?
What happens in the case of a cogroup? e.g.:

  inputRdd.cogroup(inputRdd2).flatMapValues(functionThatModifiesValues)

Will this result in undefined behavior? Is it a best practice then to make
sure functionThatModifiesValues() performs a defensive copy if necessary?
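
To make that concrete, here's the kind of defensive copy I mean (a sketch;
the mutable Counter type and the data are made up, and local[2] is just so
it runs standalone):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._

  // A made-up mutable value type, just for illustration.
  class Counter(var n: Int) extends Serializable

  val sc = new SparkContext(
    new SparkConf().setAppName("defensive-copy").setMaster("local[2]"))
  val left  = sc.parallelize(Seq((1, new Counter(1))))
  val right = sc.parallelize(Seq((1, new Counter(10))))

  val out = left.cogroup(right).flatMapValues { case (as, bs) =>
    // Copy each value before mutating it, so objects that Spark may cache
    // or hand to other consumers are left untouched.
    (as ++ bs).map { c =>
      val copy = new Counter(c.n)
      copy.n += 1
      copy
    }
  }
  out.collect()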

Thanks,
Allen