Hi,

I am running a cluster of relatively large Workers (about 40GB heap each). I have 
noticed the following behaviour (although I cannot verify that this is what is 
actually happening):

1.       My process caches a lot of RDDs (Total: 25.3GB, Cached: 23.6GB).

2.       In the next task run, I cache another 10GB (in 0.5 GB blocks).

3.       From what I can see, each RDD-Evict is followed by a full GC and then 
by an RDD-Cache.


This basically kills performance because it happens 20 times (10GB / 0.5GB = 20). 
With a 40GB heap, workers spend minutes in GC.
I can see two possible (manual) solutions:

a.       Reduce worker size to ~2GB and launch 20 workers

b.      Manually evict RDDs before the next caching task.
Is there a way to configure eviction sizes (i.e. evict 10 RDDs each time 
instead of one)?
Any other comments are welcome.
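
For option (b), here is a minimal sketch of what I have in mind. The RDDs, their sizes, the app name and the local[2] master are placeholders for illustration, not my actual job. The idea is to unpersist the old RDDs in one go before the next caching task, so that the new blocks do not each trigger an evict. I have not verified that this avoids the repeated full GCs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ManualEvictionSketch {
  def main(args: Array[String]): Unit = {
    // App name and master are placeholders for illustration only.
    val conf = new SparkConf()
      .setAppName("manual-eviction-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Stand-in for the RDDs cached by the first task run.
    val oldRdds = (1 to 4).map { i =>
      sc.parallelize(1 to 100000).map(_ * i).persist(StorageLevel.MEMORY_ONLY)
    }
    oldRdds.foreach(_.count()) // an action forces the blocks into the cache

    // Release all of them together before the next caching task.
    // blocking = true waits until the blocks have actually been removed.
    oldRdds.foreach(_.unpersist(blocking = true))

    // Stand-in for the next task run, which now caches into freed storage.
    val next = sc.parallelize(1 to 100000).persist(StorageLevel.MEMORY_ONLY)
    next.count()

    sc.stop()
  }
}
```

This only helps if the job knows which RDDs are no longer needed; it does not change how Spark evicts blocks on its own.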

Regards,

Ioannis Deligiannis

