Hi,
I have a cluster of 20 servers, each having 24 cores and 30GB of RAM allocated
to Spark. Spark runs in a STANDALONE mode.
I am trying to load some 200+GB files and cache the rows using ".cache()".
What I would like to do is the following (at the moment, from the Scala console):
- Evenly load the files across the 20 servers (preferably using all 20*24 cores
for the load)
- Verify that the data are loaded as NODE_LOCAL
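Roughly what I run from the shell, as a sketch (the partition count of 20*24 = 480, one per core, is just my starting assumption; `sc` is the shell's SparkContext and the HDFS path is elided):

```scala
// Run from the spark-shell, so `sc` is already defined.
// 20 servers * 24 cores = 480 cores; aim for at least one partition per core.
val numPartitions = 20 * 24

val rdd = sc.textFile("hdfs://...", numPartitions).cache()
rdd.count() // force an action so the cache is actually populated

// Rough check of how records are spread across partitions
val partitionSizes = rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
```

The locality level (NODE_LOCAL vs ANY) per task then shows up under the stage detail page on :4040.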
Looking at the :4040 console, in some runs I see a lot of NODE_LOCAL, but in
others a lot of ANY. Is there a way to identify what a task (TID) running at
locality level ANY is actually doing?
If I allocate less than roughly double the memory I actually need, I get an
OutOfMemoryError. If I use the minPartitions (Int) parameter of textFile,
i.e. sc.textFile("hdfs://...", 20),
then the error goes away.
On the other hand, if I allocate enough memory, I can see from the admin
console that some of my workers carry too much load while others carry less
than half. I understand that I could use a partitioner to balance the data, but
I wouldn't expect an OOME when some nodes are significantly under-used. Am I
missing something?
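For what it's worth, one variant I am considering (assuming spilling to disk is acceptable for my use case) is an explicit repartition plus MEMORY_AND_DISK instead of the plain .cache(), so partitions that don't fit are spilled rather than causing an OOME. A sketch, again with an elided HDFS path and an illustrative partition count:

```scala
import org.apache.spark.storage.StorageLevel

// Rebalance across all 480 cores, and spill to disk instead of
// failing with OutOfMemoryError when a node runs short of memory.
val rdd = sc.textFile("hdfs://...")
  .repartition(20 * 24)
  .persist(StorageLevel.MEMORY_AND_DISK)
rdd.count() // force the load

// Per-executor memory status: (maxMemory, remainingMemory) in bytes
sc.getExecutorMemoryStatus.foreach { case (host, (max, remaining)) =>
  println(s"$host: used ${(max - remaining) >> 20} MB of ${max >> 20} MB")
}
```

But I would still like to understand why the default partitioning leaves some workers so much more loaded than others.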
Thanks,
Ioannis Deligiannis