Hi, I have a cluster of 20 servers, each with 24 cores and 30GB of RAM allocated to Spark. Spark runs in standalone mode. I am trying to load some 200+GB of files and cache the rows using .cache().
What I would like to do is the following (at the moment from the Scala console):
- Evenly load the files across the 20 servers (preferably using all 20*24 cores for the load)
- Verify that the data are loaded as NODE_LOCAL

Looking at the :4040 console, in some runs I see a lot of NODE_LOCAL tasks, but in others a lot of ANY. Is there a way to identify what a TID in the ANY state is doing?

If I allocate less than roughly double the memory I need, I get an OutOfMemory error. If I use the minPartitions parameter of textFile, i.e. sc.textFile("hdfs://...", 20), then the error goes away. On the other hand, if I allocate enough memory, I can see from the admin console that some of my workers carry too much load while others carry less than half. I understand that I could use a partitioner to balance my data, but I wouldn't expect an OOME if nodes are significantly under-used. Am I missing something?

Thanks,
Ioannis Deligiannis
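To make the partitioning question concrete, here is a minimal sketch of how one might pick the minPartitions argument to sc.textFile so that no single partition has to hold more data than a task can comfortably cache. PartitionSizing, its method name, and the 128MB per-partition budget are all assumptions for illustration, not anything from the original post:

```scala
// Hypothetical helper: pick a partition count for sc.textFile so that
// (a) every core in the cluster gets at least one task, and
// (b) no partition is larger than a chosen per-task byte budget.
object PartitionSizing {
  // totalCores: cluster-wide core count (20 servers * 24 cores = 480 here)
  // inputBytes: total input size (200+GB in this case)
  // targetBytesPerPartition: per-partition budget (e.g. one 128MB HDFS block)
  def suggestedPartitions(totalCores: Int,
                          inputBytes: Long,
                          targetBytesPerPartition: Long): Int = {
    // Enough partitions that none exceeds the byte budget.
    val bySize = math.ceil(inputBytes.toDouble / targetBytesPerPartition).toInt
    // Never fewer partitions than cores, or some cores sit idle during the load.
    math.max(totalCores, bySize)
  }
}
```

With 200GB of input and a 128MB budget this suggests 1600 partitions, so a call might look like sc.textFile("hdfs://...", PartitionSizing.suggestedPartitions(480, 200L * 1024 * 1024 * 1024, 128L * 1024 * 1024)) rather than a hard-coded 20, which would make each partition ~10GB and could explain the OOME when caching.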