That sounds right, Mayur. Also, in 0.8.1 I hear there's a new repartition method that you might be able to use to further distribute the data. But if your data is so small that it fits in just a couple of blocks, why are you using 20 machines just to process a quarter GB of data? Is the computation on each bit extremely intensive?
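If you want to try that, a rough sketch in the spark-shell might look like the following (untested, and assuming a SparkContext sc plus a placeholder input path):

    val data = sc.textFile("hdfs:///path/to/input")
    data.partitions.size              // a quarter-GB file likely gives ~4 partitions
    val spread = data.repartition(20) // new in 0.8.1: shuffle into 20 partitions
    spread.count()                    // tasks should now land on more machines

Keep in mind that repartition does a full shuffle, so it only pays off if the per-record work dominates the shuffle cost.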
On Thu, Jan 2, 2014 at 12:39 PM, Mayur Rustagi <[email protected]> wrote:

> I have experienced a similar issue. The easiest fix I found was to
> increase the replication of the data being used in the worker to the
> number of workers you want to use for processing. The RDDs seem to be
> created on all the machines where the blocks are replicated. Please
> correct me if I am wrong.
>
> Regards,
> Mayur
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
> On Thu, Jan 2, 2014 at 10:46 PM, Andrew Ash <[email protected]> wrote:
>
>> Hi lihu,
>>
>> Maybe the data you're accessing is in HDFS and only resides on 4 of
>> your 20 machines because it's only about 4 blocks (at the default
>> 64 MB per block, that's around a quarter GB). Where is your source
>> data located and how is it stored?
>>
>> Andrew
>>
>>
>> On Thu, Jan 2, 2014 at 7:53 AM, lihu <[email protected]> wrote:
>>
>>> Hi,
>>> I run Spark on a cluster with 20 machines, but when I start an
>>> application using the spark-shell, only 4 machines are working; the
>>> others just sit idle, with no memory or CPU used. I watched this
>>> through the web UI.
>>>
>>> I wondered whether the other machines might be busy, so I checked
>>> them using the "top" and "free" commands, but they are not.
>>>
>>> So I just wonder: why does Spark not assign work to all 20
>>> machines? This is not good resource usage.
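P.S. For lihu's original question, another option is to ask for more splits when the file is first loaded, rather than repartitioning afterwards. A small sketch, again with a placeholder path; the second argument to textFile is a hint for the minimum number of splits:

    val data = sc.textFile("hdfs:///path/to/input", 20)
    data.partitions.size  // should now be roughly 20 instead of ~4

That avoids the extra shuffle, since Hadoop cuts the file into smaller input splits up front instead of one split per HDFS block.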
