Andrew, that's a good point. I have done that when handling a large number of queries. Typically, to get good response times on a large number of queries in parallel, you want the data replicated across a lot of machines.

Regards,
Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi
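For anyone who wants to try this replication approach, here is a minimal sketch of caching an RDD with a replicated storage level; the master URL, input path, and application name are illustrative assumptions, not values from this thread:

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    // Illustrative master URL and input path (not from this thread).
    val sc = new SparkContext("spark://master:7077", "replication-demo")
    val queries = sc.textFile("hdfs:///data/queries.txt")

    // MEMORY_ONLY_2 keeps each cached partition on two nodes; for more
    // replicas, construct StorageLevel(useDisk, useMemory, deserialized, n).
    val cached = queries.persist(StorageLevel.MEMORY_ONLY_2)
    cached.count()  // materialize the cache so later queries hit the replicas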
On Thu, Jan 2, 2014 at 11:22 PM, Andrew Ash <[email protected]> wrote:

> That sounds right, Mayur.
>
> Also, in 0.8.1 I hear there's a new repartition method that you might be
> able to use to further distribute the data. But if your data is so small
> that it fits in just a couple of blocks, why are you using 20 machines
> just to process a quarter GB of data? Is the computation on each bit
> extremely intensive?
>
> On Thu, Jan 2, 2014 at 12:39 PM, Mayur Rustagi <[email protected]> wrote:
>
>> I have experienced a similar issue. The easiest fix I found was to
>> increase the replication of the data used by the workers to the number
>> of workers you want to use for processing. The RDD seems to be created
>> on all the machines where the blocks are replicated. Please correct me
>> if I am wrong.
>>
>> Regards,
>> Mayur
>>
>> Mayur Rustagi
>> Ph: +919632149971
>> http://www.sigmoidanalytics.com
>> https://twitter.com/mayur_rustagi
>>
>> On Thu, Jan 2, 2014 at 10:46 PM, Andrew Ash <[email protected]> wrote:
>>
>>> Hi lihu,
>>>
>>> Maybe the data you're accessing is in HDFS and only resides on 4 of
>>> your 20 machines because it's only about 4 blocks (at the default
>>> 64 MB per block, that's around a quarter GB). Where is your source
>>> data located, and how is it stored?
>>>
>>> Andrew
>>>
>>> On Thu, Jan 2, 2014 at 7:53 AM, lihu <[email protected]> wrote:
>>>
>>>> Hi,
>>>> I run Spark on a cluster with 20 machines, but when I start an
>>>> application using the spark-shell, only 4 machines do any work; the
>>>> others just sit idle, with no memory or CPU used. I observed this
>>>> through the web UI.
>>>>
>>>> I wondered whether the other machines might be busy, so I checked
>>>> them with the "top" and "free" commands, but they are not.
>>>>
>>>> *So I just wonder: why doesn't Spark assign work to all 20 machines?
>>>> This is not good resource usage.*
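Putting the two suggestions from the thread together, here is a minimal sketch, assuming the sc provided by spark-shell; the input path, partition count, and expensiveComputation function are hypothetical:

    // Raising the HDFS replication of the input (Mayur's suggestion) is
    // done outside Spark, e.g.:
    //   hadoop fs -setrep -w 20 /data/input.txt    (illustrative path)

    // Inside Spark, repartition() (added in 0.8.1, as Andrew notes)
    // reshuffles a small input across more partitions so that every
    // worker gets a slice:
    val input = sc.textFile("hdfs:///data/input.txt")  // ~4 HDFS blocks
    val spread = input.repartition(20)                 // one slice per machine
    spread.map(expensiveComputation).count()           // hypothetical function

Note the trade-off Andrew raises: repartition incurs a shuffle of the whole dataset, which only pays off if the per-record computation is expensive relative to the data size.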
