Hi Li, I've also found this setting confusing in the past. Take a look at this change -- do you think it makes the setting clearer?

https://github.com/apache/incubator-spark/pull/341/files
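For reference, here's a minimal sketch of how the setting behaves, assuming Spark 0.8.x-style configuration through Java system properties (the SparkConf API in 0.9 replaces this); the master URL and core counts below are made-up placeholders:

    import org.apache.spark.SparkContext

    object CoresMaxDemo {
      def main(args: Array[String]): Unit = {
        // spark.cores.max caps the TOTAL number of cores the application
        // may use across the whole cluster -- it is not a per-machine limit.
        // It must be set before the SparkContext is created.
        System.setProperty("spark.cores.max", "80") // e.g. 20 machines x 4 cores
        val sc = new SparkContext("spark://master:7077", "CoresMaxDemo")
        // Jobs submitted through sc are now granted at most 80 cores in total.
        sc.stop()
      }
    }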
Andrew

On Mon, Jan 6, 2014 at 8:19 AM, lihu <[email protected]> wrote:

> Sorry for my late reply; Gmail did not notify me.
>
> This problem was my fault. I took the config parameter spark.cores.max
> to be the maximum number of cores per machine, but it is in fact the
> total number across the cluster.
>
> Thank you very much, Andrew and Mayur -- your answers helped me
> understand the Spark system better.
>
> On Fri, Jan 3, 2014 at 2:28 AM, Mayur Rustagi <[email protected]> wrote:
>
>> Andrew, that's a good point. I have done that to handle a large number
>> of queries. Typically, to get good response times on a large number of
>> parallel queries, you want the data replicated across many systems.
>>
>> Regards
>> Mayur Rustagi
>> Ph: +919632149971
>> http://www.sigmoidanalytics.com
>> https://twitter.com/mayur_rustagi
>>
>> On Thu, Jan 2, 2014 at 11:22 PM, Andrew Ash <[email protected]> wrote:
>>
>>> That sounds right, Mayur.
>>>
>>> Also, in 0.8.1 I hear there's a new repartition method that you might
>>> be able to use to further distribute the data. But if your data is so
>>> small that it fits in just a couple of blocks, why are you using 20
>>> machines just to process a quarter GB of data? Is the computation on
>>> each piece extremely intensive?
>>>
>>> On Thu, Jan 2, 2014 at 12:39 PM, Mayur Rustagi <[email protected]> wrote:
>>>
>>>> I have experienced a similar issue. The easiest fix I found was to
>>>> increase the replication of the data used by the workers to the
>>>> number of workers you want to use for processing. The RDD seems to
>>>> be created on all the machines where the blocks are replicated.
>>>> Please correct me if I am wrong.
>>>>
>>>> Regards
>>>> Mayur Rustagi
>>>> Ph: +919632149971
>>>> http://www.sigmoidanalytics.com
>>>> https://twitter.com/mayur_rustagi
>>>>
>>>> On Thu, Jan 2, 2014 at 10:46 PM, Andrew Ash <[email protected]> wrote:
>>>>
>>>>> Hi lihu,
>>>>>
>>>>> Maybe the data you're accessing is in HDFS and only resides on 4 of
>>>>> your 20 machines because it's only about 4 blocks (at the default
>>>>> 64MB per block, that's around a quarter GB). Where is your source
>>>>> data located, and how is it stored?
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Thu, Jan 2, 2014 at 7:53 AM, lihu <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I run Spark on a cluster with 20 machines, but when I start an
>>>>>> application using the spark-shell, only 4 machines do any work; the
>>>>>> others just sit idle, with no memory or CPU used. I observed this
>>>>>> through the web UI.
>>>>>>
>>>>>> I wondered whether the other machines might be busy, so I watched
>>>>>> them using the "top" and "free" commands, but they are not.
>>>>>>
>>>>>> So why does Spark not assign work to all 20 machines? This is not
>>>>>> good resource usage.
>>>>>>
>>>>>
>>>>
>>>
>>
>
> --
> Best Wishes!
>
> Li Hu (李浒) | Graduate Student
> Institute for Interdisciplinary Information Sciences (IIIS,
> http://iiis.tsinghua.edu.cn/)
> Tsinghua University, China
>
> Email: [email protected]
> Tel: +86 15120081920
> Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/
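For readers finding this thread later: here is a sketch of the replication idea Mayur describes, assuming the Scala API (StorageLevel.MEMORY_ONLY_2 keeps each cached partition on two nodes; the master URL and HDFS path are placeholders). Mayur may equally mean raising the HDFS replication factor (hadoop fs -setrep), which has a similar locality effect at the storage layer.

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    object ReplicationDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("spark://master:7077", "ReplicationDemo")
        val data = sc.textFile("hdfs:///path/to/input")

        // Cache each partition on two nodes instead of one, so more
        // machines hold a local copy and can run tasks with data locality.
        val replicated = data.persist(StorageLevel.MEMORY_ONLY_2)
        println(replicated.count())

        sc.stop()
      }
    }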

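And a sketch of the repartition approach Andrew mentions (RDD.repartition was added in Spark 0.8.1; the target partition count of 40 is arbitrary, and the master URL and path are again placeholders):

    import org.apache.spark.SparkContext

    object RepartitionDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("spark://master:7077", "RepartitionDemo")

        // A quarter GB in HDFS at the default 64MB block size is only about
        // 4 blocks, hence ~4 input partitions and at most ~4 busy machines.
        val data = sc.textFile("hdfs:///path/to/input")

        // Shuffle into more partitions so tasks spread across the cluster.
        val spread = data.repartition(40) // e.g. 2 tasks each on 20 machines
        println(spread.count())

        sc.stop()
      }
    }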