How many partitions are in your input data set? One possibility is that your
input data is 10 unsplittable files, so you end up with 10 partitions. You
could improve this by calling RDD#repartition(), as in the sketch below.
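
For example, a rough spark-shell sketch (the input path and the target of 40
partitions are made up for illustration; your real RDD would come from your
Shark/Tachyon table):

  val rdd = sc.textFile("hdfs:///path/to/unsplittable/files")  // e.g. 10 gzipped files -> 10 partitions
  val repartitioned = rdd.repartition(40)                      // shuffle the data into 40 partitions = 40 tasks
  println(repartitioned.partitions.length)                     // should print 40

repartition() performs a shuffle, so it has an up-front cost, but it lets the
following stages use all of your cores.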

Note that mapPartitionsWithIndex acts as the "main processing loop" for many
Spark operations: it iterates through all the elements of a partition and
runs some computation (most likely your user code) on each of them.
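
Conceptually it looks something like this rough sketch; the per-element work
here is just a placeholder for whatever your query actually does:

  val perPartitionCounts = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
    // Placeholder computation standing in for your user code / query operators.
    val lengths = iter.map(line => line.length)
    Iterator((partitionIndex, lengths.sum))   // one output record per partition
  }
  perPartitionCounts.collect().foreach { case (i, total) =>
    println("partition " + i + " -> " + total)
  }

So the time the UI charges to mapPartitionsWithIndex is usually your own
per-element work rather than Spark overhead; reducing that work, or spreading
it over more partitions and cores, is what brings the stage time down.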

You can see the number of partitions in your RDD by visiting the Spark
driver web interface. To access this, visit port 8080 on the host running
your Standalone Master (assuming you're running standalone mode); that page
links to each application's web interface. The Tachyon master also has a
useful web interface, available on port 19999.
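
If you prefer to check from the spark-shell instead, a quick sketch (assuming
you already have a handle on the RDD, called rdd here):

  println(rdd.partitions.length)   // number of partitions = number of tasks in that stage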


On Sun, May 25, 2014 at 5:43 PM, qingyang li <liqingyang1...@gmail.com> wrote:

> Hi Mayur, thanks for replying.
> I know a Spark application should take all cores by default. My question is
> how to set the task number on each core?
> If one slice means one task, how can I set the slice file size?
>
>
> 2014-05-23 16:37 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
>
>> How many cores do you see on your Spark master (port 8080)?
>> By default a Spark application should take all cores when you launch it,
>> unless you have set the max cores configuration.
>>
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>>  @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Thu, May 22, 2014 at 4:07 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>
>>> My aim in setting the task number is to increase query speed, and I have
>>> also found that "mapPartitionsWithIndex at
>>> Operator.scala:333<http://192.168.1.101:4040/stages/stage?id=17>"
>>> is costing a lot of time. So my other question is:
>>> how to tune
>>> mapPartitionsWithIndex<http://192.168.1.101:4040/stages/stage?id=17>
>>> to bring that time down?
>>>
>>>
>>>
>>>
>>> 2014-05-22 18:09 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>
>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40" in
>>>> shark-env.sh, but I find there are only 10 tasks on the cluster and
>>>> 2 tasks on each machine.
>>>>
>>>>
>>>> 2014-05-22 18:07 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>
>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40" in
>>>>> shark-env.sh
>>>>>
>>>>>
>>>>> 2014-05-22 17:50 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>
>>>>>> I am using Tachyon as the storage system and Shark to query a table
>>>>>> which is a big table. I have 5 machines in a Spark cluster; there are
>>>>>> 4 cores on each machine.
>>>>>> My questions are:
>>>>>> 1. how to set the task number on each core?
>>>>>> 2. where to see how many partitions one RDD has?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
