Hi, Aaron, thanks for sharing.

I am using Shark to execute queries, and the table is created on Tachyon.
I think I cannot use RDD#repartition() from the Shark CLI.
Does Shark support "SET mapred.max.split.size" to control the split size?
If yes, then after I create the table I can control the number of files,
and therefore the number of tasks.
If not, does anyone know another way to control the task number from the
Shark CLI?
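To be concrete, I mean issuing a Hive-style statement like the one below
in the Shark CLI before running the query (whether Shark honors it is
exactly my question; 134217728 bytes = 128 MB is just an example value):

  SET mapred.max.split.size=134217728;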


2014-05-26 9:36 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:

> How many partitions are in your input data set? A possibility is that your
> input data has 10 unsplittable files, so you end up with 10 partitions. You
> could improve this by using RDD#repartition().
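> As a minimal sketch (the Tachyon URL, path, and the target of 40
> partitions below are made-up examples; substitute your own):
>
>   val data = sc.textFile("tachyon://master:19998/path/to/table")
>   val repartitioned = data.repartition(40)  // redistribute into 40 partitions
>   println(repartitioned.partitions.size)    // verify the new partition count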
>
> Note that mapPartitionsWithIndex is sort of the "main processing loop" for
> many Spark functions. It iterates through all the elements of the
> partition and does some computation (probably running your user code) on
> them.
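> Roughly, it has this shape (a toy illustration over an arbitrary RDD
> called rdd, not Shark's actual operator code):
>
>   val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
>     // your "user code" runs here, once per element of the partition
>     iter.map(elem => (partitionIndex, elem))
>   }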
>
> You can see the number of partitions in your RDD by visiting the Spark
> driver web interface. To access this, visit port 8080 on the host running
> your Standalone Master (assuming you're running standalone mode), which
> will have a link to the application web interface. The Tachyon master also
> has a useful web interface, available at port 19999.
>
>
> On Sun, May 25, 2014 at 5:43 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>
>> Hi Mayur, thanks for replying.
>> I know a Spark application should take all cores by default. My question
>> is how to set the task number on each core.
>> If it is one task per slice, how can I set the slice file size?
>>
>>
>> 2014-05-23 16:37 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
>>
>>> How many cores do you see on your Spark master (port 8080)?
>>> By default a Spark application should take all cores when you launch it,
>>> unless you have set the max cores configuration.
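>>> (If you had set it, it would look something like the line below in
>>> shark-env.sh; spark.cores.max caps how many cores a standalone app
>>> takes, and 20 is just an example value:)
>>>
>>> SPARK_JAVA_OPTS+="-Dspark.cores.max=20 "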
>>>
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>>  @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Thu, May 22, 2014 at 4:07 PM, qingyang li 
>>> <liqingyang1...@gmail.com> wrote:
>>>
>>>> My aim in setting the task number is to increase query speed, and I
>>>> have also found that "mapPartitionsWithIndex at
>>>> Operator.scala:333<http://192.168.1.101:4040/stages/stage?id=17>"
>>>> is costing a lot of time. So my other question is:
>>>> how can I tune
>>>> mapPartitionsWithIndex<http://192.168.1.101:4040/stages/stage?id=17>
>>>> to bring its time down?
>>>>
>>>>
>>>>
>>>>
>>>> 2014-05-22 18:09 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>
>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 " in
>>>>> shark-env.sh,
>>>>> but I find there are only 10 tasks on the cluster, 2 tasks on each
>>>>> machine.
>>>>>
>>>>>
>>>>> 2014-05-22 18:07 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>
>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 " in
>>>>>> shark-env.sh.
>>>>>>
>>>>>>
>>>>>> 2014-05-22 17:50 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>
>>>>>>> I am using Tachyon as the storage system and Shark to query a
>>>>>>> table which is a big table. I have 5 machines in a Spark cluster,
>>>>>>> with 4 cores on each machine.
>>>>>>> My questions are:
>>>>>>> 1. How do I set the task number on each core?
>>>>>>> 2. Where can I see how many partitions an RDD has?
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
