You can try setting "mapred.map.tasks" to get Hive to do the right thing.
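
For example, from the Shark CLI (a sketch I haven't tested here; whether
these Hadoop properties are honored depends on your Hive/Hadoop version and
input format, and the table name is a placeholder):

    -- a smaller max split size should mean more, smaller map tasks
    SET mapred.map.tasks=40;
    SET mapred.max.split.size=67108864;
    SELECT count(*) FROM your_table;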


On Sun, May 25, 2014 at 7:27 PM, qingyang li <liqingyang1...@gmail.com> wrote:

> Hi, Aaron, thanks for sharing.
>
> I am using Shark to execute queries, and the table is created on Tachyon.
> I don't think I can use RDD#repartition() in the Shark CLI.
> Does Shark support "SET mapred.max.split.size" to control file size?
> If yes, I can control the number of files after creating the table, and
> so control the number of tasks.
> If not, does anyone know another way to control the task number in the
> Shark CLI?
>
>
> 2014-05-26 9:36 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>
> How many partitions are in your input data set? A possibility is that your
>> input data has 10 unsplittable files, so you end up with 10 partitions. You
>> could improve this by using RDD#repartition().
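>>
>> For example, from spark-shell (a minimal sketch; the Tachyon URI and the
>> partition count 40 are placeholders, not values from your setup):
>>
>>   val rdd = sc.textFile("tachyon://master:19998/path/to/table")
>>   println(rdd.partitions.size)      // current number of partitions
>>   val wider = rdd.repartition(40)   // shuffle the data into 40 partitions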
>>
>> Note that mapPartitionsWithIndex is sort of the "main processing loop"
>> for many Spark functions. It iterates through all the elements of the
>> partition and does some computation (probably running your user code) on
>> them.
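>>
>> As a toy illustration of the API itself (not the code Shark generates):
>>
>>   val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
>>     // runs once per partition; iterates over every element in it
>>     iter.map(element => (idx, element))
>>   }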
>>
>> You can see the number of partitions in your RDD by visiting the Spark
>> driver web interface. To access this, visit port 8080 on the host running
>> your Standalone Master (assuming you're running standalone mode), which
>> will have a link to the application web interface. The Tachyon master also
>> has a useful web interface, available at port 19999.
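>>
>> (Programmatically, rdd.partitions.size in spark-shell gives the same
>> count, where rdd stands for whichever RDD you care about.)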
>>
>>
>> On Sun, May 25, 2014 at 5:43 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>
>>> Hi, Mayur, thanks for replying.
>>> I know a Spark application should take all cores by default. My question
>>> is how to set the number of tasks on each core.
>>> If one slice means one task, how can I set the slice file size?
>>>
>>>
>>> 2014-05-23 16:37 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
>>>
>>>> How many cores do you see on your Spark master (port 8080)?
>>>> By default a Spark application should take all cores when you launch
>>>> it, unless you have set the max cores configuration.
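>>>>
>>>> (That cap is spark.cores.max, e.g. -Dspark.cores.max=20 in
>>>> SPARK_JAVA_OPTS, where 20 is just an illustrative value.)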
>>>>
>>>>
>>>> Mayur Rustagi
>>>> Ph: +1 (760) 203 3257
>>>> http://www.sigmoidanalytics.com
>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>
>>>>
>>>>
>>>> On Thu, May 22, 2014 at 4:07 PM, qingyang li
>>>> <liqingyang1...@gmail.com> wrote:
>>>>
>>>>> My aim in setting the task number is to increase query speed, and I
>>>>> have also found that "mapPartitionsWithIndex at
>>>>> Operator.scala:333 <http://192.168.1.101:4040/stages/stage?id=17>"
>>>>> is costing much time. So my other question is: how to tune
>>>>> mapPartitionsWithIndex <http://192.168.1.101:4040/stages/stage?id=17>
>>>>> to bring that time down?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2014-05-22 18:09 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>
>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 " in
>>>>>> shark-env.sh, but I find there are only 10 tasks on the cluster and
>>>>>> 2 tasks on each machine.
>>>>>>
>>>>>>
>>>>>> 2014-05-22 18:07 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>
>>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 " in
>>>>>>> shark-env.sh.
>>>>>>>
>>>>>>>
>>>>>>> 2014-05-22 17:50 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>
>>>>>>>> I am using Tachyon as the storage system and Shark to query a big
>>>>>>>> table. I have 5 machines in a Spark cluster, and there are 4 cores
>>>>>>>> on each machine.
>>>>>>>> My questions are:
>>>>>>>> 1. How do I set the task number on each core?
>>>>>>>> 2. Where can I see how many partitions an RDD has?
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
