When I run "create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001 limit 400000;", 4 files are created on Tachyon. But when I run "create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001;", 35 files are created on Tachyon. So Spark/Shark evidently knows how to split files when creating a table, and it partitions the table into many parts on Tachyon. How does Spark/Shark split the table into parts? Could I control the splitting by setting some configuration, such as "map.split.size=64M"?
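The file count appears to track the partitioning of the query result: Spark writes one output file per RDD partition, so the LIMIT query presumably collapses the result to a few partitions while the full scan keeps one partition per input split. A minimal standalone Scala sketch of that relationship (not Shark's internal code; host names, paths, and counts are made up for illustration, and the Tachyon Hadoop client is assumed to be on the classpath):

    import org.apache.spark.SparkContext

    // One output file is written per RDD partition, so repartition()
    // before saving controls the number of files on Tachyon.
    val sc = new SparkContext("spark://master:7077", "file-count-sketch")
    val input = sc.textFile("hdfs://master:9000/warehouse/bigtable001")
    println("input partitions: " + input.partitions.length)
    input.repartition(4)                                      // force 4 partitions
      .saveAsTextFile("tachyon://master:19998/bigtable002")   // writes 4 files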
2014-05-27 16:59 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:

> When I run "create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001 limit 400000;", 4 files are created on Tachyon.
> But when I run "create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001;", 35 files are created on Tachyon.
> So I think Spark/Shark knows how to split files when creating a table. Could I control the splitting by setting some configuration, such as "map.split.size=64M"?
>
>
> 2014-05-26 12:14 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>
>> I used "create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001" to create table bigtable002. bigtable001 is loaded from HDFS and its format is text file, so I think bigtable002 is text as well.
>>
>>
>> 2014-05-26 11:14 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>
>>> What is the format of your input data, prior to insertion into Tachyon?
>>>
>>>
>>> On Sun, May 25, 2014 at 7:52 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>
>>>> I tried "set mapred.map.tasks=30"; it does not work, so it seems Shark does not support this setting.
>>>> I also tried "SET mapred.max.split.size=64000000"; it does not work either.
>>>> Is there another way to control the task number in the Shark CLI?
>>>>
>>>>
>>>> 2014-05-26 10:38 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>>>
>>>>> You can try setting "mapred.map.tasks" to get Hive to do the right thing.
>>>>>
>>>>>
>>>>> On Sun, May 25, 2014 at 7:27 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Aaron, thanks for sharing.
>>>>>> I am using Shark to execute queries, and the table is created on Tachyon, so I think I cannot use RDD#repartition() in the Shark CLI.
>>>>>> Does Shark support "SET mapred.max.split.size" to control file size? If yes, I can control the file number after creating the table, and with it the task number.
>>>>>> If not, does anyone know another way to control the task number in the Shark CLI?
>>>>>>
>>>>>>
>>>>>> 2014-05-26 9:36 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>>>>>
>>>>>>> How many partitions are in your input data set? A possibility is that your input data has 10 unsplittable files, so you end up with 10 partitions. You could improve this by using RDD#repartition().
>>>>>>>
>>>>>>> Note that mapPartitionsWithIndex is sort of the "main processing loop" for many Spark functions. It is iterating through all the elements of the partition and doing some computation (probably running your user code) on it.
>>>>>>>
>>>>>>> You can see the number of partitions in your RDD by visiting the Spark driver web interface. To access this, visit port 8080 on the host running your Standalone Master (assuming you're running standalone mode), which will have a link to the application web interface. The Tachyon master also has a useful web interface, available at port 19999.
>>>>>>>
>>>>>>>
>>>>>>> On Sun, May 25, 2014 at 5:43 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, Mayur, thanks for replying.
>>>>>>>> I know a Spark application should take all cores by default. My question is how to set the task number on each core.
>>>>>>>> If one slice makes one task, how can I set the slice file size?
>>>>>>>>
>>>>>>>> 2014-05-23 16:37 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
>>>>>>>>
>>>>>>>>> How many cores do you see on your Spark master (port 8080)?
>>>>>>>>> By default a Spark application should take all cores when you launch it, unless you have set the max-cores configuration.
>>>>>>>>>
>>>>>>>>> Mayur Rustagi
>>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 22, 2014 at 4:07 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> My aim in setting the task number is to increase the query speed, and I have also found that "mapPartitionsWithIndex at Operator.scala:333 <http://192.168.1.101:4040/stages/stage?id=17>" is costing much time. So my other question is: how do I tune mapPartitionsWithIndex to bring that cost down?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-05-22 18:09 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40" in shark-env.sh, but I find there are only 10 tasks on the cluster and 2 tasks on each machine.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-05-22 18:07 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40" in shark-env.sh.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-05-22 17:50 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am using Tachyon as the storage system and Shark to query a big table. I have 5 machines as a Spark cluster, with 4 cores on each machine.
>>>>>>>>>>>>> My questions are:
>>>>>>>>>>>>> 1. How do I set the task number on each core?
>>>>>>>>>>>>> 2. Where can I see how many partitions one RDD has?
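On the slice-size question in the thread: when reading HDFS text files directly through Spark (outside the Shark CLI), the old-API Hadoop FileInputFormat computes the split size as max(minSize, min(totalSize / numSplits, blockSize)). A hedged sketch of the two knobs that formula exposes, reusing the hypothetical sc and path from the first sketch above:

    // Raise mapred.min.split.size for fewer, larger slices, or pass a
    // minSplits hint to textFile() for more, smaller slices.
    sc.hadoopConfiguration.setLong("mapred.min.split.size", 64L * 1024 * 1024) // 64 MB floor
    val slices = sc.textFile("hdfs://master:9000/warehouse/bigtable001", 40)   // hint: at least 40 splits
    println("slices: " + slices.partitions.length)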
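And on the closing questions: besides the web interfaces Aaron points to (Standalone Master on port 8080, application UI on 4040, Tachyon master on 19999), the partition count of an RDD is also visible programmatically from the Spark shell. A small sketch, again with a hypothetical table path:

    // One task is launched per partition in each stage, and each core
    // runs one task at a time, so 5 machines x 4 cores execute at most
    // 20 tasks concurrently regardless of the partition count.
    val rdd = sc.textFile("tachyon://master:19998/bigtable002")
    println("partitions: " + rdd.partitions.length)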