I using " create table bigtable002 tblproperties('shark.cache'='tachyon')
as select * from bigtable001"  to create table bigtable002; while
bigtable001 is load from hdfs, it's format is text file ,  so i think
bigtable002's is text.


2014-05-26 11:14 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:

> What is the format of your input data, prior to insertion into Tachyon?
>
>
> On Sun, May 25, 2014 at 7:52 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>
>> I tried "set mapred.map.tasks=30", but it does not work; it seems Shark
>> does not support this setting.
>> I also tried "SET mapred.max.split.size=64000000", and it does not work
>> either.
>> Is there another way to control the task number in the Shark CLI?
>>
>>
>>
>> 2014-05-26 10:38 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>
>>> You can try setting "mapred.map.tasks" to get Hive to do the right thing.
>>>
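>>> If you were loading the data directly in a Spark shell rather than
>>> through Shark, the closest analogue I know of is the minimum split
>>> count argument to textFile. A minimal sketch, assuming sc is the
>>> shell's SparkContext and the path is hypothetical:
>>>
>>>   // ask for at least 30 input splits when reading the file
>>>   val rdd = sc.textFile("hdfs:///path/to/bigtable001", 30)
>>>   println(rdd.partitions.size)  // >= 30 if the input is splittable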
>>>
>>> On Sun, May 25, 2014 at 7:27 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>
>>>> Hi, Aaron, thanks for sharing.
>>>>
>>>> I am using Shark to execute queries, and the table is created on
>>>> Tachyon. I don't think I can use RDD#repartition() from the Shark CLI.
>>>> Does Shark support "SET mapred.max.split.size" to control file size?
>>>> If yes, I can control the file number after creating the table, and
>>>> thereby control the task number.
>>>> If not, does anyone know another way to control the task number in the
>>>> Shark CLI?
>>>>
>>>>
>>>> 2014-05-26 9:36 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>>>
>>>>> How many partitions are in your input data set? A possibility is that
>>>>> your input data has 10 unsplittable files, so you end up with 10
>>>>> partitions. You could improve this by using RDD#repartition().
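>>>>> As a rough sketch of what that looks like in a Spark shell (assuming
>>>>> rdd is a handle on your data; the target count of 40 is just for
>>>>> illustration):
>>>>>
>>>>>   // redistribute the 10 input partitions across 40 via a shuffle
>>>>>   val spread = rdd.repartition(40)
>>>>>   println(spread.partitions.size)  // 40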
>>>>>
>>>>> Note that mapPartitionsWithIndex is sort of the "main processing loop"
>>>>> for many Spark functions: it iterates through all the elements of the
>>>>> partition and does some computation (probably running your user code)
>>>>> on each one.
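>>>>> For reference, a toy sketch of the operation itself, assuming sc from
>>>>> a Spark shell:
>>>>>
>>>>>   val rdd = sc.parallelize(1 to 100, 4)
>>>>>   // visit each partition once, tagging elements with their partition index
>>>>>   val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
>>>>>     iter.map(n => s"partition $idx: $n")
>>>>>   }
>>>>>   tagged.take(3).foreach(println)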
>>>>>
>>>>> You can see the number of partitions in your RDD by visiting the Spark
>>>>> driver web interface. To access this, visit port 8080 on the host
>>>>> running your Standalone Master (assuming you're running standalone
>>>>> mode), which will have a link to the application web interface. The
>>>>> Tachyon master also has a useful web interface, available at port 19999.
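>>>>> If you have a handle on the RDD in a Spark shell, you can also check
>>>>> the count programmatically:
>>>>>
>>>>>   println(rdd.partitions.size)  // number of partitions in this RDD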
>>>>>
>>>>>
>>>>> On Sun, May 25, 2014 at 5:43 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Mayur, thanks for replying.
>>>>>> I know a Spark application takes all cores by default. My question is
>>>>>> how to set the task number on each core.
>>>>>> If one slice means one task, how can I set the slice file size?
>>>>>>
>>>>>>
>>>>>> 2014-05-23 16:37 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
>>>>>>
>>>>>>> How many cores do you see on your Spark master (port 8080)?
>>>>>>> By default a Spark application should take all cores when you launch
>>>>>>> it, unless you have set the max cores configuration.
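>>>>>>> For reference, a sketch of capping an application's cores with
>>>>>>> SparkConf (the master URL, app name, and core count here are made
>>>>>>> up):
>>>>>>>
>>>>>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>>>>>
>>>>>>>   // limit this application to 10 cores across the whole cluster
>>>>>>>   val conf = new SparkConf()
>>>>>>>     .setMaster("spark://master:7077")
>>>>>>>     .setAppName("example")
>>>>>>>     .set("spark.cores.max", "10")
>>>>>>>   val sc = new SparkContext(conf)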
>>>>>>>
>>>>>>>
>>>>>>> Mayur Rustagi
>>>>>>> Ph: +1 (760) 203 3257
>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>  @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 22, 2014 at 4:07 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> My aim in setting the task number is to increase the query speed,
>>>>>>>> and I have also found that "mapPartitionsWithIndex at
>>>>>>>> Operator.scala:333 <http://192.168.1.101:4040/stages/stage?id=17>"
>>>>>>>> is costing much time. So my other question is:
>>>>>>>> how to tune
>>>>>>>> mapPartitionsWithIndex <http://192.168.1.101:4040/stages/stage?id=17>
>>>>>>>> to bring its cost down?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-05-22 18:09 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>
>>>>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 " in
>>>>>>>>> shark-env.sh,
>>>>>>>>> but I find there are only 10 tasks on the cluster and 2 tasks on each
>>>>>>>>> machine.
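>>>>>>>>> (As far as I understand, spark.default.parallelism mainly governs
>>>>>>>>> shuffle stages; the stage that reads an HDFS file gets one task per
>>>>>>>>> input split regardless. A sketch with a hypothetical path:)
>>>>>>>>>
>>>>>>>>>   // 10 unsplittable files -> 10 tasks here, whatever parallelism says
>>>>>>>>>   val raw = sc.textFile("hdfs:///hypothetical/input")
>>>>>>>>>   // the shuffle stage uses spark.default.parallelism (40) partitions
>>>>>>>>>   val counts = raw.map(l => (l, 1)).reduceByKey(_ + _)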
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-05-22 18:07 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 "
>>>>>>>>>> in shark-env.sh.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-05-22 17:50 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> I am using Tachyon as the storage system and Shark to query a
>>>>>>>>>>> large table. I have 5 machines in a Spark cluster, with 4 cores on
>>>>>>>>>>> each machine.
>>>>>>>>>>> My questions are:
>>>>>>>>>>> 1. How do I set the task number on each core?
>>>>>>>>>>> 2. Where can I see how many partitions an RDD has?
