When I run "create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001 limit 400000;", 4 files are created on Tachyon. But when I run "create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001;", 35 files are created on Tachyon. So Spark/Shark evidently knows how to split files when creating a table, and it partitions the table into many parts on Tachyon. How does Spark/Shark split the table into parts? Could I control the splitting by setting some configuration, such as "map.split.size=64M"?
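The file count appears to track the partitioning of the query result: Spark writes one output file per RDD partition, so the LIMIT query presumably collapses the result to a few partitions while the full scan keeps one partition per input split. A minimal standalone Scala sketch of that relationship (not Shark's internal code; host names, paths, and counts are made up for illustration, and the Tachyon Hadoop client is assumed to be on the classpath):

    import org.apache.spark.SparkContext

    // One output file is written per RDD partition, so repartition()
    // before saving controls the number of files on Tachyon.
    val sc = new SparkContext("spark://master:7077", "file-count-sketch")
    val input = sc.textFile("hdfs://master:9000/warehouse/bigtable001")
    println("input partitions: " + input.partitions.length)
    input.repartition(4)                                      // force 4 partitions
      .saveAsTextFile("tachyon://master:19998/bigtable002")   // writes 4 files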
2014-05-27 16:59 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:

> When I run "create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001 limit 400000;", 4 files are created on Tachyon.
> But when I run "create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001;", 35 files are created on Tachyon.
> So I think Spark/Shark knows how to split files when creating a table. Could I control the splitting by setting some configuration, such as "map.split.size=64M"?
>
>
> 2014-05-26 12:14 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>
>> I used "create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001" to create table bigtable002. bigtable001 is loaded from HDFS and its format is text file, so I think bigtable002 is text as well.
>>
>>
>> 2014-05-26 11:14 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>
>>> What is the format of your input data, prior to insertion into Tachyon?
>>>
>>>
>>> On Sun, May 25, 2014 at 7:52 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>
>>>> I tried "set mapred.map.tasks=30"; it does not work, so it seems Shark does not support this setting.
>>>> I also tried "SET mapred.max.split.size=64000000"; it does not work either.
>>>> Is there another way to control the task number in the Shark CLI?
>>>>
>>>>
>>>> 2014-05-26 10:38 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>>>
>>>>> You can try setting "mapred.map.tasks" to get Hive to do the right thing.
>>>>>
>>>>>
>>>>> On Sun, May 25, 2014 at 7:27 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Aaron, thanks for sharing.
>>>>>> I am using Shark to execute queries, and the table is created on Tachyon, so I think I cannot use RDD#repartition() in the Shark CLI.
>>>>>> Does Shark support "SET mapred.max.split.size" to control file size? If yes, I can control the file number after creating the table, and with it the task number.
>>>>>> If not, does anyone know another way to control the task number in the Shark CLI?
>>>>>>
>>>>>>
>>>>>> 2014-05-26 9:36 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>>>>>
>>>>>>> How many partitions are in your input data set? A possibility is that your input data has 10 unsplittable files, so you end up with 10 partitions. You could improve this by using RDD#repartition().
>>>>>>>
>>>>>>> Note that mapPartitionsWithIndex is sort of the "main processing loop" for many Spark functions. It is iterating through all the elements of the partition and doing some computation (probably running your user code) on it.
>>>>>>>
>>>>>>> You can see the number of partitions in your RDD by visiting the Spark driver web interface. To access this, visit port 8080 on the host running your Standalone Master (assuming you're running standalone mode), which will have a link to the application web interface. The Tachyon master also has a useful web interface, available at port 19999.
>>>>>>>
>>>>>>>
>>>>>>> On Sun, May 25, 2014 at 5:43 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, Mayur, thanks for replying.
>>>>>>>> I know a Spark application should take all cores by default. My question is how to set the task number on each core.
>>>>>>>> If one slice makes one task, how can I set the slice file size?
>>>>>>>>
>>>>>>>> 2014-05-23 16:37 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
>>>>>>>>
>>>>>>>>> How many cores do you see on your Spark master (port 8080)?
>>>>>>>>> By default a Spark application should take all cores when you launch it, unless you have set the max-cores configuration.
>>>>>>>>>
>>>>>>>>> Mayur Rustagi
>>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 22, 2014 at 4:07 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> My aim in setting the task number is to increase the query speed, and I have also found that "mapPartitionsWithIndex at Operator.scala:333 <http://192.168.1.101:4040/stages/stage?id=17>" is costing much time. So my other question is: how do I tune mapPartitionsWithIndex to bring that cost down?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-05-22 18:09 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40" in shark-env.sh, but I find there are only 10 tasks on the cluster and 2 tasks on each machine.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-05-22 18:07 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40" in shark-env.sh.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-05-22 17:50 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am using Tachyon as the storage system and Shark to query a big table. I have 5 machines as a Spark cluster, with 4 cores on each machine.
>>>>>>>>>>>>> My questions are:
>>>>>>>>>>>>> 1. How do I set the task number on each core?
>>>>>>>>>>>>> 2. Where can I see how many partitions one RDD has?
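On the slice-size question in the thread: when reading HDFS text files directly through Spark (outside the Shark CLI), the old-API Hadoop FileInputFormat computes the split size as max(minSize, min(totalSize / numSplits, blockSize)). A hedged sketch of the two knobs that formula exposes, reusing the hypothetical sc and path from the first sketch above:

    // Raise mapred.min.split.size for fewer, larger slices, or pass a
    // minSplits hint to textFile() for more, smaller slices.
    sc.hadoopConfiguration.setLong("mapred.min.split.size", 64L * 1024 * 1024) // 64 MB floor
    val slices = sc.textFile("hdfs://master:9000/warehouse/bigtable001", 40)   // hint: at least 40 splits
    println("slices: " + slices.partitions.length)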
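And on the closing questions: besides the web interfaces Aaron points to (Standalone Master on port 8080, application UI on 4040, Tachyon master on 19999), the partition count of an RDD is also visible programmatically from the Spark shell. A small sketch, again with a hypothetical table path:

    // One task is launched per partition in each stage, and each core
    // runs one task at a time, so 5 machines x 4 cores execute at most
    // 20 tasks concurrently regardless of the partition count.
    val rdd = sc.textFile("tachyon://master:19998/bigtable002")
    println("partitions: " + rdd.partitions.length)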