Hi Aaron, thanks for sharing. I am using Shark to execute the query, and the table is created on Tachyon. I don't think I can use RDD#repartition() from the Shark CLI. Does Shark support "SET mapred.max.split.size" to control the split size? If yes, I can control the number of files when I create the table, and through that the number of tasks. If not, does anyone know another way to control the task number from the Shark CLI?
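For what it's worth, Shark generally accepts Hive-style SET commands at the CLI, so split-size tuning along these lines may work. This is a sketch, not a confirmed answer: the property names come from Hive's CombineHiveInputFormat, the byte values are arbitrary, and `my_big_table` is a placeholder; whether these settings take effect depends on the input format Shark uses for your table.

```sql
-- Hedged sketch: Hive-style split tuning from the Shark CLI.
-- Combining small files into larger splits reduces the split count,
-- and therefore the number of map tasks.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size=134217728;   -- 128 MB upper bound per split
SET mapred.min.split.size=67108864;    -- 64 MB lower bound per split

-- Re-run the query; fewer, larger splits usually mean fewer tasks.
SELECT COUNT(*) FROM my_big_table;     -- my_big_table is a placeholder name
```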
2014-05-26 9:36 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:

> How many partitions are in your input data set? A possibility is that your
> input data has 10 unsplittable files, so you end up with 10 partitions. You
> could improve this by using RDD#repartition().
>
> Note that mapPartitionsWithIndex is sort of the "main processing loop" for
> many Spark functions. It is iterating through all the elements of the
> partition and doing some computation (probably running your user code) on
> it.
>
> You can see the number of partitions in your RDD by visiting the Spark
> driver web interface. To access this, visit port 8080 on the host running
> your Standalone Master (assuming you're running standalone mode), which
> will have a link to the application web interface. The Tachyon master also
> has a useful web interface, available at port 19999.
>
> On Sun, May 25, 2014 at 5:43 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>
>> hi, Mayur, thanks for replying.
>> I know a Spark application should take all cores by default. My question
>> is how to set the task number on each core.
>> If one slice means one task, how can I set the slice file size?
>>
>> 2014-05-23 16:37 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
>>
>>> How many cores do you see on your Spark master (port 8080)?
>>> By default a Spark application should take all cores when you launch it,
>>> unless you have set a max-cores configuration.
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>> On Thu, May 22, 2014 at 4:07 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>
>>>> my aim in setting the task number is to increase the query speed, and I
>>>> have also found that "mapPartitionsWithIndex at
>>>> Operator.scala:333 <http://192.168.1.101:4040/stages/stage?id=17>"
>>>> is costing much time.
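Aaron's RDD#repartition() suggestion applies when driving Spark directly (e.g. from spark-shell) rather than through the Shark CLI. A minimal sketch for a Spark 1.x-era shell session, assuming `sc` is the shell's SparkContext and with a placeholder Tachyon path and an assumed target of 40 partitions (5 machines x 4 cores x 2 waves):

```scala
// Sketch for spark-shell; the tachyon:// path below is a placeholder.
val lines = sc.textFile("tachyon://master:19998/my_table")
println(s"partitions before: ${lines.partitions.length}")

// repartition() shuffles the data into the requested number of partitions,
// so unsplittable input files no longer cap the task count.
val balanced = lines.repartition(40)
println(s"partitions after: ${balanced.partitions.length}")
```

Note that repartition() incurs a full shuffle; if you only want to reduce the partition count, RDD#coalesce() can avoid that cost.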
>>>> so, my other question is: how to tune
>>>> mapPartitionsWithIndex <http://192.168.1.101:4040/stages/stage?id=17>
>>>> to bring its cost down?
>>>>
>>>> 2014-05-22 18:09 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>
>>>>> i have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 " in
>>>>> shark-env.sh, but i find there are only 10 tasks on the cluster and
>>>>> 2 tasks on each machine.
>>>>>
>>>>> 2014-05-22 18:07 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>
>>>>>> i have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 " in
>>>>>> shark-env.sh
>>>>>>
>>>>>> 2014-05-22 17:50 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>
>>>>>>> i am using tachyon as the storage system and shark to query a
>>>>>>> table which is a big table. i have 5 machines as a spark cluster;
>>>>>>> there are 4 cores on each machine.
>>>>>>> My questions are:
>>>>>>> 1. how to set the task number on each core?
>>>>>>> 2. where to see how many partitions one RDD has?
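One likely explanation for the "40 set, 10 seen" observation in the quoted thread: spark.default.parallelism sets the default number of partitions for shuffle (reduce) stages, while the number of map tasks reading an input is still the number of input splits. With 10 unsplittable files you get 10 map tasks regardless of this setting. A shark-env.sh sketch (the value 40 is just the figure from the thread, not a recommendation):

```shell
# shark-env.sh sketch (bash syntax; += requires bash, not POSIX sh).
# spark.default.parallelism governs default shuffle partition counts only;
# map-task count still equals the number of input splits, which is why
# setting 40 here can still yield only 10 map tasks over 10 files.
export SPARK_JAVA_OPTS+=" -Dspark.default.parallelism=40 "
```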