Hi Dechang,

Thanks very much for your help!

I'm a little confused here: why does the skew exist in the first place? After gathering some statistics I got this: 1516 files, 102.54MB on average, max 104MB, min 95MB. On a system with 72 compute resources, the input is fairly evenly distributed over the three nodes: Server80: 520 files, Server83: 486 files, Server166: 509 files. If handling one file takes roughly the same time for each resource/minor fragment/slot, why is there such a big skew? Is there any way to find out where it comes from and eliminate it? My HDFS block size is 256MB and my files are about 100MB on average, so perhaps this is not small IO? Two quick checks I plan to run are sketched below.
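First, a back-of-the-envelope calculation (a minimal Python sketch, just to keep the arithmetic straight) of how the 1516 files map onto the 51 minor fragments and what per-HDD throughput the 5m14s max scan time implies; all numbers are taken from this thread:

# Rough sanity check of the numbers in this thread (not a measurement):
# how many ~100MB files each minor fragment gets at 51-way parallelism,
# and what per-HDD scan rate the 5m14s max scan process time implies.
total_files = 1516
avg_file_mb = 102.54
parallelism = 51            # planner.width.max_per_node = 17 x 3 nodes
disks = 3 * 7               # 3 datanodes x 7 HDDs each
scan_seconds = 5 * 60 + 14  # max PARQUET_ROW_GROUP_SCAN process time (5m14s)

files_per_fragment = total_files / float(parallelism)
total_mb = total_files * avg_file_mb
print("files per minor fragment: %.1f" % files_per_fragment)   # ~29.7
print("implied scan rate: %.1f MB/s per HDD"
      % (total_mb / scan_seconds / disks))                      # ~24, as in your estimate

Since each minor fragment only gets about 30 files, a difference of a few files (or a few unusually large row groups) per fragment could already explain part of the batch-count spread. On the placement side, `hdfs fsck /path/to/lineitem -files -blocks -locations` (the path is a placeholder for my layout) should list which datanode holds each block replica, and `hdfs getconf -confKey dfs.blocksize` confirms the 256MB block size.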
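Second, to see exactly how uneven the scan work is per minor fragment, I'll try to pull the per-fragment record/batch counts for major fragment 03 out of the JSON profile I attached earlier. A rough Python sketch is below; the file name, the field names (fragmentProfile, minorFragmentProfile, operatorProfile, inputProfile, processNanos) and the assumption that operatorId 6 is the 03-xx-06 scan are all from my memory of the profile JSON, so they may need adjusting:

import json

# Sketch: per-minor-fragment records/batches/process time for the scan
# operator in major fragment 03, read from a saved Drill query profile.
# Field names, the profile.json file name, and operatorId 6 are assumptions
# based on the attached profile; adjust them against the actual JSON.
with open("profile.json") as f:
    profile = json.load(f)

for major in profile.get("fragmentProfile", []):
    if major.get("majorFragmentId") != 3:
        continue
    for minor in major.get("minorFragmentProfile", []):
        for op in minor.get("operatorProfile", []):
            if op.get("operatorId") != 6:   # assumed: 03-xx-06 PARQUET_ROW_GROUP_SCAN
                continue
            streams = op.get("inputProfile", [])
            records = sum(s.get("records", 0) for s in streams)
            batches = sum(s.get("batches", 0) for s in streams)
            print("minor %03d: %12d records %8d batches %7.1fs process"
                  % (minor.get("minorFragmentId", -1), records, batches,
                     op.get("processNanos", 0) / 1e9))

If the record counts turn out to be close but the process times are not, the imbalance is more likely disk/placement related than an uneven file assignment.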
On Wed, May 25, 2016 at 8:32 AM, Dechang Gu <[email protected]> wrote:

> Hi Yijie,
> Thanks for the profile. From the Operator Profile overview, it looks like
> 03-xx-02 HASH_AGGREGATE and 03-xx-06 PARQUET_ROW_GROUP_SCAN took most of
> the time:
>
> Operator                          Setup (min/avg/max)    Process (min/avg/max)    Wait (min/avg/max)      Peak Mem (avg/max)
> 03-xx-02 HASH_AGGREGATE           0.020s 0.083s 0.213s   1m06s 1m55s *3m12s*      0.000s 0.000s 0.000s    16MB 16MB
> 03-xx-03 PROJECT                  0.004s 0.009s 0.024s   6.539s 11.249s 16.504s   0.000s 0.000s 0.000s    536KB 536KB
> 03-xx-04 SELECTION_VECTOR_REMOVER 0.002s 0.004s 0.014s   34.456s 58.595s 1m26s    0.000s 0.000s 0.000s    664KB 664KB
> 03-xx-05 FILTER                   0.001s 0.003s 0.009s   2.556s 4.355s 6.543s     0.000s 0.000s 0.000s    8MB 8MB
> 03-xx-06 PARQUET_ROW_GROUP_SCAN   0.000s 0.000s 0.000s   2m13s 3m40s *5m14s*      5.392s 6.372s 7.371s    12MB 12MB
>
> For the SCAN, there is some skew among the batches/records processed by
> the minor fragments: from ~5000 batches to 15000+ batches. The 5m process
> times are for the bigger batches.
>
> Similar skew also shows up in 03-xx-02. If the load can be balanced among
> the fragments, it may speed up the query a little bit.
>
> On the other hand, the 10-minute query time is not that far out of line,
> considering that you have only 3 nodes and each node has only 7 HDDs. I did
> a similar run of the query on my cluster (10 nodes, each with 32 cores and
> 23 HDDs). For 03-xx-02 and 03-xx-06, the process time per record is about
> the same as what you got, even though the total query time is much shorter
> because it runs on a much bigger system.
>
> Another note: for your test, disk IO is about ~24MB/s/HDD
> (150GB/300s/21HDD), which is about the IO limit if there are a lot of small
> IOs (what is the chunk size of your HDFS?)
>
> Hope this helps, let me know if you have any further questions.
>
> Thanks,
> Dechang
>
>
> On Mon, May 23, 2016 at 7:28 PM, Yijie Shen <[email protected]> wrote:
>
>> Hi Dechang, Thanks for your reply.
>>
>> I've changed the parallelism of each node from 17 to 24 but have seen no
>> performance improvement.
>>
>> The 1516 files are distributed across an HDFS cluster of 3 datanodes, the
>> same 3 servers where I start the Drillbits. For each datanode, 7 HDDs are
>> configured as dfs.datanode.data.dir.
>> The data was generated with the TPCH dbgen tool and loaded into parquet
>> format using Apache Spark; the files are 101MB on average, with no skew
>> seen.
>>
>> I've pasted the `Full json profile` content and attached it to this mail.
>>
>> Thank you.
>>
>> On Mon, May 23, 2016 at 11:12 PM, Dechang Gu <[email protected]> wrote:
>>
>>> Hi Yijie,
>>> This is Dechang at MapR. I work on Drill performance.
>>>
>>> From what you described, it looks like the scan took most of the time.
>>> How are the files distributed on the disks, is there any skew?
>>> How many disks are there?
>>> If possible, can you provide the profile for the run?
>>>
>>> Thanks,
>>> Dechang
>>>
>>> On Sun, May 22, 2016 at 9:06 AM, Yijie Shen <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I've been trying out Drill on the master branch lately and have
>>>> deployed a cluster on three physical servers.
>>>>
>>>> The input data `lineitem` is in parquet format, 150GB in total, with
>>>> 101MB per file and 1516 files in total.
>>>>
>>>> Each server has two Intel(R) Xeon(R) CPU E5645 @2.40GHz CPUs, 24 cores
>>>> in total, and 32GB memory.
>>>>
>>>> While executing Q1 using:
>>>>
>>>> SELECT
>>>> L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE),
>>>> SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)),
>>>> SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)), AVG(L_QUANTITY),
>>>> AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT), COUNT(1)
>>>> FROM
>>>> dfs.tpch.`lineitem`
>>>> WHERE
>>>> L_SHIPDATE<='1998-09-02'
>>>> GROUP BY L_RETURNFLAG, L_LINESTATUS
>>>> ORDER BY L_RETURNFLAG, L_LINESTATUS
>>>>
>>>> I noticed the parallelism was 51 (planner.width.max_per_node = 17) in
>>>> my case for Major Fragment 03 (Scan, Filter, Project, HashAgg and
>>>> Project), and each minor fragment lasted about 8 to 9 minutes. One
>>>> record for example:
>>>>
>>>> 03-00-xx hw080 7.309s 42.358s 9m35s 118,758,489 14,540 22:31:32 22:31:32
>>>> 33MB FINISHED
>>>>
>>>> Is this a normal speed (more than 10 minutes) for Drill on my current
>>>> cluster? Did I miss something important in the conf to accelerate the
>>>> execution?
>>>>
>>>> Thanks very much!
>>>>
>>>> Yijie
