Hi Jeena
The JSON profile does not reveal much about why planning took so long; it only gives you information from the physical plan and roughly when planning completed (7+ min for the 2-node setup; 15+ min for the 5-node setup). The drillbit logs, however, will give you more information. For this, you'll need to look in the log for entries like:

2017-02-23 14:00:54,609 [27513143-8718-7a47-a2d4-06850755568a:foreman] DEBUG o.a.d.e.p.s.h.DefaultSqlHandler - VOLCANO:Physical Planning (49588ms)

You might need to update your logback.xml to emit this information by enabling DEBUG-level logging for these classes. These are the recommended loggers to set to DEBUG:

org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
org.apache.drill.exec.work.foreman.Foreman

You can then share the drillbit log file from a run with these loggers enabled.

Also, your profile JSONs indicate that you have a fairly slow underlying filesystem. I'm seeing an average of 3m45s to read 945K rows (2-node setup) and 2m25s to read 1.5M rows (5-node setup). Your 2-node setup shows 6 fragments processing 29 batches, while the 5-node setup shows 15 fragments processing 46 batches. For that number of rows, the amount of time spent is very high, which makes me believe that your filesystem (the Oracle Storage Cloud Service) is itself quite slow.

To speed up execution, you can try raising planner.width.max_per_node to a higher value (such as the number of cores on each node), as sketched below. This should increase parallelization and let Drill make use of all the cores on each node.

Kunal Khatua
Engineering
MapR<http://www.mapr.com/>
www.mapr.com<http://www.mapr.com/>
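P.S. In case it is handy, here is a minimal sketch of checking and changing that option from sqlline or the Web UI's Query page. The value 16 below is only a placeholder for the actual core count on your nodes, not a recommendation:

-- See the current setting before changing anything
SELECT * FROM sys.options WHERE name = 'planner.width.max_per_node';

-- Raise it for the current session only (16 is a stand-in for cores per node)
ALTER SESSION SET `planner.width.max_per_node` = 16;

-- Or persist it cluster-wide instead of per session
ALTER SYSTEM SET `planner.width.max_per_node` = 16;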
________________________________
From: Jeena Vinod <[email protected]>
Sent: Tuesday, February 28, 2017 12:24:52 PM
To: [email protected]
Subject: RE: Explain Plan for Parquet data is taking a lot of timre

Kindly let me know if there are any pointers on how to improve the response time for parquet data here.

Regards
Jeena

-----Original Message-----
From: Jeena Vinod
Sent: Tuesday, February 28, 2017 4:25 AM
To: [email protected]
Subject: RE: Explain Plan for Parquet data is taking a lot of timre

Hi,

I have two Drill 1.9 installations: one is a 5-node 32GB cluster and the other is a 2-node 16GB cluster, and I am running the same query in both places:

select * from `testdata` where <condition> limit 100;

testdata is 1GB of uncompressed parquet data. The query response times are as below:
2-node cluster - 13 min
5-node cluster - 19 min

I was expecting the 5-node cluster to be faster, but the results say otherwise. In the query profile, as expected, the 5-node cluster has more minor fragments, but the scan time is still higher. I have attached the JSON profiles for both.

Is this in any way related to the max batches/max records for the row group scan? Any suggestions on how we can get a better response time on the 5-node cluster are appreciated.

Regards
Jeena

-----Original Message-----
From: Jeena Vinod
Sent: Sunday, February 26, 2017 2:22 AM
To: [email protected]
Subject: RE: Explain Plan for Parquet data is taking a lot of timre

Please find attached the full JSON profile.

Regards
Jeena

-----Original Message-----
From: Padma Penumarthy [mailto:[email protected]]
Sent: Saturday, February 25, 2017 3:31 AM
To: [email protected]
Subject: Re: Explain Plan for Parquet data is taking a lot of timre

Yes, please do send the JSON profile.

Thanks,
Padma

> On Feb 24, 2017, at 1:56 PM, Jeena Vinod <[email protected]> wrote:
>
> Thanks for the suggestions.
>
> I did run the REFRESH TABLE METADATA command on this path before firing the select query.
>
> In Drill 1.9, there is an improvement in performance. I have 1.9 set up on a 2-node 16GB cluster, and here select * with limit 100 is taking less time than in 1.8, though the number of rows in ParquetGroupScan remains unchanged. The select query is taking around 8 minutes and explain plan took around 7 minutes. Also, in the Web console profile, the query stays in the STARTING status for almost 7 minutes.
>
> Query Plan for 1.9:
> 00-00 Screen : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1721
> 00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1720
> 00-02 SelectionVectorRemover : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1719
> 00-03 Limit(fetch=[100]) : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1718
> 00-04 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata, numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata, columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1717
>
> And from the query profile, it looks like most of the time is spent in PARQUET_ROW_GROUP_SCAN. I can attach the full JSON profile if it helps.
>
> Can there be further improvement in performance with 1.9?
>
> Regards
> Jeena
>
>
> -----Original Message-----
> From: Padma Penumarthy [mailto:[email protected]]
> Sent: Friday, February 24, 2017 11:22 PM
> To: [email protected]
> Subject: Re: Explain Plan for Parquet data is taking a lot of timre
>
> Yes, the limit is pushed down to the parquet reader in 1.9. But that will not help with planning time.
> It is definitely worth trying 1.9 though.
>
> Thanks,
> Padma
>
>
>> On Feb 24, 2017, at 7:26 AM, Andries Engelbrecht <[email protected]> wrote:
>>
>> Looks like the metadata cache is being used ("usedMetadataFile=true"). But to be sure, did you perform a REFRESH TABLE METADATA <path to table> on the parquet data?
>>
>> However, it looks like it is reading a full batch ("rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0").
>>
>> Didn't the limit operator get pushed down to the parquet reader in 1.9?
>>
>> Perhaps try 1.9 and see if in the ParquetGroupScan the number of rows gets reduced to 100.
>>
>> Can you look in the query profile to see where the time is spent, and also how long it takes before the query starts to run in the WebUI profile?
>>
>> Best Regards
>>
>> Andries Engelbrecht
>> Senior Solutions Architect
>> MapR Alliances and Channels Engineering
>> [email protected]
>>
>> ________________________________
>> From: Jinfeng Ni <[email protected]>
>> Sent: Thursday, February 23, 2017 4:53:34 PM
>> To: user
>> Subject: Re: Explain Plan for Parquet data is taking a lot of timre
>>
>> The reason the plan shows only a single parquet file is that "LIMIT 100" is applied and filters out the rest of them.
>>
>> Agreed that parquet metadata caching might help reduce planning time when there are a large number of parquet files.
>>
>> On Thu, Feb 23, 2017 at 4:44 PM, rahul challapalli <[email protected]> wrote:
>>> You said there are 2144 parquet files, but the plan suggests that you only have a single parquet file. In any case, it's a long time to plan the query.
>>> Did you try the metadata caching feature [1]?
>>>
>>> Also, how many row groups and columns are present in the parquet file?
>>>
>>> [1] https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
>>>
>>> - Rahul
>>>
>>> On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Drill is taking 23 minutes for a simple select * query with limit 100 on 1GB of uncompressed parquet data. EXPLAIN PLAN for this query is also taking that long (~23 minutes).
>>>>
>>>> Query: select * from <plugin>.root.`testdata` limit 100;
>>>>
>>>> Query Plan:
>>>>
>>>> 00-00 Screen : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1429
>>>>
>>>> 00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1428
>>>>
>>>> 00-02 SelectionVectorRemover : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1427
>>>>
>>>> 00-03 Limit(fetch=[100]) : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1426
>>>>
>>>> 00-04 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata, numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata, columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1425
>>>>
>>>> I am using Drill 1.8, set up on a 5-node 32GB cluster, and the data is in the Oracle Storage Cloud Service. When I run the same query on a 1GB TSV file in the same location, it takes only 38 seconds.
>>>>
>>>> Also, testdata contains around 2144 .parquet files, each around 500KB.
>>>>
>>>> Is there any additional configuration required for parquet?
>>>>
>>>> Kindly suggest how to improve the response time here.
>>>>
>>>> Regards
>>>> Jeena
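A note on the metadata caching feature that Rahul and Andries mention above: with roughly 2144 small parquet files, the planner otherwise has to read every file footer at planning time. The sketch below shows roughly how the cache is generated and then picked up; `dfs` and `/testdata` are only placeholders for the actual storage plugin and table path in this setup (the plans earlier in the thread already show usedMetadataFile=true, so the cache is being found here).

-- Build or refresh the parquet metadata cache for the table directory;
-- rerun whenever files under the directory are added or changed.
REFRESH TABLE METADATA dfs.`/testdata`;

-- Planning of later queries against the table can then use the cache file;
-- usedMetadataFile=true in the plan confirms it was picked up.
SELECT * FROM dfs.`/testdata` LIMIT 100;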
