Kindly let me know if there are any pointers on how to improve the response time for parquet data here.

Regards
Jeena
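The plan check Andries suggests further down (usedMetadataFile and the ParquetGroupScan rowcount) can be done with a plain EXPLAIN over the same query; a minimal sketch, keeping the <plugin>.root.`testdata` placeholder from the original question:

EXPLAIN PLAN FOR
SELECT * FROM <plugin>.root.`testdata` LIMIT 100;

In the Scan line of the output, usedMetadataFile=true indicates the metadata cache file is being read, and the ParquetGroupScan rowcount shows whether the limit has been pushed down into the scan.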
-----Original Message-----
From: Jeena Vinod
Sent: Tuesday, February 28, 2017 4:25 AM
To: [email protected]
Subject: RE: Explain Plan for Parquet data is taking a lot of time

Hi,

I have two Drill 1.9 installations. One is a 5-node 32GB cluster and the other is a 2-node 16GB cluster, and I am running the same query in both places.

select * from `testdata` where <condition> limit 100;

testdata is 1GB of uncompressed parquet data. The query response times are as below:

2-node cluster - 13 min
5-node cluster - 19 min

I was expecting the 5-node cluster to be faster, but the results say otherwise. In the query profile, as expected, the 5-node cluster has more minor fragments, but the scan time is still higher. I have attached the JSON profiles for both.

Is this in any way related to the max batches/max records for the row group scan? Any suggestions on how we can get a better response time on the 5-node cluster are appreciated.

Regards
Jeena

-----Original Message-----
From: Jeena Vinod
Sent: Sunday, February 26, 2017 2:22 AM
To: [email protected]
Subject: RE: Explain Plan for Parquet data is taking a lot of time

Please find attached the full JSON profile.

Regards
Jeena

-----Original Message-----
From: Padma Penumarthy [mailto:[email protected]]
Sent: Saturday, February 25, 2017 3:31 AM
To: [email protected]
Subject: Re: Explain Plan for Parquet data is taking a lot of time

Yes, please do send the JSON profile.

Thanks,
Padma

> On Feb 24, 2017, at 1:56 PM, Jeena Vinod <[email protected]> wrote:
>
> Thanks for the suggestions.
>
> I did run the REFRESH TABLE METADATA command on this path before firing the select query.
>
> In Drill 1.9 there is an improvement in performance. I have 1.9 set up on a 2-node 16GB cluster, and here select * with limit 100 takes less time than on 1.8, though the number of rows in ParquetGroupScan remains unchanged. The select query is taking around 8 minutes and the explain plan took around 7 minutes. Also, in the Web Console profile the query stays in the STARTING status for almost 7 minutes.
>
> Query Plan for 1.9:
> 00-00 Screen : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1721
> 00-01   Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1720
> 00-02     SelectionVectorRemover : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1719
> 00-03       Limit(fetch=[100]) : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1718
> 00-04         Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata, numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata, columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1717
>
> And from the query profile, it looks like most of the time is spent in PARQUET_ROW_GROUP_SCAN. I can attach the full JSON profile if it helps.
>
> Can there be further improvement in performance with 1.9?
>
> Regards
> Jeena
>
>
> -----Original Message-----
> From: Padma Penumarthy [mailto:[email protected]]
> Sent: Friday, February 24, 2017 11:22 PM
> To: [email protected]
> Subject: Re: Explain Plan for Parquet data is taking a lot of time
>
> Yes, the limit is pushed down to the parquet reader in 1.9. But that will not help with planning time.
> It is definitely worth trying with 1.9 though.
>
> Thanks,
> Padma
>
>
>> On Feb 24, 2017, at 7:26 AM, Andries Engelbrecht <[email protected]> wrote:
>>
>> Looks like the metadata cache is being used ("usedMetadataFile=true"). But to be sure, did you perform a REFRESH TABLE METADATA <path to table> on the parquet data?
>>
>> However, it looks like it is reading a full batch: "rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0".
>>
>> Didn't the limit operator get pushed down to the parquet reader in 1.9?
>>
>> Perhaps try 1.9 and see if in the ParquetGroupScan the number of rows gets reduced to 100.
>>
>> Can you also look in the query profile to see where the time is spent, and how long it takes before the query starts to run in the WebUI profile?
>>
>> Best Regards
>>
>> Andries Engelbrecht
>> Senior Solutions Architect
>> MapR Alliances and Channels Engineering
>> [email protected]
>>
>> ________________________________
>> From: Jinfeng Ni <[email protected]>
>> Sent: Thursday, February 23, 2017 4:53:34 PM
>> To: user
>> Subject: Re: Explain Plan for Parquet data is taking a lot of time
>>
>> The reason the plan shows only a single parquet file is that "LIMIT 100" is applied and filters out the rest of them.
>>
>> Agreed that parquet metadata caching might help reduce planning time when there are a large number of parquet files.
>>
>> On Thu, Feb 23, 2017 at 4:44 PM, rahul challapalli <[email protected]> wrote:
>>> You said there are 2144 parquet files, but the plan suggests that you only have a single parquet file. In any case, it's a long time to plan the query.
>>> Did you try the metadata caching feature [1]?
>>>
>>> Also, how many row groups and columns are present in the parquet file?
>>>
>>> [1] https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
>>>
>>> - Rahul
>>>
>>> On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Drill is taking 23 minutes for a simple select * query with limit 100 on 1GB of uncompressed parquet data. EXPLAIN PLAN for this query is also taking that long (~23 minutes).
>>>>
>>>> Query: select * from <plugin>.root.`testdata` limit 100;
>>>>
>>>> Query Plan:
>>>>
>>>> 00-00 Screen : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1429
>>>> 00-01   Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1428
>>>> 00-02     SelectionVectorRemover : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1427
>>>> 00-03       Limit(fetch=[100]) : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1426
>>>> 00-04         Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata, numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata, columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1425
>>>>
>>>> I am using Drill 1.8, set up on a 5-node 32GB cluster, and the data is in Oracle Storage Cloud Service. When I run the same query on a 1GB TSV file in this location, it takes only 38 seconds.
>>>>
>>>> Also, testdata contains around 2144 .parquet files, each around 500KB.
>>>>
>>>> Is there any additional configuration required for parquet?
>>>> Kindly suggest how to improve the response time here.
>>>>
>>>> Regards
>>>> Jeena
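The metadata caching feature Rahul references in [1] above is enabled per table with a single command; a minimal sketch, again keeping the <plugin>.root.`testdata` placeholder from the query above:

REFRESH TABLE METADATA <plugin>.root.`testdata`;

Once the cache file has been generated, planning can read it instead of opening each of the ~2144 parquet footers, which is the cost the replies above suggest may be dominating the planning time.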
