I ran the EXPLAIN you suggested against the entire 100-file table and it takes about 3 seconds. I will try to get a defect written up in the next few days.
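[Editor's note: since the whole thread turns on comparing planning times across tests, here is a minimal sketch of a timing harness for such measurements. Plain Python, stdlib only; `run_explain` is a hypothetical stand-in for whatever client call actually submits the EXPLAIN statement to Drill (e.g. over JDBC/ODBC) — it is not part of any Drill API.]

```python
import time

def time_planning(run_explain):
    """Time a single EXPLAIN round trip.

    run_explain is any zero-argument callable that submits the
    EXPLAIN PLAN FOR ... statement and blocks until the plan text
    comes back. Returns elapsed wall-clock seconds.
    """
    start = time.perf_counter()
    run_explain()
    return time.perf_counter() - start

# Dummy callable standing in for a real Drill client call:
elapsed = time_planning(lambda: time.sleep(0.1))
print(f"planning took {elapsed:.2f}s")
```

Using `time.perf_counter` (a monotonic clock) rather than wall-clock timestamps keeps the 3-second vs. 12-minute comparisons in this thread consistent across runs.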
- Dave

On Tue, Feb 14, 2017 at 9:06 PM, Jinfeng Ni <[email protected]> wrote:

From the two tests you did, I'm inclined to think there might be something special about your parquet files. How do you generate these parquet files? Do they contain normal data types (int/float/varchar) or complex types (array/map)?

In our environment, we also have hundreds of parquet files, each hundreds of MB in size. A typical query (several tables joined) takes a couple of seconds in planning.

One more test, if you can help run it:

EXPLAIN PLAN FOR
SELECT someCol1, someCol2
FROM dfs.`parquet/transaction/OneSingleFile.parquet`;

The query above is simple enough that the planner should not spend a long time enumerating different choices. If it still takes a long time in query planning, the more likely cause might be in the parquet files you used.

On Tue, Feb 14, 2017 at 1:06 PM, David Kincaid <[email protected]> wrote:

I will write up a defect. The first test you suggested below - running the query on just one of our Parquet files - produces the same result (10-12 minute planning time). However, the second test - using cp.`tpch/nation.parquet` - results in a planning time of only about a minute. So I'm not sure how to interpret that. What does that mean to you all?

- Dave

On Tue, Feb 14, 2017 at 12:37 PM, Jinfeng Ni <[email protected]> wrote:

Normally, slow query planning can be caused by:

1. A planner rule hitting a bug when processing certain operators in the query, for instance a join operator or a distinct aggregate. The query I tried on a small file seems to rule out this possibility.
2. Parquet metadata access time. According to the log, this does not seem to be the issue.
3. Something we are not aware of.

To help get some clues, can you try the following:

1. Run the query over one single parquet file, instead of 100 parquet files. You can change the query to use dfs.`parquet/transaction/OneSingleFile.parquet`. I'm wondering if the planning time is proportional to the number of parquet files.

2. Try your query with dfs.`parquet/transaction/OneSingleFile.parquet` replaced by cp.`tpch/nation.parquet`, which is a small TPC-H parquet file (you need to re-enable the storage plugin 'cp'). Running EXPLAIN should be fine. This will tell us whether the problem is caused by the parquet source or by the query itself.

Yes, please create a defect in the Drill JIRA.

On Tue, Feb 14, 2017 at 5:02 AM, David Kincaid <[email protected]> wrote:

Thank you for the feedback. It seems there is nothing more I can do on my end. What are my next steps? Shall I create a defect in the Drill JIRA?

- Dave

On Mon, Feb 13, 2017 at 5:13 PM, Jinfeng Ni <[email protected]> wrote:

The size of the parquet files matters in terms of metadata access time, which was just 212 ms according to your log file. My understanding is that it does not matter much to the overall planning time. That's why it probably makes sense to try such a small toy example.

Normally the planning time for such a simple query should be much shorter than 12 minutes. That indicates it could be caused by a code bug, or by something else we are currently unaware of.

On Mon, Feb 13, 2017 at 2:47 PM, David Kincaid <[email protected]> wrote:

The example in DRILL-5183 is just a very small toy example to demonstrate the bug in how Drill reads Parquet array fields. It doesn't have anything to do with this planning issue (at least I don't think it does).
Sorry if I confused things with that reference.

I just tried running our query directly against the table at dfs.`parquet/transaction` and get the same result (12 minutes of planning time). I disabled the cp and s3 storage plugins that were enabled, so that only the dfs storage plugin is enabled, and the result is the same.

Is it expected for Drill to take this long in the planning phase for a query? Is there anything else I can try, or information I could provide, to help identify the bug (it seems like a bug to me)? I really appreciate you guys helping out so quickly this afternoon.

- Dave

On Mon, Feb 13, 2017 at 4:13 PM, Jinfeng Ni <[email protected]> wrote:

I downloaded books.parquet from DRILL-5183 and created a view on top of this single parquet file. Then I ran EXPLAIN for the query, and it completed within 1.2 seconds on the Drill 1.8.0 release. (The number of parquet files impacts the time to fetch metadata; since that is not the bottleneck in this case, it should not cause a big difference.)

Do you see the long planning time for this query only, or does it happen for other queries as well? Besides the possibility of planner rule bugs, we once saw another possible cause of long planning: if your storage plugin configuration enables some other storage plugin (for instance hbase, hive, etc.) that is slow to access, those irrelevant storage plugins can impact your query as well. You may temporarily disable those storage plugins and see if that is the cause of the problem.

0: jdbc:drill:zk=local> explain plan for
select fltb1.sapId, yearmo,
       COUNT(*) as totalcnt,
       count(distinct(CASE
                        WHEN (REPEATED_CONTAINS(fltb1.classLabels,
                              'Thing:Service:MedicalService:Diagnostic:Radiology:Ultrasound.*'))
                        THEN fltb1.invoiceId
                      END)) as ultracount,
       count(distinct(CASE
                        WHEN (REPEATED_CONTAINS(fltb1.classLabels,
                              'Thing:Service:MedicalService:Diagnostic:LaboratoryTest.*'))
                        THEN fltb1.invoiceId
                      END)) as labcount
from (
  select sapid, invoiceId,
         TO_CHAR(TO_TIMESTAMP(transactionDate, 'YYYY-MM-dd HH:mm:ss.SSSSSS'), 'yyyy-MM') yearmo,
         classLabels
  from dfs.tmp.transactionView) fltb1
group by fltb1.sapId, yearmo;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01   Project(sapId=[$0], yearmo=[$1], totalcnt=[$2], ultracount=[$3], labcount=[$4])
....................................
00-09     SelectionVectorRemover
00-12       Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
00-15         HashAgg(group=[{0, 1}], totalcnt=[COUNT()])
................................
> >> >> >> 00-22 Scan(groupscan=[ParquetGroupScan > >> >> >> [entries=[ReadEntryWithPath [path=file:/tmp/parquet/ > transaction]], > >> >> >> selectionRoot=file:/tmp/parquet/transaction, numFiles=1, > >> >> >> usedMetadataFile=false, columns=[`sapId`, `invoiceId`, > >> >> >> `transactionDate`, `classLabels`.`array`]]]) > >> >> >> > >> >> >> 1 row selected (1.195 seconds) > >> >> >> > >> >> >> > >> >> >> On Mon, Feb 13, 2017 at 1:51 PM, David Kincaid < > >> [email protected]> > >> >> >> wrote: > >> >> >> > Here is the entire transactionView.view.drill file. As you can > see > >> the > >> >> >> view > >> >> >> > itself is very simple and is just wrapping a syntactic problem > with > >> >> the > >> >> >> > array field. That's an issue I reported in Jira under > DRILL-5183 ( > >> >> >> > https://issues.apache.org/jira/browse/DRILL-5183) > >> >> >> > > >> >> >> > { > >> >> >> > "name" : "transactionView", > >> >> >> > "sql" : "SELECT `transactionRowKey`, `sapId`, `practiceName`, > >> >> >> > `practiceCity`, `practiceState`, `practicePostalCode`, > `animalId`, > >> >> >> > `dateOfBirth`, `species`, `breed`, `gender`, `status`, > `ownerId`, > >> >> >> > `itemType`, `classification`, `subclass`, `practiceDescription`, > >> >> >> > `clientDescription`, `invoiceId`, `unitOfMeasure`, `vendorName`, > >> >> >> `vaccine`, > >> >> >> > `rabies`, `vaccineType`, `price`, `quantity`, `transactionDate`, > >> >> >> > `visitReason`, `speciesCode`, `genderCode`, > >> >> `t`.`classLabels`['array'] AS > >> >> >> > `classLabels`\nFROM `dfs`.`/parquet/transaction` AS `t`", > >> >> >> > "fields" : [ { > >> >> >> > "name" : "transactionRowKey", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "sapId", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "practiceName", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "practiceCity", > >> >> 
>> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "practiceState", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "practicePostalCode", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "animalId", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "dateOfBirth", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "species", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "breed", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "gender", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "status", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "ownerId", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "itemType", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "classification", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "subclass", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "practiceDescription", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "clientDescription", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "invoiceId", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "unitOfMeasure", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "vendorName", > >> >> >> > "type" : 
"ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "vaccine", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "rabies", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "vaccineType", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "price", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "quantity", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "transactionDate", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "visitReason", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "speciesCode", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "genderCode", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > }, { > >> >> >> > "name" : "classLabels", > >> >> >> > "type" : "ANY", > >> >> >> > "isNullable" : true > >> >> >> > } ], > >> >> >> > "workspaceSchemaPath" : [ ] > >> >> >> > } > >> >> >> > > >> >> >> > On Mon, Feb 13, 2017 at 3:47 PM, Jinfeng Ni <[email protected]> > >> wrote: > >> >> >> > > >> >> >> >> Yes, the log confirmed that the planning, especially physical > >> >> >> >> planning, is the one that took most of the time. > >> >> >> >> > >> >> >> >> If the definition of view s3.cisexport.transactionView is not > very > >> >> >> >> complicated (involves large # of tables), then it's possible > that > >> >> some > >> >> >> >> planner rules have a bug. (In the past, we once saw couple of > >> planner > >> >> >> >> rules would be fired in a loop). > >> >> >> >> > >> >> >> >> Is it possible that you can share the DDL of the view? 
That may help us reproduce the problem and take a look at the trace of Calcite, which Drill uses as its query planner.
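[Editor's note: the view file shared above is plain JSON, so it can be sanity-checked programmatically. A small sketch, Python stdlib only, that parses a trimmed-down copy of the transactionView definition from this thread and confirms the columns all came through as type ANY. Only 4 of the 31 fields are reproduced here to keep the example short.]

```python
import json

# Trimmed copy of transactionView.view.drill from this thread
# (4 of the 31 fields shown; all fields in the real file have the
# same {"type": "ANY", "isNullable": true} shape).
view_json = """
{
  "name" : "transactionView",
  "fields" : [
    { "name" : "transactionRowKey", "type" : "ANY", "isNullable" : true },
    { "name" : "sapId", "type" : "ANY", "isNullable" : true },
    { "name" : "transactionDate", "type" : "ANY", "isNullable" : true },
    { "name" : "classLabels", "type" : "ANY", "isNullable" : true }
  ],
  "workspaceSchemaPath" : [ ]
}
"""

view = json.loads(view_json)
names = [f["name"] for f in view["fields"]]
types = {f["type"] for f in view["fields"]}
print(view["name"], names, types)
```

A check like this makes it easy to confirm, before filing the JIRA, that the view really is a flat single-table projection with no joins, which is what rules out the "large number of tables" planner-rule scenario mentioned above.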
