Thank you for the feedback. It seems there is nothing more I can do on my end. What are my next steps? Shall I create a defect in the Drill Jira?
- Dave On Mon, Feb 13, 2017 at 5:13 PM, Jinfeng Ni <[email protected]> wrote: > The size of parquet files will matter in terms of meta data access > time, which is just 212 ms according to your log file. My > understanding is it does not matter too much to the overall planning > times. That's why it probably makes sense to try over such a small toy > example. > > Normally the planning time for such simple query should be much > shorter than 12 minutes. It indicates it could be caused by a code > bug, or something else that we are currently unaware of. > > > > > > > On Mon, Feb 13, 2017 at 2:47 PM, David Kincaid <[email protected]> > wrote: > > The example in DRILL-5183 is just a very small toy example to demonstrate > > the bug with how Drill reads Parquet array fields. It doesn't have > anything > > to do with this planning issue (at least I don't think it does). Sorry > if I > > confused things with that reference. > > > > I just tried running our query directly against the table at > > dfs.`parquet/transaction` and get the same result (12 minutes of planning > > time). I disabled the cp and s3 storage plugins that were enabled so that > > only the dfs storage plugin is enabled and the result is the same. > > > > Is this expected for Drill to take this long in the planning phase for a > > query? Is there anything else I can try or information I could provide to > > help identify the bug (seems like a bug to me)? I really appreciate you > > guys helping out so quickly this afternoon. > > > > - Dave > > > > On Mon, Feb 13, 2017 at 4:13 PM, Jinfeng Ni <[email protected]> wrote: > > > >> I downloaded books.parquet from DRILL-5183, and created a view on top > >> of this single parquet file. Then, run EXPLAIN for the query, and it > >> completes within 1.2 seconds on Drill 1.8.0 release. (The # of parquet > >> files would impact the time to fetch metadata. Since it's not the > >> bottleneck in this case, it should not cause a big difference). > >> > >> Do you see the long planning time issue for this query only, or it > >> happens for other queries as well? Besides the possibility of planning > >> rule bugs, we once saw another possible cause of long planning issue. > >> In your storage plugin configuration, if you enable some other storage > >> plugin (for instance, hbase, or hive etc) which are slow to access, > >> then those un-relevant storage plugin might impact your query as well. > >> You may temporarily disable those storage plugins, and see if it's the > >> cause of the problem. > >> > >> 0: jdbc:drill:zk=local> explain plan for > >> . . . . . . . . . . . > select fltb1.sapId, yearmo, > >> . . . . . . . . . . . > COUNT(*) as totalcnt, > >> . . . . . . . . . . . > count(distinct(CASE > >> . . . . . . . . . . . > WHEN > >> . . . . . . . . . . . > (REPEATED_CONTAINS(fltb1. > >> classLabels, > >> . . . . . . . . . . . > > >> 'Thing:Service:MedicalService:Diagnostic:Radiology:Ultrasound.*')) > >> . . . . . . . . . . . > THEN fltb1.invoiceId > >> . . . . . . . . . . . > END)) as ultracount, > >> . . . . . . . . . . . > count(distinct (CASE > >> . . . . . . . . . . . > WHEN > >> . . . . . . . . . . . > (REPEATED_CONTAINS(fltb1. > >> classLabels, > >> . . . . . . . . . . . > > >> 'Thing:Service:MedicalService:Diagnostic:LaboratoryTest.*')) > >> . . . . . . . . . . . > THEN fltb1.invoiceId > >> . . . . . . . . . . . > END)) as labcount > >> . . . . . . . . . . . > from ( > >> . . . . . . . . . . . > select sapid, invoiceId, > >> . . . . . . . . . . . > TO_CHAR(TO_TIMESTAMP(transactionDate, > >> 'YYYY-MM-dd HH:mm:ss.SSSSSS'), 'yyyy-MM') yearmo, > >> . . . . . . . . . . . > classLabels > >> . . . . . . . . . . . > from dfs.tmp.transactionView) fltb1 > >> . . . . . . . . . . . > group by fltb1.sapId, yearmo; > >> +------+------+ > >> | text | json | > >> +------+------+ > >> | 00-00 Screen > >> 00-01 Project(sapId=[$0], yearmo=[$1], totalcnt=[$2], > >> ultracount=[$3], labcount=[$4]) > >> .................................... > >> 00-09 SelectionVectorRemover > >> 00-12 Sort(sort0=[$0], sort1=[$1], dir0=[ASC], > dir1=[ASC]) > >> 00-15 HashAgg(group=[{0, 1}], totalcnt=[COUNT()]) > >> ................................ > >> 00-22 Scan(groupscan=[ParquetGroupScan > >> [entries=[ReadEntryWithPath [path=file:/tmp/parquet/transaction]], > >> selectionRoot=file:/tmp/parquet/transaction, numFiles=1, > >> usedMetadataFile=false, columns=[`sapId`, `invoiceId`, > >> `transactionDate`, `classLabels`.`array`]]]) > >> > >> 1 row selected (1.195 seconds) > >> > >> > >> On Mon, Feb 13, 2017 at 1:51 PM, David Kincaid <[email protected]> > >> wrote: > >> > Here is the entire transactionView.view.drill file. As you can see the > >> view > >> > itself is very simple and is just wrapping a syntactic problem with > the > >> > array field. That's an issue I reported in Jira under DRILL-5183 ( > >> > https://issues.apache.org/jira/browse/DRILL-5183) > >> > > >> > { > >> > "name" : "transactionView", > >> > "sql" : "SELECT `transactionRowKey`, `sapId`, `practiceName`, > >> > `practiceCity`, `practiceState`, `practicePostalCode`, `animalId`, > >> > `dateOfBirth`, `species`, `breed`, `gender`, `status`, `ownerId`, > >> > `itemType`, `classification`, `subclass`, `practiceDescription`, > >> > `clientDescription`, `invoiceId`, `unitOfMeasure`, `vendorName`, > >> `vaccine`, > >> > `rabies`, `vaccineType`, `price`, `quantity`, `transactionDate`, > >> > `visitReason`, `speciesCode`, `genderCode`, > `t`.`classLabels`['array'] AS > >> > `classLabels`\nFROM `dfs`.`/parquet/transaction` AS `t`", > >> > "fields" : [ { > >> > "name" : "transactionRowKey", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "sapId", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "practiceName", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "practiceCity", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "practiceState", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "practicePostalCode", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "animalId", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "dateOfBirth", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "species", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "breed", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "gender", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "status", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "ownerId", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "itemType", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "classification", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "subclass", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "practiceDescription", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "clientDescription", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "invoiceId", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "unitOfMeasure", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "vendorName", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "vaccine", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "rabies", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "vaccineType", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "price", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "quantity", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "transactionDate", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "visitReason", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "speciesCode", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "genderCode", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > }, { > >> > "name" : "classLabels", > >> > "type" : "ANY", > >> > "isNullable" : true > >> > } ], > >> > "workspaceSchemaPath" : [ ] > >> > } > >> > > >> > On Mon, Feb 13, 2017 at 3:47 PM, Jinfeng Ni <[email protected]> wrote: > >> > > >> >> Yes, the log confirmed that the planning, especially physical > >> >> planning, is the one that took most of the time. > >> >> > >> >> If the definition of view s3.cisexport.transactionView is not very > >> >> complicated (involves large # of tables), then it's possible that > some > >> >> planner rules have a bug. (In the past, we once saw couple of planner > >> >> rules would be fired in a loop). > >> >> > >> >> Is it possible that you can share the DDL of the view? That may help > >> >> us re-produce the problem and take a look at the trace of Calcite, > >> >> which Drill uses as the query planner. > >> >> > >> >> > >> >> > >> >> > >> >> > >> >
