Hi Matthias The waiting time for a PARQUET_ROW_GROUP_SCAN operator is the total time that all the fragments took to read the parquet data into memory as Drill's Value Vectors. So, 80 seconds would indicate that the bulk of the time is spent in just getting the data.
If you scroll down to the operator specific table.. you;ll find an entry on the lines of 09-xx-01 - PARQUET_ROW_GROUP_SCAN Within that collapsed table, you should find at the end a sub section for Operator Metrics. These metrics should be able to tell you where time is being spent the most on a per-fragment level. If the metrics are missing, that means the traditional Parquet reader was used instead of Drill's fast native parquet reader (Drill does this if it encounters parquet files with Nested data) and the time is being spent by the Parquet library in deserializing the file. In this case, you're out of luck and your best bet is to split the parquet file into multiple files or atleast multiple rowgroups. That way, Drill can create more fragments (assuming you've not maxed out that limit) and read the data in parallel. ~ Kunal On 11/29/2018 9:13:45 AM, Rosenthaler Matthias (PS-DI/ETF1.1) <[email protected]> wrote: Hi, I am using apache drill to query huge parquet files (100 MB) on a single node. A SELECT * query takes around 90 Seconds. 80 Seconds are "Waiting time". Can you explain what this waiting time means and how I am able to optimize it? Mit freundlichen Grüßen / Best regards Matthias Rosenthaler Powertrain Solutions, Engine Testing (PS-DI/ETF1.1) Robert Bosch AG | Robert-Bosch-Straße 1 | 4020 Linz | AUSTRIA | www.bosch.at Tel. +43 732 7667-479 | [email protected] Sitz: Robert Bosch Aktiengesellschaft, A-1030 Wien, Göllnergasse 15-17 , Registergericht: FN 55722 w HG-Wien Aufsichtsratsvorsitzender: Dr. Uwe Thomas; Geschäftsführung: Dr. Klaus Peter Fouquet DVR-Nr.: 0418871- ARA-Lizenz-Nr.: 1831 - UID-Nr.: ATU14719303 - Steuernummer 140/4988
