Hi Matthias

The waiting time for a PARQUET_ROW_GROUP_SCAN operator is the total time that 
all the fragments took to read the Parquet data into memory as Drill's Value 
Vectors. So 80 seconds of waiting time indicates that the bulk of the query 
time is spent just getting the data. 

If you scroll down to the operator-specific table, you'll find an entry along 
the lines of 
09-xx-01 - PARQUET_ROW_GROUP_SCAN

If you expand that collapsed table, you should find a subsection for Operator 
Metrics at the end. 

These metrics should tell you where most of the time is being spent at a 
per-fragment level. 
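
If you would rather compare the fragments outside the web UI, a saved query 
profile can be inspected with a short script. The following is only a minimal 
sketch: the field names (fragmentProfile, minorFragmentProfile, 
operatorProfile, operatorId, waitNanos) are assumptions about the profile JSON 
layout and should be checked against a profile from your own Drill version, 
and the sample data is hand-made for illustration. 

```python
from collections import defaultdict


def wait_seconds_per_fragment(profile, operator_id):
    """Return {(major_id, minor_id): wait seconds} for one operator.

    `profile` is a query profile already parsed from JSON. The field
    names used below are assumptions about the profile layout; verify
    them against a profile exported from your own Drill version.
    """
    waits = defaultdict(float)
    for major in profile.get("fragmentProfile", []):
        major_id = major.get("majorFragmentId")
        for minor in major.get("minorFragmentProfile", []):
            minor_id = minor.get("minorFragmentId")
            for op in minor.get("operatorProfile", []):
                if op.get("operatorId") == operator_id:
                    # Convert nanoseconds to seconds for readability.
                    waits[(major_id, minor_id)] += op.get("waitNanos", 0) / 1e9
    return dict(waits)


# Hypothetical, hand-made profile fragment (not real Drill output).
sample_profile = {
    "fragmentProfile": [
        {
            "majorFragmentId": 9,
            "minorFragmentProfile": [
                {"minorFragmentId": 0,
                 "operatorProfile": [{"operatorId": 1,
                                      "waitNanos": 2_500_000_000}]},
                {"minorFragmentId": 1,
                 "operatorProfile": [{"operatorId": 1,
                                      "waitNanos": 500_000_000}]},
            ],
        }
    ]
}

per_fragment = wait_seconds_per_fragment(sample_profile, operator_id=1)
# The fragment with the largest wait is the one worth looking at first.
slowest = max(per_fragment, key=per_fragment.get)
```

In the 09-xx-01 naming above, 09 would be the major fragment, xx the minor 
fragment and 01 the operator, which is why the sketch keys its result on the 
(major, minor) pair. 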

If the metrics are missing, that means the traditional Parquet reader was used 
instead of Drill's fast native Parquet reader (Drill does this if it encounters 
Parquet files with nested data) and the time is being spent by the Parquet 
library in deserializing the file. In this case, you're out of luck and your 
best bet is to split the Parquet file into multiple files or at least multiple 
row groups. That way, Drill can create more fragments (assuming you've not 
maxed out that limit) and read the data in parallel.
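
To make the splitting advice concrete, here is a rough sketch of the 
arithmetic for choosing a row-group size when rewriting the file. The row 
count (1,000,000) and the target of 8 row groups are hypothetical numbers, not 
taken from your setup, and the helper name is made up for this example. 

```python
def plan_row_groups(total_rows, target_row_groups):
    """Return (rows_per_group, actual_group_count) for a rewrite.

    Rows are spread as evenly as possible. The actual group count can be
    lower than the target when the rows do not divide evenly.
    """
    if total_rows <= 0 or target_row_groups <= 0:
        raise ValueError("total_rows and target_row_groups must be positive")
    # Integer ceiling division, so no row is left without a group.
    rows_per_group = -(-total_rows // target_row_groups)
    actual_group_count = -(-total_rows // rows_per_group)
    return rows_per_group, actual_group_count


# Hypothetical example: one million rows, aiming for 8 parallel readers.
rows_per_group, group_count = plan_row_groups(1_000_000, 8)
```

The resulting rows-per-group value could then be handed to whichever Parquet 
writer produces the file, for example the row_group_size argument of 
pyarrow.parquet.write_table if the file is written from Python. 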

~ Kunal

On 11/29/2018 9:13:45 AM, Rosenthaler Matthias (PS-DI/ETF1.1) 
<[email protected]> wrote:
Hi,

I am using Apache Drill to query huge Parquet files (100 MB) on a single node.
A SELECT * query takes around 90 seconds, of which 80 seconds are "Waiting time".
Can you explain what this waiting time means and how I can optimize it?

Best regards

Matthias Rosenthaler

Powertrain Solutions, Engine Testing (PS-DI/ETF1.1)
Robert Bosch AG | Robert-Bosch-Straße 1 | 4020 Linz | AUSTRIA | www.bosch.at
Tel. +43 732 7667-479 | [email protected]

Registered office: Robert Bosch Aktiengesellschaft, A-1030 Wien, 
Göllnergasse 15-17, Court of registration: FN 55722 w HG-Wien
Chairman of the Supervisory Board: Dr. Uwe Thomas; Management: Dr. Klaus Peter 
Fouquet
DVR no.: 0418871 - ARA license no.: 1831 - VAT ID no.: ATU14719303 - Tax number 
140/4988

