Hi,
I am running a query:
CREATE TABLE `s3://temp.parquet`
PARTITION BY (`fact`)
AS
WITH facts AS (
SELECT col1, col2, col3
FROM TABLE(`s3://...facts.csv`
(type => 'text', fieldDelimiter => '|', extractHeader => true))
WHERE ...
)
,dimension1 AS (
SELECT col1, col2, col3
FROM TABLE(`s3://...demension1.csv`
(type => 'text', fieldDelimiter => '|', extractHeader => true))
)
,dimension2 AS (
SELECT col1, col2, col3
FROM TABLE(`s3://...demension2.csv`
(type => 'text', fieldDelimiter => '|', extractHeader => true))
)
SELECT
col1, col2, etc
FROM facts
INNER JOIN ...
INNER JOIN ...
Drill 1.12.0 is running on AWS EMR (3 x r4.2xlarge; I would add more nodes,
but as per below the query only runs a single fragment). Other queries do run
correctly with multiple fragments.
I have a couple of questions please:
1. The only way I can get the query to use all available memory (as defined
by planner.memory.max_query_memory_per_node = n) is to
set planner.memory.min_memory_per_buffered_op = n (i.e. the same value as
planner.memory.max_query_memory_per_node). I can only find limited
documentation on planner.memory.min_memory_per_buffered_op; is this
expected behaviour?
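For reference, I am applying those settings per session along these lines
(the 8 GB value here is illustrative, not my actual configuration):

```sql
-- Illustrative: allow up to 8 GB per node for buffered operators.
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 8589934592;
-- Only when this is set to the same value does the query use all of it.
ALTER SESSION SET `planner.memory.min_memory_per_buffered_op` = 8589934592;
```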
2. Irrespective of the file sizes, the query always runs as a single thread.
Given the files range from ~100MB to ~100GB, this means it can take
significant time (hours). Is this due to a limitation of querying data from
S3, or can it be optimized?
I have been setting the following based on documents I have reviewed on
optimizing Drill:
a. exec.min_hash_table_size
b. planner.width.max_per_node
c. planner.memory.max_query_memory_per_node
d. planner.memory.min_memory_per_buffered_op
e. planner.memory_limit
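For completeness, I have been adjusting these via ALTER SESSION roughly as
follows (the values shown are examples, not my exact settings):

```sql
ALTER SESSION SET `exec.min_hash_table_size` = 131072;
ALTER SESSION SET `planner.width.max_per_node` = 8;
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 8589934592;
ALTER SESSION SET `planner.memory.min_memory_per_buffered_op` = 8589934592;
ALTER SESSION SET `planner.memory_limit` = 8589934592;
```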
Is there any other likely "constraint" I should be looking at?
Is there a "query constraints check" process I could work through
to optimize my query (I haven't found one)? If there isn't a generic one, I
will happily write one up based on my workings.
Regards,
Francis