Hello
I have loaded (via CTAS) the Star Schema Benchmark generated csv data
(scale factor 50) into parquet files.
For one of the benchmark queries, such as:
select
d.d_year,
c.c_region,
sum(l.lo_extendedprice*l.lo_discount) as revenue
from dfs.tpch.lineorder_part l,
dfs.tpch.dates d,
dfs.tpch.customer c
where l.lo_orderdate = d.d_datekey
and l.lo_custkey = c.c_custkey
and d.d_year=1995
group by d.d_year, c.c_region
order by d.d_year desc, c.c_region asc;
I got a minimum execution time of 59 seconds.
Table LINEORDER has 300M rows and is partitioned by the LO_ORDERDATE
column (2406 partitions in the resulting parquet files).
Table CUSTOMER has 1.5M rows and table DATES has 2556 rows; neither
table is partitioned.
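For reference, the partitioned fact table was created with a CTAS
roughly like the sketch below (the source path dfs.tpch_csv and the
few column casts shown are only illustrative; the real statement lists
all LINEORDER columns and reads the pipe-delimited SSB files through a
text-format storage plugin):

create table dfs.tpch.lineorder_part
partition by (lo_orderdate)
as select
  cast(columns[0]  as int) as lo_orderkey,
  cast(columns[2]  as int) as lo_custkey,
  cast(columns[5]  as int) as lo_orderdate,
  cast(columns[9]  as int) as lo_extendedprice,
  cast(columns[11] as int) as lo_discount
  -- ... remaining LINEORDER columns ...
from dfs.tpch_csv.`lineorder.tbl`;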
My Drill 1.5 configuration:
drill-env.sh:
DRILL_MAX_DIRECT_MEMORY="16G"
DRILL_HEAP="8G"
Changed sys.options:
planner.memory.max_query_memory_per_node = 8000000000
planner.memory_limit = 1000000000
planner.width.max_per_node = 16 (default was 12)
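For completeness, the equivalent ALTER SYSTEM statements (one way to
apply these settings from sqlline or the Web UI query page):

alter system set `planner.memory.max_query_memory_per_node` = 8000000000;
alter system set `planner.memory_limit` = 1000000000;
alter system set `planner.width.max_per_node` = 16;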
Drill is installed on a 16-vCPU Linux VM, and during query execution
all 16 vCPUs are 100% utilized.
Is there any chance to improve this query's execution time (maybe
with some additional sys.options changes)?
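If it helps, I can also attach the physical plan for the query,
captured with EXPLAIN (same query as above):

explain plan for
select d.d_year, c.c_region,
       sum(l.lo_extendedprice*l.lo_discount) as revenue
from dfs.tpch.lineorder_part l, dfs.tpch.dates d, dfs.tpch.customer c
where l.lo_orderdate = d.d_datekey
  and l.lo_custkey = c.c_custkey
  and d.d_year = 1995
group by d.d_year, c.c_region
order by d.d_year desc, c.c_region asc;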
Thanks!
P.S. I started learning and testing Apache Drill just two days ago.
Best regards,
Dmitry