Hello
I have loaded (via CTAS) the Star Schema Benchmark generated csv data
(scale factor 50) into parquet files.
For one of the benchmark queries, such as:
select
d.d_year,
c.c_region,
sum(l.lo_extendedprice*l.lo_discount) as revenue
from dfs.tpch.lineorder_part l,
dfs.tpch.dates d,
dfs.tpch.customer c
where l.lo_orderdate = d.d_datekey
and l.lo_custkey = c.c_custkey
and d.d_year=1995
group by d.d_year, c.c_region
order by d.d_year desc, c.c_region asc;
I got a minimum execution time of 59 seconds.
Table LINEORDER has 300M rows and is partitioned by the LO_ORDERDATE
column (2406 partitions in the resulting parquet files).
Table CUSTOMER has 1.5M rows and table DATES has 2556 rows; neither
table is partitioned.
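For reference, the partitioned fact table was created with a CTAS
roughly like the sketch below (the source path dfs.tpch_csv and the
few column casts shown are only illustrative; the real statement lists
all LINEORDER columns and reads the pipe-delimited SSB files through a
text-format storage plugin):

create table dfs.tpch.lineorder_part
partition by (lo_orderdate)
as select
  cast(columns[0]  as int) as lo_orderkey,
  cast(columns[2]  as int) as lo_custkey,
  cast(columns[5]  as int) as lo_orderdate,
  cast(columns[9]  as int) as lo_extendedprice,
  cast(columns[11] as int) as lo_discount
  -- ... remaining LINEORDER columns ...
from dfs.tpch_csv.`lineorder.tbl`;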
My Drill 1.5 configuration:
drill-env.sh:
DRILL_MAX_DIRECT_MEMORY="16G"
DRILL_HEAP="8G"
Changed sys.options:
planner.memory.max_query_memory_per_node = 8000000000
planner.memory_limit = 1000000000
planner.width.max_per_node = 16 (default was 12)
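For completeness, the equivalent ALTER SYSTEM statements (one way to
apply these settings from sqlline or the Web UI query page):

alter system set `planner.memory.max_query_memory_per_node` = 8000000000;
alter system set `planner.memory_limit` = 1000000000;
alter system set `planner.width.max_per_node` = 16;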
Drill is installed on a 16-vCPU Linux VM, and during query execution
all 16 vCPUs are 100% utilized.
Is there any chance to improve this query's execution time (maybe
with some additional sys.options changes)?
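If it helps, I can also attach the physical plan for the query,
captured with EXPLAIN (same query as above):

explain plan for
select d.d_year, c.c_region,
       sum(l.lo_extendedprice*l.lo_discount) as revenue
from dfs.tpch.lineorder_part l, dfs.tpch.dates d, dfs.tpch.customer c
where l.lo_orderdate = d.d_datekey
  and l.lo_custkey = c.c_custkey
  and d.d_year = 1995
group by d.d_year, c.c_region
order by d.d_year desc, c.c_region asc;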
Thanks!
P.S. I started learning and testing Apache Drill just two days ago.
Best regards,
Dmitry