RE: Drill join performance

Dmitry Krivov Sat, 19 Mar 2016 02:41:45 -0700

Just for info :

After recreating tables with explicit columns CASTing  have double
performace of this query (from  60 to 35 sec.)


Best regards,
Dmitry

> Hello
>
> I have load (as CTAS) into parquet-files StarShema Benchmark generated
> csv-data (scale factor 50)
>
> For one of bencmark query's like :
>
> select
> d.d_year,
> c.c_region,
> sum(l.lo_extendedprice*l.lo_discount) as revenue
> from dfs.tpch.lineorder_part l,
>        dfs.tpch.dates d,
>        dfs.tpch.customer c
> where l.lo_orderdate = d.d_datekey
>          and l.lo_custkey = c.c_custkey
>          and d.d_year=1995
> group by d.d_year, c.c_region
> order by d.d_year desc, c.c_region asc;
>
> got min. exec time of 59 sec.
>
> Table LINEORDER have 300M rows and partitioned by LO_ORDERDATE column (2406
> partitions in related parquet-files)
> Table CUSTOMER have 1.5M rows and table DATES have 2556 rows, both tables
> not partitioned
>
> Drill 1.5 conf. have :
>
> drill-env.sh :
> DRILL_MAX_DIRECT_MEMORY="16G"
> DRILL_HEAP="8G"
>
> sys.options changed  :
>
> planner.memory.max_query_memory_per_node = 8 000 000 000
> planner.memory_limit = 1 000 000 000
> planner.width.max_per_node = 16 (was 12 by default)
>
> Drill is installed on 16VCPU Linux VM and under query runtime all 16VCPU's
> 100% utilized.
>
> Is there any chance to improve this query exectime  (my be with some
> additional sys.options changes) ?
>
> Thank's!
>
> P.S. Just two days as starting to learn and test Apache Drill
>
> Best regards,
> Dmitry
>

RE: Drill join performance

Reply via email to