Just for info : After recreating tables with explicit columns CASTing have double performace of this query (from 60 to 35 sec.)
Best regards, Dmitry > Hello > > I have load (as CTAS) into parquet-files StarShema Benchmark generated > csv-data (scale factor 50) > > For one of bencmark query's like : > > select > d.d_year, > c.c_region, > sum(l.lo_extendedprice*l.lo_discount) as revenue > from dfs.tpch.lineorder_part l, > dfs.tpch.dates d, > dfs.tpch.customer c > where l.lo_orderdate = d.d_datekey > and l.lo_custkey = c.c_custkey > and d.d_year=1995 > group by d.d_year, c.c_region > order by d.d_year desc, c.c_region asc; > > got min. exec time of 59 sec. > > Table LINEORDER have 300M rows and partitioned by LO_ORDERDATE column (2406 > partitions in related parquet-files) > Table CUSTOMER have 1.5M rows and table DATES have 2556 rows, both tables > not partitioned > > Drill 1.5 conf. have : > > drill-env.sh : > DRILL_MAX_DIRECT_MEMORY="16G" > DRILL_HEAP="8G" > > sys.options changed : > > planner.memory.max_query_memory_per_node = 8 000 000 000 > planner.memory_limit = 1 000 000 000 > planner.width.max_per_node = 16 (was 12 by default) > > Drill is installed on 16VCPU Linux VM and under query runtime all 16VCPU's > 100% utilized. > > Is there any chance to improve this query exectime (my be with some > additional sys.options changes) ? > > Thank's! > > P.S. Just two days as starting to learn and test Apache Drill > > Best regards, > Dmitry >
