Correction for item 1: Change "FAA" => "FCD1".

On Tue, Apr 21, 2015 at 3:11 PM, Hao Zhu <[email protected]> wrote:

> Hi Team,
>
> Besides the 300+ seconds planning time, here are the performance
> differences in execution phase also.
>
> In general, this SQL contains 2 fact table joins, and with another 10+
> dimension table joins.
> The 2 fact tables are:
>
>    - fact_agent_activity_detail_12m_partparq AS FAA
>    - fact_contact_detail_12m_partparq AS FCD1
>
> So here are the 2 major time differences:
> *1. PARQUET_ROW_GROUP_SCAN for FAA.*
> Impala spent 3s248ms to read FAA.
> Operator          #Hosts   Avg Time   Max Time    #Rows  Est. #Rows
> Peak Mem  Est. Peak Mem  Detail
> 15:HASH JOIN           4    2s728ms    2s774ms  145.81K      89.61K
>  531.03 MB        2.68 KB  RIGHT OUTER JOIN, PARTITIONED
> |--30:EXCHANGE         4    3.135ms   12.277ms  145.81K         192
>    0              0  HASH(FAA.Data_Source_Key,FA...
> |  00:SCAN HDFS        1    3s248ms    3s248ms  145.81K         192
>  107.93 MB      880.00 MB  default.fact_agent_activity...
> 29:EXCHANGE            4      9.8ms    9.826ms  399.94K      89.61K
>    0              0  HASH(FCD1.data_source_key,F...
> 01:SCAN HDFS           4    2s648ms    3s452ms  399.94K      89.61K
> 55.96 MB      960.00 MB  default.fact_contact_detail...
>
> Drill spent 26.934 seconds(Avg Process time) to read FAA.
> 02-xx-05 PARQUET_ROW_GROUP_SCAN 26.934
> 02-05       Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
> [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_0_0.parquet],
> ReadEntryWithPath [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_10_0.parquet],
> ReadEntryWithPath [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_11_0.parquet],
> ReadEntryWithPath [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_1_0.parquet],
> ReadEntryWithPath [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_2_0.parquet],
> ReadEntryWithPath [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_3_0.parquet],
> ReadEntryWithPath [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_4_0.parquet],
> ReadEntryWithPath [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_5_0.parquet],
> ReadEntryWithPath [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_6_0.parquet],
> ReadEntryWithPath [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_7_0.parquet],
> ReadEntryWithPath [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_8_0.parquet],
> ReadEntryWithPath [path=hdfs://
> 10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_9_0.parquet]
> <http://10.153.25.119:8020/user/root/fact_contact_detail_12m_partparq/2014/12/1_9_0.parquet%5D>],
> selectionRoot=/user/root/fact_contact_detail_12m_partparq, numFiles=12,
> columns=[`*`]]]) : rowType = (DrillRecordRow[*, Call_Start_Date_Time, dir0,
> dir1]): rowcount = 8960530.0, cumulative cost = {8960530.0 rows, 3.584212E7
> cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 606703
>
> Could you compare the performance of below 2 SQLs:
> * Impala:*
>  select count(*) from
>  (select * from fact_contact_detail_12m_partparq AS FCD1 where
> Call_Start_Date_Time between '2014-12-03 00:00:00' AND '2014-12-03
> 13:00:00' AND FCD1.Year_key = 20140000 and FCD1.month_key = 20141200) tmp;
> * Drill:*
>  select count(*) from
>  (select * from fact_contact_detail_12m_partparq AS FCD1 where
> Call_Start_Date_Time between '2014-12-03 00:00:00' AND '2014-12-03
> 13:00:00' AND FCD1.dir0 = 2014 and FCD1.dir1 = 12) tmp;
>
> *2. Drill is using broadcast join for the 2 fact table while impala is
> using shuffle join.*
>
> Impala's shuffle join for 2 fact tables took 2s728ms.
> 15:HASH JOIN           4    2s728ms    2s774ms  145.81K      89.61K
>  531.03 MB        2.68 KB  RIGHT OUTER JOIN, PARTITIONED
>
> Drill's broadcast join for 2 fact tables caused 16.035 seconds avg waiting
> time.
> 01-xx-46 UNORDERED_RECEIVER 16.035
>
> 01-46    BroadcastExchange : rowType = RecordType(ANY $f1, ANY $f2, ANY
> $f3, ANY ITEM, ANY ITEM4, ANY ITEM5, ANY ITEM6, ANY ITEM7, ANY ITEM8, ANY
> ITEM9, ANY ITEM10, ANY ITEM11, ANY ITEM12, ANY ITEM13, ANY ITEM14, ANY
> ITEM15): rowcount = 50402.98125, cumulative cost = {2.7032798943749998E7
> rows, 1.0056522825625E8 cpu, 0.0 io, 1.32128391168E10 network, 0.0
> memory}, id = 606708
>
> The reason why Drill chose broadcast is because it estimates row count is
> only 50K(Below planner.broadcast_threshold, default 1000K). However the
> actual row count to broadcast is much higher than estimation.
>
> Needs to check why the estimation is not that accurate in this case.
>
> Thanks,
> Hao
>
> On Tue, Apr 21, 2015 at 2:32 PM, Jinfeng Ni <[email protected]> wrote:
>
>> The query you used is a 15-table join query. We know that Drill's cost
>> based optimizer will see performance overhead increase significantly with
>> increased # of tables joined, due to the increased search space.  I'm not
>> surprised to see that you had 306.812 seconds for planning for such 15
>> table join.
>>
>> We plan to implement a heuristic planner in Drill, which will provide
>> another option, for large # of table join query.  Heuristic planner should
>> help in this case.
>>
>>
>> On Tue, Apr 21, 2015 at 1:48 PM, Sivasubramaniam, Latha <
>> [email protected]> wrote:
>>
>> > Hao,
>> >
>> > I have copied both query profiles to the link
>> >
>> https://drive.google.com/folderview?id=0ByB1-EsAGxA8fkRCSDhVcGtQS0NGTjRuSlpGelVLMkxIVVFxYXMtU2JtQ3FaN2t3UTZpUUE&usp=sharing
>> >
>> > Please let me know if you cannot access.
>> >
>> > Appreciate any help.
>> >
>> > Thanks,
>> > Latha
>> >
>> > -----Original Message-----
>> > From: Hao Zhu [mailto:[email protected]]
>> > Sent: Tuesday, April 21, 2015 11:31 AM
>> > To: [email protected]
>> > Subject: Re: Query performance Comparison - Drill and Impala
>> >
>> > Hi Latha,
>> >
>> > Do you have the complete SQL profiles for the same query on impala and
>> > Drill?
>> > For Impala, you can run "profile;" command after the SQL finished.
>> > For Drill, you can go to the web GUI, and copy/paste the complete "Full
>> > JSON Profile".
>> >
>> > I want to compare what is the major performance difference between them
>> > two.
>> >
>> > Thanks,
>> > Hao
>> >
>> > On Tue, Apr 21, 2015 at 10:43 AM, Sivasubramaniam, Latha <
>> > [email protected]> wrote:
>> >
>> > > Neeraja, I removed some columns from select list and shortened aliases
>> > > to get the query manageable and got the explain plan working finally.
>> > > The errors were really random.
>> > >
>> > > Jacques,
>> > >
>> > > Most of it seems to be Hashjoin and yes the planning time took 306
>> > > seconds.  Is there a way to improve the query performance.? Below is
>> > > the snapshot of what I see in the query plan and I have copied the
>> > > query also to the end of the email.
>> > >
>> > >
>> > > Resource_Group_Desc=[$7], Call_Action_Desc=[$20], ITEM7=[$9],
>> > > ITEM8=[$10], ITEM9=[$11], ITEM10=[$12], ITEM11=[$13], ITEM12=[$14],
>> > > ITEM13=[$15], ITEM0=[$16], ITEM14=[$17], ITEM15=[$18])
>> > > 01-29
>> > > HashJoin(condition=[=($19, $8)], joinType=[left])
>> > > 01-31
>> > > Project(Call_Id_Source=[$0], Agent_Key=[$1], Site_Key=[$2],
>> > > Workgroup_Key=[$3], Status_Key=[$4], Reason_Key=[$5],
>> > > Call_Type_Desc=[$6], Resource_Group_Desc=[$20], ITEM6=[$8],
>> > > ITEM7=[$9], ITEM8=[$10], ITEM9=[$11], ITEM10=[$12], ITEM11=[$13],
>> > > ITEM12=[$14], ITEM13=[$15], ITEM0=[$16], ITEM14=[$17], ITEM15=[$18])
>> > > 01-32
>> > > HashJoin(condition=[=($7, $19)], joinType=[left])
>> > > 01-34
>> > > Project(Call_Id_Source=[$0], Agent_Key=[$1], Site_Key=[$2],
>> > > Workgroup_Key=[$3], Status_Key=[$4], Reason_Key=[$5],
>> > > Call_Type_Desc=[$20], ITEM5=[$7], ITEM6=[$8], ITEM7=[$9], ITEM8=[$10],
>> > > ITEM9=[$11], ITEM10=[$12], ITEM11=[$13], ITEM12=[$14], ITEM13=[$15],
>> > > ITEM0=[$16], ITEM14=[$17],
>> > > ITEM15=[$18])
>> > > 01-35
>> > > HashJoin(condition=[=($6, $19)], joinType=[left])
>> > > 01-37
>> > > Project(Call_Id_Source=[$0], Agent_Key=[$1], Site_Key=[$2],
>> > > Workgroup_Key=[$3], Status_Key=[$4], Reason_Key=[$5], ITEM4=[$7],
>> > > ITEM5=[$8], ITEM6=[$9], ITEM7=[$10], ITEM8=[$11], ITEM9=[$12],
>> > > ITEM10=[$13], ITEM11=[$14], ITEM12=[$15], ITEM13=[$16], ITEM0=[$17],
>> > > ITEM14=[$18], ITEM15=[$19])
>> > > 01-38
>> > > HashJoin(condition=[=($6, $20)], joinType=[left])
>> > > 01-40
>> > > Project(Call_Id_Source=[$0], Agent_Key=[$2], Site_Key=[$3],
>> > > Workgroup_Key=[$4], Status_Key=[$5], Reason_Key=[$6], ITEM=[$7],
>> > > ITEM4=[$8], ITEM5=[$9], ITEM6=[$10], ITEM7=[$11], ITEM8=[$12],
>> > > ITEM9=[$13], ITEM10=[$14], ITEM11=[$15], ITEM12=[$16], ITEM13=[$17],
>> > > ITEM0=[$7], ITEM14=[$18], ITEM15=[$19])
>> > > 01-41
>> > > HashJoin(condition=[=($20, $1)], joinType=[left])
>> > > 01-43
>> > > Project(Call_Id_Source=[$1], Service_Key=[$3], Agent_Key=[$4],
>> > > Site_Key=[$5], Workgroup_Key=[$6], Status_Key=[$7], Reason_Key=[$8],
>> > > ITEM=[$15], ITEM4=[$16], ITEM5=[$17], ITEM6=[$18], ITEM7=[$19],
>> > > ITEM8=[$20], ITEM9=[$21], ITEM10=[$22], ITEM11=[$23], ITEM12=[$24],
>> > > ITEM13=[$25], ITEM14=[$26], ITEM15=[$27])
>> > > 01-45
>> > > HashJoin(condition=[AND(=($0, $12), =($1, $13), =($2, $14))],
>> > > joinType=[left])
>> > > 01-47
>> > > SelectionVectorRemover
>> > > 01-48
>> > > Filter(condition=[AND(>=($9, '2014-12-03 00:00:00'), <=($9,
>> > > '2014-12-03 13:00:00'), =($10, 2014), =($11, 12))])
>> > > 01-49
>> > > Project(Data_Source_Key=[$9], Call_Id_Source=[$4],
>> > > Sequence_Number=[$8], Service_Key=[$11], Agent_Key=[$10],
>> > > Site_Key=[$2], Workgroup_Key=[$5], Status_Key=[$3], Reason_Key=[$7],
>> > > Start_Date_Time=[$6], dir0=[$1],
>> > > dir1=[$0])
>> > > 01-50
>> > >   Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
>> > > [path=hdfs://
>> > >
>> 10.153.25.119:8020/user/root/fact_agent_activity_detail_12m_partparq/2
>> > > 014/12/1_0_0.parquet],
>> > > ReadEntryWithPath [path=hdfs://
>> > >
>> 10.153.25.119:8020/user/root/fact_agent_activity_detail_12m_partparq/2
>> > > 014/12/1_1_0.parquet],
>> > > ReadEntryWithPath [path=hdfs://
>> > >
>> 10.153.25.119:8020/user/root/fact_agent_activity_detail_12m_partparq/2
>> > > 014/12/1_2_0.parquet],
>> > > ReadEntryWithPath [path=hdfs://
>> > >
>> 10.153.25.119:8020/user/root/fact_agent_activity_detail_12m_partparq/2
>> > > 014/12/1_3_0.parquet],
>> > > ReadEntryWithPath [path=hdfs://
>> > >
>> 10.153.25.119:8020/user/root/fact_agent_activity_detail_12m_partparq/2
>> > > 014/12/1_4_0.parquet],
>> > > ReadEntryWithPath [path=hdfs://
>> > >
>> 10.153.25.119:8020/user/root/fact_agent_activity_detail_12m_partparq/2
>> > > 014/12/1_5_0.parquet],
>> > > ReadEntryWithPath [path=hdfs://
>> > >
>> 10.153.25.119:8020/user/root/fact_agent_activity_detail_12m_partparq/2
>> > > 014/12/1_6_0.parquet],
>> > > ReadEntryWithPath [path=hdfs://
>> > >
>> 10.153.25.119:8020/user/root/fact_agent_activity_detail_12m_partparq/2
>> > > 014/12/1_7_0.parquet],
>> > > ReadEntryWithPath [path=hdfs://
>> > >
>> 10.153.25.119:8020/user/root/fact_agent_activity_detail_12m_partparq/2
>> > > 014/12/1_8_0.parquet],
>> > > ReadEntryWithPath [path=hdfs://
>> > >
>> 10.153.25.119:8020/user/root/fact_agent_activity_detail_12m_partparq/2
>> > > 014/12/1_9_0.parquet]],
>> > > selectionRoot=/user/root/fact_agent_activity_detail_12m_partparq,
>> > > numFiles=10, columns=[`Data_Source_Key`, `Call_Id_Source`,
>> > > `Sequence_Number`, `Service_Key`, `Agent_Key`, `Site_Key`,
>> > > `Workgroup_Key`, `Status_Key`, `Reason_Key`, `Start_Date_Time`,
>> > > `dir0`, `dir1`]]])
>> > > 01-46
>> > > BroadcastExchange
>> > > 02-01
>> > > Project($f1=[ITEM($0, 'Data_Source_Key')], $f2=[ITEM($0,
>> > > 'Call_Id_Source')], $f3=[ITEM($0, 'Sequence_Number')], ITEM=[ITEM($0,
>> > > 'Service_Key')], ITEM4=[ITEM($0, 'Call_Type_Key')], ITEM5=[ITEM($0,
>> > > 'Resource_Group_Key')], ITEM6=[ITEM($0, 'Call_Action_Key')],
>> > > ITEM7=[ITEM($0, 'Agent_Disp_Key')], ITEM8=[ITEM($0,
>> > > 'Second_Party_Agent_Key')], ITEM9=[ITEM($0,
>> > > 'Call_Action_Reason_Key')], ITEM10=[ITEM($0, 'ANI')], ITEM11=[ITEM($0,
>> > > 'DNIS')], ITEM12=[ITEM($0, 'Begin_Date_Time')], ITEM13=[ITEM($0,
>> > > 'Sequence_Number')], ITEM14=[ITEM($0, 'Dur_Short_Call')],
>> > > ITEM15=[ITEM($0, 'Call_Category_Key')])
>> > > 02-02
>> > > SelectionVectorRemover
>> > > 02-03
>> > >   Filter(condition=[AND(>=($1, '2014-12-03 00:00:00'), <=($1,
>> > > '2014-12-03 13:00:00'), =($2, 2014), =($3, 12))])
>> > > 02-04
>> > >     Project(T8¦¦*=[$0], Call_Start_Da |
>> > > +------------+------------+
>> > > 1 row selected (306.812 seconds)
>> > > 0: jdbc:drill:zk=rtr-poc-imp1:2181>
>> > >
>> > >
>> > > ************ Query ******************** SELECT DSQ.Site_Desc,
>> > > DWAE.Workgroup_Desc, DASAE.Agent_Status_Key, DCTCD.Call_Type_Desc,
>> > > DCA.Call_Action_Desc, FCD.ANI, FCD.DNIS, FCD.Begin_Date_Time,
>> > > DD.Disposition_Desc, DAAE.Full_Name ,
>> > >
>> > > DRGAE.Resource_Group_Desc, FCD.Sequence_Number, FCD.Service_Key,
>> > > FCD.Dur_Short_Call, FAA.Call_Id_Source, FCD.Call_Category_Key  FROM
>> > > fact_agent_activity_detail_12m_partparq AS FAA LEFT OUTER JOIN (select
>> > > * from fact_contact_detail_12m_partparq AS FCD1 where
>> > > Call_Start_Date_Time between '2014-12-03 00:00:00' AND '2014-12-03
>> > > 13:00:00' AND FCD1.dir0 = 2014 and FCD1.dir1 = 12
>> > >
>> > > ) AS FCD
>> > > ON FAA.Data_Source_Key = FCD.Data_Source_Key AND  FAA.Call_Id_Source =
>> > > FCD.Call_Id_Source AND FAA.Sequence_Number = FCD.Sequence_Number LEFT
>> > > OUTER JOIN DIM_Services_Parq AS DSAAD ON DSAAD.Service_Key =
>> > > FAA.Service_Key LEFT OUTER JOIN DIM_Services_Parq AS DSCD ON
>> > > FCD.Service_Key = DSCD.Service_Key LEFT OUTER JOIN DIM_Call_Types_Parq
>> > > AS DCTCD ON FCD.Call_Type_Key = DCTCD.Call_Type_Key LEFT OUTER JOIN
>> > > DIM_Resource_Groups_parq AS DRGAE ON FCD.Resource_Group_Key =
>> > > DRGAE.Resource_Group_Key LEFT OUTER JOIN DIM_Call_Actions_parq AS DCA
>> > > ON DCA.Call_Action_Key = FCD.Call_Action_Key INNER JOIN
>> > > DIM_Agents_Parquet AS DAAE ON FAA.Agent_Key = DAAE.Agent_Key INNER
>> > > JOIN DIM_Sites_Parq AS DSQ ON FAA.Site_Key = DSQ.Site_Key INNER JOIN
>> > > DIM_Workgroups_parq AS DWAE ON FAA.Workgroup_Key = DWAE.Workgroup_Key
>> > > LEFT OUTER JOIN    DIM_Dispositions_parq AS DD ON FCD.Agent_Disp_Key =
>> > > DD.Disposion_Key
>> > > LEFT OUTER JOIN DIM_Agents_Parquet AS DACD ON
>> > > FCD.Second_Party_Agent_Key = DACD.Agent_Key LEFT OUTER JOIN
>> > > DIM_Call_Action_Reasons_parq AS DAAR ON FCD.Call_Action_Reason_Key =
>> > > DAAR.Call_Action_Reason_Key LEFT OUTER JOIN DIM_Agent_Status_parq AS
>> > > DASAE ON FAA.Status_Key = DASAE.Agent_Status_Key LEFT OUTER JOIN
>> > > DIM_Agent_Status_Reasons_parq AS DARAE ON FAA.Reason_Key =
>> > > DARAE.Agent_Status_Reason_Key WHERE (DWAE.Workgroup_Key IN ( 3,1)
>> > >
>> > >  AND
>> > > (FAA.Start_Date_Time between '2014-12-03 00:00:00' AND '2014-12-03
>> > > 13:00:00' AND FAA.dir0 = 2014 and FAA.dir1 = 12));
>> > >
>> > >
>> > > Appreciate your help.
>> > >
>> > > Thanks,
>> > > Latha
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Neeraja Rentachintala [mailto:[email protected]]
>> > > Sent: Monday, April 20, 2015 9:22 PM
>> > > To: [email protected]
>> > > Subject: Re: Query performance Comparison - Drill and Impala
>> > >
>> > > Latha
>> > > What error are you getting when doing explain plan.
>> > >
>> > > On Mon, Apr 20, 2015 at 4:40 PM, Sivasubramaniam, Latha <
>> > > [email protected]> wrote:
>> > >
>> > > > I am unable to do an explain plan for this query.  Is there a way to
>> > > > pass a sql file for the Explain plan command. Query runs fine from
>> > > > the file with the !run command though. Will try to fix the errors
>> > > > and see if I can get the query plan.
>> > > >
>> > > > Thanks,
>> > > > latha
>> > > >
>> > > > -----Original Message-----
>> > > > From: Jacques Nadeau [mailto:[email protected]]
>> > > > Sent: Monday, April 20, 2015 4:12 PM
>> > > > To: [email protected]
>> > > > Subject: Re: Query performance Comparison - Drill and Impala
>> > > >
>> > > > Can you confirm you are getting broadcast joins for all the
>> > > > dimension tables? Additionally, can you confirm the planning time?
>> > > > I'm guessing both might be an issue given the times you are seeing.
>> > > > On Apr 20, 2015 6:46 PM, "Sivasubramaniam, Latha" <
>> > > > [email protected]> wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > I am running performance tests for our use cases on Impala and
>> > > > > Drill to make a choice and pick one for our Data warehouse
>> > > > > implementation on top of Hadoop.  I used the same cluster for both
>> > the tests.
>> > > > > Select Query that takes ~20 seconds in impala, takes around 440
>> > > > > seconds in drill on same hardware cluster (nearly 20 times more).
>> > > > > I would like to make sure that I am not missing anything in tuning
>> > > > > or
>> > > configuration.
>> > > > > Below is our configuration, appreciate comments on if these
>> > > > > performance numbers are as expected or if any tuning can be done
>> > > > > to
>> > > > improve drill performance.
>> > > > >
>> > > > > Hadoop Cluster configuration and data set:
>> > > > >
>> > > > >
>> > > > > *         5 nodes cluster, 1 name node and 4 data nodes, 32GB
>> memory
>> > ,
>> > > 4
>> > > > > cpus on each node.
>> > > > >
>> > > > > *         There are two main fact tables. 98 million records in
>> one
>> > and
>> > > > 87
>> > > > > million records in another. There are around 10 dimension tables.
>> > > > > For now, we are running the tests with normalized schema from out
>> > > > > data
>> > > > warehouse.
>> > > > >
>> > > > > *         Running the same query against Impala and Drill except
>> for
>> > > the
>> > > > > syntax differences everything is same.
>> > > > >
>> > > > > Impala Configuration
>> > > > >
>> > > > >
>> > > > > *         CDH 5.3.2 with Impala 2.1.2 version
>> > > > >
>> > > > > *         4 impala daemons running on 4 data nodes.
>> > > > >
>> > > > > *          Impala daemons memory set to 25GB.
>> > > > >
>> > > > > *         Partitioned parquet tables were created from text format
>> > hdfs
>> > > > > files using impala itself (CTAS command).
>> > > > >
>> > > > > *         Parquet block size 256MB
>> > > > >
>> > > > > Drill Configuration
>> > > > >
>> > > > >
>> > > > > *         Drill 0.8 version, added drill on the CDH cluster setup
>> > > > >
>> > > > > *         4 drillbits running on 4 data nodes
>> > > > >
>> > > > > *         Drillbits memory set to 16GB and heap memory set to
>> 8GB.  I
>> > > ran
>> > > > > with 8GB memory and 4GB for heap, there was no change in the query
>> > > > timings.
>> > > > >
>> > > > > *         Partitioned parquet files.
>> > > > >
>> > > > > SQL Query
>> > > > >
>> > > > > SELECT
>> > > > >
>> > > > >                 DIM_Sites_parq.Site_Desc,
>> > > > >
>> > > > >                 DIM_Workgroups_AgentEventDetails.Workgroup_Desc,
>> > > > >                 DIM_Agents_AgentEventDetails.Last_Name,
>> > > > >
>> > > > >                 DIM_Agents_AgentEventDetails.First_Name,
>> > > > >
>> > > > >
>> > > > >
>>  DIM_Agent_Status_AgentEventDetails.Agent_Status_Key,
>> > > > >                 DIM_CallTypes_ContactDetails.Call_Type_Desc,
>> > > > >
>> > > > >
>> > > > >
>> > > > >                 DIM_Call_Actions.Call_Action_Desc,
>> > > > >
>> > > > >                 FACT_Contact_Detail.ANI,
>> > > > >                 FACT_Contact_Detail.DNIS,
>> > > > >
>> > > > >                 FACT_Contact_Detail.Begin_Date_Time,
>> > > > >                 FACT_Contact_Detail.Call_Start_Date_Time,
>> > > > >
>> > > > >                 FACT_Contact_Detail.Orig_DNIS,
>> > > > >                 FACT_Agent_Activity_Detail.Start_Date_Time,
>> > > > >                 FACT_Agent_Activity_Detail.End_Date_Time,
>> > > > >                 FACT_Agent_Activity_Detail.Dur_Time,
>> > > > >
>> > > > >                 FACT_Agent_Activity_Detail.Sequence_Number,
>> > > > >
>> > > > >                 FACT_Contact_Detail.Begin_Date_Time,
>> > > > >                 FACT_Contact_Detail.Call_Start_Date_Time,
>> > > > >                 FACT_Agent_Activity_Detail.Start_Date_Time,
>> > > > >                 FACT_Agent_Activity_Detail.End_Date_Time,
>> > > > >
>> > > > >
>> > > > >                 DIM_Dispositions.Disposition_Desc,
>> > > > >                 DIM_Agents_AgentEventDetails.Full_Name ,
>> > > > >
>> > > > >
>> > > > > DIM_Resource_Groups_AgentEventDetails.Resource_Group_Desc,
>> > > > >
>> > > > >                 FACT_Contact_Detail.Sequence_Number,
>> > > > >
>> > > > >                 FACT_Contact_Detail.Service_Key,
>> > > > >                 FACT_Contact_Detail.Station,
>> > > > >                 FACT_Contact_Detail.Near_Off_Hook_Date_Time,
>> > > > >                 FACT_Contact_Detail.Far_Off_Hook_Date_Time,
>> > > > >                 FACT_Contact_Detail.Near_On_Hook_Date_Time,
>> > > > >                 FACT_Contact_Detail.Far_On_Hook_Date_Time,
>> > > > >                 FACT_Contact_Detail.Begin_Greeting_Date_Time,
>> > > > >
>> > > > >                 FACT_Contact_Detail.End_Guard_Date_Time,
>> > > > >                 FACT_Contact_Detail.Resource_Group_Type,
>> > > > >                 FACT_Contact_Detail.Info_Digits,
>> > > > >                 FACT_Contact_Detail.Agent_Site_Key,
>> > > > >                 FACT_Contact_Detail.Agent_Switch_Key,
>> > > > >
>> > > > >                 FACT_Contact_Detail.Line_Number,
>> > > > >                 FACT_Contact_Detail.Queue_Start_Date_Time,
>> > > > >
>> > > > >                 FACT_Contact_Detail.CallQ_Start_Date_Time,
>> > > > >                 FACT_Contact_Detail.Queue_End_Date_Time,
>> > > > >                 FACT_Contact_Detail.Dur_Queue_Time,
>> > > > >                 FACT_Contact_Detail.Conn_Clear_Date_Time,
>> > > > >                 FACT_Contact_Detail.Wrap_End_Date_Time,
>> > > > >
>> > > > > FACT_Contact_Detail.Answer_Flag,
>> > > > >
>> > > > > FACT_Contact_Detail.Call_End_Date_Time,
>> > > > >  FACT_Contact_Detail.Script_Id_Source,
>> > > > >
>> > > > >   FACT_Contact_Detail.IPNIQ_ASBR_Flag,
>> > > > >  FACT_Contact_Detail.Reserved_Start_Date_Time,
>> > > > >
>> > > > >   FACT_Contact_Detail.Reserved_End_Date_Time,
>> > > > >  FACT_Contact_Detail.Dur_IPNIQ_Q_Time,
>> > > > >  FACT_Contact_Detail.SeqLeg_Number,
>> > > > >
>> > > > > FACT_Contact_Detail.Num_Accept_Responses,
>> > > > >  FACT_Contact_Detail.Num_Reject_Responses,
>> > > > >  FACT_Contact_Detail.Preview_Start_Date_Time,
>> > > > >
>> > > > >   FACT_Contact_Detail.Preview_End_Date_Time,
>> > > > >
>> > > > > FACT_Contact_Detail.Detection_Date_Time,
>> > > > >  FACT_Contact_Detail.Answer_Date_Time,
>> > > > >  FACT_Contact_Detail.Record_Number,
>> > > > >  FACT_Contact_Detail.Detection_Type,
>> > > > >  FACT_Contact_Detail.Calling_Party_Name,
>> > > > >  FACT_Contact_Detail.Num_Phone_Rings,
>> > > > >
>> > > > >   FACT_Contact_Detail.End_Greeting_Date_Time,
>> > > > >  FACT_Contact_Detail.Phone_Start_Ringing_Date_Time,
>> > > > >
>> > > > >   FACT_Contact_Detail.Begin_Greeting_Date_Time,
>> > > > >  FACT_Contact_Detail.Begin_Msg_Playback_Date_Time,
>> > > > >  FACT_Contact_Detail.Xfer_Command_Date_Time,
>> > > > >  FACT_Contact_Detail.Connect_Date_Time,
>> > > > >  FACT_Contact_Detail.Second_Party_Park_Flag,
>> > > > >  FACT_Contact_Detail.Dur_Short_Call,
>> > > > >  FACT_Agent_Activity_Detail.Call_Id_Source,
>> > > > >
>> > > > >   FACT_Contact_Detail.Call_Category_Key,
>> > > > >
>> > > > >   DIM_Agents_AgentEventDetailsSecondParty.User_Id_Source
>> > > > >
>> > > > >   FROM
>> > > > >     hdfs.rootparq.`fact_agent_activity_detail_12m_partparq` AS
>> > > > > FACT_Agent_Activity_Detail
>> > > > >
>> > > > >                 LEFT OUTER JOIN
>> > > > >
>> > > > >  (select * from hdfs.rootparq.`fact_contact_detail_12m_partparq`
>> > > > > where Call_Start_Date_Time
>> > > > >
>> > > > > between
>> > > > >
>> > > > > '2014-12-03 00:00:00' AND '2014-12-03 13:00:00' AND
>> > > > > fact_contact_detail_12m_partparq.dir0 = 2014 and
>> > > > > fact_contact_detail_12M_partParq.dir1 = 12 ) AS
>> > > > > FACT_Contact_Detail ON
>> > > > >        FACT_Agent_Activity_Detail.Data_Source_Key =
>> > > > > FACT_Contact_Detail.Data_Source_Key
>> > > > >                 AND
>> > > > >        FACT_Agent_Activity_Detail.Call_Id_Source =
>> > > > > FACT_Contact_Detail.Call_Id_Source
>> > > > >                 AND
>> > > > >        FACT_Agent_Activity_Detail.Sequence_Number =
>> > > > > FACT_Contact_Detail.Sequence_Number
>> > > > >
>> > > > >
>> > > > >                 LEFT OUTER JOIN
>> > > > >                    DIM_Services_Parq AS
>> > > > > DIM_Services_Agent_Activity_Detail
>> > > > > ON DIM_Services_Agent_Activity_Detail.Service_Key =
>> > > > > FACT_Agent_Activity_Detail.Service_Key
>> > > > >
>> > > > >                 LEFT OUTER JOIN
>> > > > >     DIM_Services_Parq AS DIM_Services_Contact_Detail ON
>> > > > > FACT_Contact_Detail.Service_Key =
>> > > > > DIM_Services_Contact_Detail.Service_Key
>> > > > >
>> > > > >                 LEFT OUTER JOIN
>> > > > >     DIM_Call_Types_Parq AS DIM_CallTypes_ContactDetails ON
>> > > > >        FACT_Contact_Detail.Call_Type_Key =
>> > > > > DIM_CallTypes_ContactDetails.Call_Type_Key
>> > > > >
>> > > > >                 LEFT OUTER JOIN
>> > > > >     DIM_Resource_Groups_parq AS
>> > > > > DIM_Resource_Groups_AgentEventDetails
>> > > ON
>> > > > >        FACT_Contact_Detail.Resource_Group_Key =
>> > > > > DIM_Resource_Groups_AgentEventDetails.Resource_Group_Key
>> > > > >
>> > > > >
>> > > > >                 LEFT OUTER JOIN
>> > > > >     DIM_Call_Actions_parq AS DIM_Call_Actions ON
>> > > > > DIM_Call_Actions.Call_Action_Key =
>> > > > > FACT_Contact_Detail.Call_Action_Key
>> > > > >
>> > > > >                 INNER JOIN
>> > > > >         DIM_Agents_Parquet AS DIM_Agents_AgentEventDetails ON
>> > > > > FACT_Agent_Activity_Detail.Agent_Key =
>> > > > > DIM_Agents_AgentEventDetails.Agent_Key
>> > > > >
>> > > > >                 INNER JOIN
>> > > > >         DIM_Sites_Parq ON FACT_Agent_Activity_Detail.Site_Key =
>> > > > > DIM_Sites_Parq.Site_Key
>> > > > >
>> > > > >                 INNER JOIN
>> > > > >         DIM_Workgroups_parq AS DIM_Workgroups_AgentEventDetails ON
>> > > > >        FACT_Agent_Activity_Detail.Workgroup_Key =
>> > > > > DIM_Workgroups_AgentEventDetails.Workgroup_Key
>> > > > >
>> > > > >                 LEFT OUTER JOIN
>> > > > >     DIM_Dispositions_parq AS DIM_Dispositions ON
>> > > > > FACT_Contact_Detail.Agent_Disp_Key =
>> > > > > DIM_Dispositions.Disposion_Key
>> > > > >
>> > > > >                 LEFT OUTER JOIN
>> > > > >     DIM_Agents_Parquet AS DIM_Agents_AgentEventDetailsSecondParty
>> ON
>> > > > >        FACT_Contact_Detail.Second_Party_Agent_Key =
>> > > > > DIM_Agents_AgentEventDetailsSecondParty.Agent_Key
>> > > > >
>> > > > >                 LEFT OUTER JOIN
>> > > > >     DIM_Call_Action_Reasons_parq AS DIM_Call_Action_Reasons ON
>> > > > > FACT_Contact_Detail.Call_Action_Reason_Key =
>> > > > > DIM_Call_Action_Reasons.Call_Action_Reason_Key
>> > > > >
>> > > > >                 LEFT OUTER JOIN
>> > > > >     DIM_Agent_Status_parq AS DIM_Agent_Status_AgentEventDetails ON
>> > > > >        FACT_Agent_Activity_Detail.Status_Key =
>> > > > > DIM_Agent_Status_AgentEventDetails.Agent_Status_Key
>> > > > >
>> > > > >                 LEFT OUTER JOIN
>> > > > >     DIM_Agent_Status_Reasons_parq AS
>> > > > > DIM_Agent_Reason_AgentEventDetails
>> > > > ON
>> > > > >        FACT_Agent_Activity_Detail.Reason_Key =
>> > > > > DIM_Agent_Reason_AgentEventDetails.Agent_Status_Reason_Key
>> > > > >
>> > > > >
>> > > > > WHERE
>> > > > >  (
>> > > > >
>> > > > > DIM_Workgroups_AgentEventDetails.Workgroup_Key  IN  ( 3,1)
>> > > > >
>> > > > >   AND
>> > > > >
>> > > > >
>> > > > >    ( FACT_Agent_Activity_Detail.Start_Date_Time
>> > > > >
>> > > > > between
>> > > > > '2014-12-03 00:00:00' AND '2014-12-03 13:00:00' AND
>> > > > > FACT_Agent_Activity_Detail.dir0 = 2014 and
>> > > > > FACT_Agent_Activity_Detail.dir1 =
>> > > > >
>> > > > > 12
>> > > > >    )
>> > > > >
>> > > > >   );
>> > > > >
>> > > > >
>> > > > > Thanks,
>> > > > > Latha
>> > > > >
>> > > > >
>> > > > > This email (including any attachments) is proprietary to Aspect
>> > > > > Software, Inc. and may contain information that is confidential.
>> > > > > If you have received this message in error, please do not read,
>> > > > > copy or
>> > > > forward this message.
>> > > > > Please notify the sender immediately, delete it from your system
>> > > > > and destroy any copies. You may not further disclose or distribute
>> > > > > this email or its attachments.
>> > > > >
>> > > > This email (including any attachments) is proprietary to Aspect
>> > > > Software, Inc. and may contain information that is confidential. If
>> > > > you have received this message in error, please do not read, copy or
>> > > forward this message.
>> > > > Please notify the sender immediately, delete it from your system and
>> > > > destroy any copies. You may not further disclose or distribute this
>> > > > email or its attachments.
>> > > >
>> > > This email (including any attachments) is proprietary to Aspect
>> > > Software, Inc. and may contain information that is confidential. If
>> > > you have received this message in error, please do not read, copy or
>> > forward this message.
>> > > Please notify the sender immediately, delete it from your system and
>> > > destroy any copies. You may not further disclose or distribute this
>> > > email or its attachments.
>> > >
>> >
>>
>
>

Reply via email to