Re: Optimizing SQL Query

2015-03-09 Thread anamika gupta
Please find the query plan below:

scala> sqlContext.sql("""SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS
AVG_SDP_USAGE FROM (SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE)
AS SDP_USAGE FROM (SELECT * FROM date_d AS dd JOIN interval_f AS intf ON
intf.DATE_WID = dd.WID WHERE intf.DATE_WID >= 20150101 AND intf.DATE_WID <=
20150110 AND CAST(INTERVAL_END_TIME AS STRING) >= '2015-01-01 00:00:00.000'
AND CAST(INTERVAL_END_TIME AS STRING) <= '2015-01-10 00:00:00.000' AND
MEAS_WID = 3) AS test JOIN sdp_d AS sdp ON test.SDP_WID = sdp.WID WHERE
sdp.UDC_ID = 'SP-168451834' GROUP BY sdp.WID, DAY_OF_WEEK, HOUR) AS dw
GROUP BY dw.DAY_OF_WEEK, dw.HOUR""")


q2: org.apache.spark.sql.SchemaRDD = SchemaRDD[36] at RDD at
SchemaRDD.scala:103

== Query Plan ==
== Physical Plan ==

Aggregate false, [DAY_OF_WEEK#3,HOUR#43L], [DAY_OF_WEEK#3,HOUR#43L,(CAST(SUM(PartialSum#133), DoubleType) / CAST(SUM(PartialCount#134L), DoubleType)) AS AVG_SDP_USAGE#126]
 Exchange (HashPartitioning [DAY_OF_WEEK#3,HOUR#43L], 200)
  Aggregate true, [DAY_OF_WEEK#3,HOUR#43L], [DAY_OF_WEEK#3,HOUR#43L,COUNT(SDP_USAGE#130) AS PartialCount#134L,SUM(SDP_USAGE#130) AS PartialSum#133]
   Project [DAY_OF_WEEK#3,HOUR#43L,SDP_USAGE#130]
    Aggregate false, [WID#49,DAY_OF_WEEK#3,HOUR#43L], [WID#49,DAY_OF_WEEK#3,HOUR#43L,SUM(PartialSum#136) AS SDP_USAGE#130]
     Exchange (HashPartitioning [WID#49,DAY_OF_WEEK#3,HOUR#43L], 200)
      Aggregate true, [WID#49,DAY_OF_WEEK#3,HOUR#43L], [...
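The plan shows the filters on interval_f being evaluated before the two shuffle-and-aggregate stages, which is the main lever here: the fewer fact rows that reach the join and the exchanges, the less data is shuffled. As a minimal, self-contained sketch of that idea (plain Scala collections with made-up row shapes and values, not Spark API):

```scala
object FilterBeforeJoin {
  case class Fact(dateWid: Int, sdpWid: Int, value: Double) // stands in for interval_f
  case class Sdp(wid: Int, udcId: String)                   // stands in for sdp_d

  // Apply the selective predicates to the big side first, then hash-join
  // the survivors against the (already filtered) small side.
  def totalUsage(facts: Seq[Fact], sdps: Seq[Sdp], lo: Int, hi: Int, udc: String): Double = {
    val byWid = sdps.filter(_.udcId == udc).map(s => s.wid -> s).toMap
    facts.iterator
      .filter(f => f.dateWid >= lo && f.dateWid <= hi) // date-range filter applied early
      .filter(f => byWid.contains(f.sdpWid))           // join touches only surviving rows
      .map(_.value)
      .sum
  }

  def main(args: Array[String]): Unit = {
    val facts = Seq(Fact(20150101, 1, 2.0), Fact(20150105, 2, 3.0), Fact(20150201, 1, 9.0))
    val sdps  = Seq(Sdp(1, "SP-168451834"), Sdp(2, "OTHER"))
    // Only the first fact row is inside the date range AND matches the UDC filter.
    println(totalUsage(facts, sdps, 20150101, 20150110, "SP-168451834")) // prints 2.0
  }
}
```

One related, cheap thing to check: the `CAST(INTERVAL_END_TIME AS STRING)` comparisons force a per-row cast on the large table; comparing against timestamp literals directly (if the column is a timestamp) avoids that work.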


Re: Optimizing SQL Query

2015-03-06 Thread daniel queiroz
Dude,

please attach the execution plan of the query and details about the
indexes.



2015-03-06 9:07 GMT-03:00 anu anamika.guo...@gmail.com:

 I have the following query. Could you give me some pointers on how to
 start optimizing it for Spark SQL?


 sqlContext.sql("""
 SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE
 FROM (
       SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS SDP_USAGE
       FROM (
             SELECT *
             FROM date_d dd JOIN interval_f intf
             ON intf.DATE_WID = dd.WID
             WHERE intf.DATE_WID >= 20141116 AND intf.DATE_WID <= 20141125
               AND CAST(INTERVAL_END_TIME AS STRING) >= '2014-11-16 00:00:00.000'
               AND CAST(INTERVAL_END_TIME AS STRING) <= '2014-11-26 00:00:00.000'
               AND MEAS_WID = 3
            ) test JOIN sdp_d sdp
       ON test.SDP_WID = sdp.WID
       WHERE sdp.UDC_ID = 'SP-1931201848'
       GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID
      ) dw
 GROUP BY dw.DAY_OF_WEEK, dw.HOUR""")



 Currently the query takes 15 minutes to execute. The interval_f table
 holds approx. 170 GB of raw data, date_d about 170 MB, and sdp_d about 490 MB.
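With interval_f at ~170 GB but the two dimension tables well under 1 GB, a map-side (broadcast) join is the usual next lever: ship the small tables to every executor and stream the fact rows through an in-memory lookup, instead of shuffling the fact table for a sort/hash join (in Spark SQL this is influenced by the `spark.sql.autoBroadcastJoinThreshold` setting, which must exceed the dimension-table size for the planner to choose it). A minimal, self-contained sketch of the mechanism, using plain Scala collections and made-up data rather than the Spark API:

```scala
object BroadcastJoinSketch {
  // Build a lookup from the small dimension table once (the "broadcast" copy),
  // then stream the big fact side through it; the fact rows never need to be
  // re-partitioned/shuffled to meet their matching dimension rows.
  def joinWithSmallDim[K, F, D](facts: Iterator[F], dim: Seq[D])
                               (factKey: F => K, dimKey: D => K): Iterator[(F, D)] = {
    val lookup = dim.map(d => dimKey(d) -> d).toMap
    facts.flatMap(f => lookup.get(factKey(f)).map(d => (f, d))) // inner join semantics
  }

  def main(args: Array[String]): Unit = {
    val dim   = Seq((1, "SP-A"), (2, "SP-B"))          // small side: (wid, udcId)
    val facts = Iterator((1, 10.0), (3, 5.0), (2, 7.0)) // big side: (sdpWid, value)
    val out = joinWithSmallDim(facts, dim)(_._1, _._1).toList
    println(out.size) // prints 2 -- fact key 3 has no dimension row
  }
}
```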



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-SQL-Query-tp21948.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
