Re: Re[2]: Spark 2.0: SQL runs 5x slower when adding 29th field to aggregation.

2016-09-02 Thread Mich Talebzadeh
Since you are using the Spark Thrift Server (which is in turn based on the
Hive Thrift Server), I suspect it uses the Hive optimiser, which would mean
that stats do matter. However, that may just be an assumption.

Have you partitioned these parquet tables?

Is it worth logging in to Hive and running the same queries there with
EXPLAIN EXTENDED SELECT ...? Can you see whether the relevant partition is
picked up?
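
For example, something along these lines from beeline against HiveServer2 (a
sketch only; the table name and partition column are taken from your queries
below):

-- check how the table is partitioned
SHOW PARTITIONS `slicer`.`573_slicer_rnd_13`;

-- the EXTENDED output lists the input paths, so you can verify
-- that only the dt='2016-07-28' partition is actually scanned
EXPLAIN EXTENDED
SELECT `advertiser_id`, SUM(`dd_convs`)
FROM `slicer`.`573_slicer_rnd_13`
WHERE dt = '2016-07-28'
GROUP BY `advertiser_id`;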

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 September 2016 at 12:03, Сергей Романов wrote:


Re[2]: Spark 2.0: SQL runs 5x slower when adding 29th field to aggregation.

2016-09-02 Thread Сергей Романов

Hi, Mich,

Column x29 does not seem to be special in any way. It's a newly created table
and I did not compute stats for any columns. Actually, I can sum a single
column several times in one query and hit a landslide performance drop at some
"magic" point. Setting "set spark.sql.codegen.wholeStage=false" makes all
queries run in a similarly slow time (slower than the original time).
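
For reference, I am toggling it per session in beeline against the Spark
Thrift Server (just a sketch; the property could equally be set in
spark-defaults.conf or passed with --conf when the Thrift Server is started):

set spark.sql.codegen.wholeStage=false;
-- re-run the query, then restore the default
set spark.sql.codegen.wholeStage=true;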
PS. Do stats even help Spark queries?
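(If they do, I suppose it would be something like these Hive statements; a
sketch, using the partition from the queries below:)

ANALYZE TABLE `slicer`.`573_slicer_rnd_13` PARTITION (dt='2016-07-28') COMPUTE STATISTICS;
ANALYZE TABLE `slicer`.`573_slicer_rnd_13` PARTITION (dt='2016-07-28') COMPUTE STATISTICS FOR COLUMNS;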
SELECT field, SUM(x28) FROM parquet_table WHERE partition = 1 GROUP BY field
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS 
`advertiser_id`,  SUM(`dd_convs`) AS `dd_convs`  FROM 
`slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id` 
 LIMIT 30;
30 rows selected (1.37 seconds)
30 rows selected (1.382 seconds)
30 rows selected (1.399 seconds)

SELECT field, SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY field
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS 
`advertiser_id`,  SUM(`actual_dsp_fee`) AS `actual_dsp_fee`  FROM 
`slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id` 
 LIMIT 30; 
30 rows selected (1.379 seconds)
30 rows selected (1.382 seconds)
30 rows selected (1.377 seconds)

SELECT field, SUM(x28) repeated 40 times FROM parquet_table WHERE partition = 1 GROUP BY field -> 1.774s
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS 
`advertiser_id`, SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS 
`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)  FROM 
`slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id` 
 LIMIT 30;
30 rows selected (1.774 seconds)
30 rows selected (1.721 seconds)
30 rows selected (2.813 seconds)

SELECT field, SUM(x28) repeated 41 times FROM parquet_table WHERE partition = 1 GROUP BY field -> 7.314s
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS 
`advertiser_id`, SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS 
`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)  
FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY 
`advertiser_id`  LIMIT 30;
30 rows selected (7.314 seconds)
30 rows selected (7.27 seconds)
30 rows selected (7.279 seconds)

SELECT SUM(x28) repeated 57 times FROM parquet_table WHERE partition = 1 -> 1.378s
0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT SUM(`dd_convs`) AS 
`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) FROM 
`slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28';
plan  == Physical Plan ==
*HashAggregate(keys=[], functions=[sum(dd_convs#159025L), 
sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), 
sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), 
sum(dd_convs#159025L), sum(dd_convs#159025L),