Hello Tomas,

yes, of course I know that an 'ORDER BY' will actually increase the CPU
effort for the query.
However, my consideration was: Why is the original query running eternally?
My thoughts were: The IDs to be filtered against in the DELETE [...]
WHERE id NOT IN ([subselect]) were randomly ordered;
I know that working on a non-sorted data input is requiring much more
effort than just going for a quick sort and working on the product.
So there's the reason for that.

But let me get these 'explain's for you:
###########
ORIGINAL:
###########
spacewalk_1_5=# explain Delete from rhnPackageChangeLogData WHERE id NOT
IN ( SELECT changelog_data_id FROM rhnPackageChangeLogRec);
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Seq Scan on rhnpackagechangelogdata  (cost=108146.08..11134104860.94
rows=151984 width=6)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=108146.08..170240.97 rows=4465189 width=8)
           ->  Seq Scan on rhnpackagechangelogrec  (cost=0.00..86237.89
rows=4465189 width=8)
(5 rows)

################
MODIFICATION:
################
spacewalk_1_5=# explain Delete from rhnPackageChangeLogData WHERE id NOT
IN ( SELECT DISTINCT changelog_data_id FROM rhnPackageChangeLogRec ORDER
BY changelog_data_id ASC);
                                             QUERY PLAN

-------------------------------------------------------------------------------------------------
---
 Seq Scan on rhnpackagechangelogdata  (cost=104819.05..115188.64
rows=151984 width=6)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  Sort  (cost=104433.39..104626.22 rows=77132 width=8)
           Sort Key: rhnpackagechangelogrec.changelog_data_id
           ->  HashAggregate  (cost=97400.86..98172.18 rows=77132 width=8)
                 ->  Seq Scan on rhnpackagechangelogrec
(cost=0.00..86237.89 rows=4465189 width=
8)
(7 rows)
####################

Ok, so I went for an 'EXPLAIN ANALYZE', on three queries:
The Original, my 'original' modification and modification without the
'ORDER BY';
Turns out, without the ORDER BY, the query is another 600msec faster,
while the 'Explain Analyze' of the original is again *still* executing
(since 5++ Minutes).
So my bad for not testing without the ORDER BY - a hunch of mine in the
wrong direction.

Anyhow: Here are the 'EXPLAIN ANALYSE' of Mod, and Mod-without-ORDER-BY:

#############
Modified:
#############
spacewalk_1_5=# explain analyze Delete from rhnPackageChangeLogData
WHERE id NOT IN ( SELECT DISTINCT changelog_data_id FROM
rhnPackageChangeLogRec ORDER BY changelog_data_id ASC);

QUERY PLAN

-------------------------------------------------------------------------------------------------
-----------------------------------------------------
 Seq Scan on rhnpackagechangelogdata  (cost=104819.05..115188.64
rows=151984 width=6) (actual tim
e=3476.682..3476.682 rows=0 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  Sort  (cost=104433.39..104626.22 rows=77132 width=8) (actual
time=3149.620..3192.113 row
s=301937 loops=1)
           Sort Key: rhnpackagechangelogrec.changelog_data_id
           Sort Method:  quicksort  Memory: 26442kB
           ->  HashAggregate  (cost=97400.86..98172.18 rows=77132
width=8) (actual time=2473.983.
.2553.576 rows=301937 loops=1)
                 ->  Seq Scan on rhnpackagechangelogrec
(cost=0.00..86237.89 rows=4465189 width=
8) (actual time=0.008..730.139 rows=4466337 loops=1)
 Total runtime: 3483.172 ms
(9 rows)

############
Modified without ORDER-BY
############
spacewalk_1_5=# explain analyze Delete from rhnPackageChangeLogData
WHERE id NOT IN ( SELECT DISTINCT changelog_data_id FROM
rhnPackageChangeLogRec);

QUERY PLAN

-------------------------------------------------------------------------------------------------
-----------------------------------------------
 Seq Scan on rhnpackagechangelogdata  (cost=98365.01..108734.60
rows=151984 width=6) (actual time
=2869.294..2869.294 rows=0 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  HashAggregate  (cost=97400.86..98172.18 rows=77132 width=8)
(actual time=2494.079..2577.
750 rows=301937 loops=1)
           ->  Seq Scan on rhnpackagechangelogrec  (cost=0.00..86237.89
rows=4465189 width=8) (ac
tual time=0.011..741.078 rows=4466337 loops=1)
 Total runtime: 2873.545 ms
(6 rows)

##############
Original...
##############
[no output yet, if it completes before my remote ssh is cut, I'll report
it.]

Best
-Jonathan

On 08/19/2011 02:54 PM, Tomas Lestach wrote:
Hello Jonathan,

Testdriving the two queries got me the following:
Original:    ....still running since 30minutes now (top-CPU time)
modified:  3184ms
The speed improvement is very nice. However here're my comments:
  - both DISTINCT and ORDER are syntactically correct for Oracle
  - ORDER BY cannot make any speed improvements, in opposite, it take some time 
to sort the query
  - regarding DISTINCT, I'm not sure, if it can have such a great impact on 
performance in this concrete case
    Could you, please, send explain plans for the original and the 'DISTINCT 
variant'?


Regards,
Tomas
--
Tomas Lestach
RHN Satellite Engineering, Red Hat

_______________________________________________
Spacewalk-devel mailing list
Spacewalk-devel@redhat.com
https://www.redhat.com/mailman/listinfo/spacewalk-devel



--
Jonathan Hoser, M.Sc.
Institute of Bioinformatics and System Biology

WWW: http://mips.helmholtz-muenchen.de



Helmholtz Zentrum München
Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
Ingolstädter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir´in Bärbel Brumme-Bothe
Geschäftsführer: Prof. Dr. Günther Wess und Dr. Nikolaus Blum
Registergericht: Amtsgericht München HRB 6466
USt-IdNr: DE 129521671

_______________________________________________
Spacewalk-devel mailing list
Spacewalk-devel@redhat.com
https://www.redhat.com/mailman/listinfo/spacewalk-devel

Reply via email to