Hello Tomas,
yes, of course I know that an 'ORDER BY' will actually increase the CPU
effort for the query.
However, my consideration was: Why is the original query running eternally?
My thoughts were: The IDs to be filtered against in the DELETE [...]
WHERE id NOT IN ([subselect]) were randomly ordered;
I know that working on a non-sorted data input is requiring much more
effort than just going for a quick sort and working on the product.
So there's the reason for that.
But let me get these 'explain's for you:
###########
ORIGINAL:
###########
spacewalk_1_5=# explain Delete from rhnPackageChangeLogData WHERE id NOT
IN ( SELECT changelog_data_id FROM rhnPackageChangeLogRec);
QUERY PLAN
----------------------------------------------------------------------------------------------
Seq Scan on rhnpackagechangelogdata (cost=108146.08..11134104860.94
rows=151984 width=6)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=108146.08..170240.97 rows=4465189 width=8)
-> Seq Scan on rhnpackagechangelogrec (cost=0.00..86237.89
rows=4465189 width=8)
(5 rows)
################
MODIFICATION:
################
spacewalk_1_5=# explain Delete from rhnPackageChangeLogData WHERE id NOT
IN ( SELECT DISTINCT changelog_data_id FROM rhnPackageChangeLogRec ORDER
BY changelog_data_id ASC);
QUERY PLAN
-------------------------------------------------------------------------------------------------
---
Seq Scan on rhnpackagechangelogdata (cost=104819.05..115188.64
rows=151984 width=6)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Sort (cost=104433.39..104626.22 rows=77132 width=8)
Sort Key: rhnpackagechangelogrec.changelog_data_id
-> HashAggregate (cost=97400.86..98172.18 rows=77132 width=8)
-> Seq Scan on rhnpackagechangelogrec
(cost=0.00..86237.89 rows=4465189 width=
8)
(7 rows)
####################
Ok, so I went for an 'EXPLAIN ANALYZE', on three queries:
The Original, my 'original' modification and modification without the
'ORDER BY';
Turns out, without the ORDER BY, the query is another 600msec faster,
while the 'Explain Analyze' of the original is again *still* executing
(since 5++ Minutes).
So my bad for not testing without the ORDER BY - a hunch of mine in the
wrong direction.
Anyhow: Here are the 'EXPLAIN ANALYSE' of Mod, and Mod-without-ORDER-BY:
#############
Modified:
#############
spacewalk_1_5=# explain analyze Delete from rhnPackageChangeLogData
WHERE id NOT IN ( SELECT DISTINCT changelog_data_id FROM
rhnPackageChangeLogRec ORDER BY changelog_data_id ASC);
QUERY PLAN
-------------------------------------------------------------------------------------------------
-----------------------------------------------------
Seq Scan on rhnpackagechangelogdata (cost=104819.05..115188.64
rows=151984 width=6) (actual tim
e=3476.682..3476.682 rows=0 loops=1)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Sort (cost=104433.39..104626.22 rows=77132 width=8) (actual
time=3149.620..3192.113 row
s=301937 loops=1)
Sort Key: rhnpackagechangelogrec.changelog_data_id
Sort Method: quicksort Memory: 26442kB
-> HashAggregate (cost=97400.86..98172.18 rows=77132
width=8) (actual time=2473.983.
.2553.576 rows=301937 loops=1)
-> Seq Scan on rhnpackagechangelogrec
(cost=0.00..86237.89 rows=4465189 width=
8) (actual time=0.008..730.139 rows=4466337 loops=1)
Total runtime: 3483.172 ms
(9 rows)
############
Modified without ORDER-BY
############
spacewalk_1_5=# explain analyze Delete from rhnPackageChangeLogData
WHERE id NOT IN ( SELECT DISTINCT changelog_data_id FROM
rhnPackageChangeLogRec);
QUERY PLAN
-------------------------------------------------------------------------------------------------
-----------------------------------------------
Seq Scan on rhnpackagechangelogdata (cost=98365.01..108734.60
rows=151984 width=6) (actual time
=2869.294..2869.294 rows=0 loops=1)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> HashAggregate (cost=97400.86..98172.18 rows=77132 width=8)
(actual time=2494.079..2577.
750 rows=301937 loops=1)
-> Seq Scan on rhnpackagechangelogrec (cost=0.00..86237.89
rows=4465189 width=8) (ac
tual time=0.011..741.078 rows=4466337 loops=1)
Total runtime: 2873.545 ms
(6 rows)
##############
Original...
##############
[no output yet, if it completes before my remote ssh is cut, I'll report
it.]
Best
-Jonathan
On 08/19/2011 02:54 PM, Tomas Lestach wrote:
Hello Jonathan,
Testdriving the two queries got me the following:
Original: ....still running since 30minutes now (top-CPU time)
modified: 3184ms
The speed improvement is very nice. However here're my comments:
- both DISTINCT and ORDER are syntactically correct for Oracle
- ORDER BY cannot make any speed improvements, in opposite, it take some time
to sort the query
- regarding DISTINCT, I'm not sure, if it can have such a great impact on
performance in this concrete case
Could you, please, send explain plans for the original and the 'DISTINCT
variant'?
Regards,
Tomas
--
Tomas Lestach
RHN Satellite Engineering, Red Hat
_______________________________________________
Spacewalk-devel mailing list
Spacewalk-devel@redhat.com
https://www.redhat.com/mailman/listinfo/spacewalk-devel
--
Jonathan Hoser, M.Sc.
Institute of Bioinformatics and System Biology
WWW: http://mips.helmholtz-muenchen.de
Helmholtz Zentrum München
Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
Ingolstädter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir´in Bärbel Brumme-Bothe
Geschäftsführer: Prof. Dr. Günther Wess und Dr. Nikolaus Blum
Registergericht: Amtsgericht München HRB 6466
USt-IdNr: DE 129521671
_______________________________________________
Spacewalk-devel mailing list
Spacewalk-devel@redhat.com
https://www.redhat.com/mailman/listinfo/spacewalk-devel