[SQL] Need another way to do this, my sql much to slow...
I am attempting to write a spam management add-on for my corporate server. I have the database and the apps written, but one part is WAY too slow to be useful.

The following view:

        Column         |  Type   | Modifiers
-----------------------+---------+-----------
 env_sender_num        | integer |
 msg_from_claims_count | bigint  |

is built on the following table:

     Column      |            Type             | Modifiers
-----------------+-----------------------------+-----------
 msg_sender_num  | integer                     |
 env_sender_num  | integer                     |
 msg_from_domain | character varying(255)      |
 env_from_domain | character varying(255)      |
 recorded_date   | timestamp without time zone |
Indexes: record_of_claims_env_sender_num_idx btree (env_sender_num),
         record_of_claims_msg_sender_num_idx btree (msg_sender_num)

with the following SQL:

SELECT record_of_claims.env_sender_num,
       count(DISTINCT record_of_claims.msg_sender_num) AS msg_from_claims_count
FROM record_of_claims
GROUP BY record_of_claims.env_sender_num;

A sample dataset follows:

 msg_sender_num | env_sender_num |    msg_from_domain    |    env_from_domain    |    recorded_date
----------------+----------------+-----------------------+-----------------------+---------------------
              1 |              1 | yahoo.com             | yahoo.com             | 2003-11-18 13:21:07
              2 |              2 | mx128.optinvc13y.com  | mx128.optinvc13y.com  | 2003-11-18 13:21:16
              3 |              3 | yahoo.com             | yahoo.com             | 2003-11-18 13:21:17
              4 |              4 | yahoo.com             | yahoo.com             | 2003-11-18 13:21:21
              5 |              5 | biomarketgroup.com    | biomarketgroup.com    | 2003-11-18 13:21:24
              6 |              6 | sohu.com              | sohu.com              | 2003-11-18 13:21:28
              7 |              7 | lycos.com             | lycos.com             | 2003-11-18 13:21:38
              8 |              8 | mail.expressrx.info   | mail.expressrx.info   | 2003-11-18 13:21:41
              9 |              9 | approveddeals.com     | approveddeals.com     | 2003-11-18 13:21:41
             10 |             10 | conceptholidays.co.uk | conceptholidays.co.uk | 2003-11-18 13:21:48

The msg_sender_num and env_sender_num come from another table of unique names of senders. What I am attempting to do is see how many times msg_sender_num 1 claims to be a different env_sender_num. So I have to find all the entries in msg_sender_num equal to 1 and build a count of the distinct numbers in env_sender_num.
This number is then used later to say that if a msg_sender_num claims to be more than 2 env_sender_nums, then the sender is a spammer and gets added to a list.

Everything is working fine except the SQL above. It takes WAY too long to process on a 50+ record database. Hell, it takes 12 seconds or so on a 5 record database. I have included the query plan to show that the indexes are being used.

Query Plan:

 Aggregate  (cost=0.00..166.16 rows=264 width=8) (actual time=0.98..7768.19 rows=62911 loops=1)
   ->  Group  (cost=0.00..159.57 rows=2635 width=8) (actual time=0.56..3179.14 rows=80466 loops=1)
         ->  Index Scan using record_of_claims_env_sender_num_idx on record_of_claims  (cost=0.00..152.99 rows=2635 width=8) (actual time=0.55..2240.15 rows=80466 loops=1)
 Total runtime: 7931.63 msec

Is there a better, read "faster", way to achieve this?

Jerry Wintrode
Very Newbie Postgres User

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
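One detail worth noting in the plan above: the planner estimated 2,635 rows for the index scan but got 80,466, which usually points to stale statistics. A minimal sketch of refreshing them and re-checking (not a guaranteed fix, just a first step):

```sql
-- Refresh planner statistics so row estimates match reality;
-- with accurate counts the planner may choose a cheaper plan
-- than the index scan shown above.
ANALYZE record_of_claims;

EXPLAIN ANALYZE
SELECT env_sender_num,
       count(DISTINCT msg_sender_num) AS msg_from_claims_count
FROM record_of_claims
GROUP BY env_sender_num;
```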
Re: [SQL] Need another way to do this, my sql much to slow...
> I am attempting to write a spam management add-on for my corporate
> server, I have the database and the apps written but one part is WAY
> too slow to be useful.
>
> [schema snipped]
>
> With the following SQL:
>
> SELECT record_of_claims.env_sender_num,
>        count(DISTINCT record_of_claims.msg_sender_num) AS msg_from_claims_count
> FROM record_of_claims
> GROUP BY record_of_claims.env_sender_num;
>
> A sample dataset follows:
> [snip]

Not sure if this can speed things up:

SELECT env_sender_num,
       COUNT(msg_sender_num) AS msg_from_claims_count
FROM ( SELECT DISTINCT ON (msg_sender_num)
              msg_sender_num, env_sender_num
       FROM record_of_claims
       ORDER BY msg_sender_num, env_sender_num DESC
     ) foo
GROUP BY env_sender_num;

but possibly it inspires you or someone else to a better one.

Regards,
Christoph
Re: [SQL] Need another way to do this, my sql much to slow...
Let me give you a better dataset:
 msg_sender_num | env_sender_num
----------------+----------------
            118 |          53003
            118 |          51778
            118 |          49679
            118 |          49457
            118 |          37434
            118 |          37389
            118 |          33644
            118 |          33609
            118 |          26043
            118 |          26004
            118 |          24288
            118 |          23357
            118 |          16246
            118 |          16103
            118 |          12967
            118 |          12140
            118 |           4191
            118 |            122
            118 |            860
with the SQL:
SELECT record_of_claims.msg_sender_num,
       count(DISTINCT record_of_claims.env_sender_num) AS env_from_claims_count
FROM record_of_claims
WHERE record_of_claims.msg_sender_num = 118
GROUP BY record_of_claims.msg_sender_num;
You get:
 msg_sender_num | env_from_claims_count
----------------+-----------------------
            118 |                    19
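To pull every over-threshold sender in one pass, rather than querying one msg_sender_num at a time, a HAVING clause is one option; a sketch, using the "more than 2" threshold mentioned earlier:

```sql
-- List every msg_sender_num that claims more than 2 distinct
-- env_sender_nums, i.e. the blacklist candidates, in one query.
SELECT msg_sender_num,
       count(DISTINCT env_sender_num) AS env_from_claims_count
FROM record_of_claims
GROUP BY msg_sender_num
HAVING count(DISTINCT env_sender_num) > 2;
```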
Which is correct for the following reason:
 msg_sender_num | envelope_from
----------------+---------------
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
118 | [EMAIL PROTECTED]
19 different envelope-from names, all claiming to be the same message
from: 118 ("TopOffers [EMAIL PROTECTED]").
All of the above addresses would be added to the blacklist for 120 days.
If I, say, every 15 minutes or so create a new table full of the results
of the SQL view, it sort of solves my problem. It still takes forever to
process, but the next view that needs these results can do an index scan
on the resulting table and not have to build the list all over again. In
this way I can cut my processing time to 6/10ths of a second. But I have
to create/drop/rename tables on a time interval. Not the best solution.
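The create/drop/rename cycle described above could be sketched roughly like this (the summary-table name `claim_counts` and the index name are assumptions, not from the original post):

```sql
-- Periodic job (e.g. every 15 minutes): materialize the counts into a
-- fresh table, then swap it in so readers always see a complete copy.
BEGIN;

CREATE TABLE claim_counts_new AS
    SELECT msg_sender_num,
           count(DISTINCT env_sender_num) AS env_from_claims_count
    FROM record_of_claims
    GROUP BY msg_sender_num;

DROP TABLE claim_counts;                 -- discard the previous snapshot
ALTER TABLE claim_counts_new RENAME TO claim_counts;

COMMIT;

-- The old index died with the dropped table, so a fixed name is free again.
CREATE INDEX claim_counts_idx ON claim_counts (msg_sender_num);
```

Wrapping the swap in a transaction keeps readers from ever seeing a half-built table.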
Jerry Wintrode
Network Administrator
Tripos, Inc.
---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [SQL] Need another way to do this, my sql much to slow...
Ok, I figured that part out by simply changing the way I'm doing the
query, and writing a function to handle the reply. But it will introduce
another problem. How do I pass special characters, any character, to a
function like this:
select msg_2_env('"Ann's Free Gifts & Coupons"
<[EMAIL PROTECTED]>');
As you can see the message from name is:
"Ann's Free Gifts & Coupons" [EMAIL PROTECTED]
I need that whole string to match, including the ", &, @, and yes, the
single quote in Ann's. Passed as a variable this should not be a
problem, I think, but how do I test this on the command line with psql?
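For the single quote specifically, SQL string literals escape it by doubling it; the ", &, and @ need no escaping inside '...'. So the call could be tested from psql like this (address left elided as in the original):

```sql
-- '' inside a string literal produces a single ' in the value passed in.
SELECT msg_2_env('"Ann''s Free Gifts & Coupons" <[EMAIL PROTECTED]>');
```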
Oh, here is the simple function in case anyone cares to have it... very
simple. Now processing about 10 records takes 1 ms, down from the
12-15 seconds. WooHoo. Just that other little issue... hehehe.
CREATE FUNCTION msg_2_env (text) RETURNS int4 AS '
DECLARE
    intext ALIAS FOR $1;
    result int4;
BEGIN
    result := ( SELECT count(DISTINCT record_of_claims.msg_sender_num) AS mclaim_count
                FROM record_of_claims
                WHERE record_of_claims.env_sender_num =
                      ( SELECT env_from_senders.env_sender_num
                        FROM env_from_senders
                        WHERE env_from_senders.envelope_from = intext::character varying )
                GROUP BY record_of_claims.env_sender_num );
    RETURN result;
END;
' LANGUAGE 'plpgsql';
Jerry Wintrode
Network Administrator
Tripos, Inc.
Re: [SQL] FATAL 2: PageIndexTupleDelete
Yes, you're right! I used the pg_dump command and now it is working. Thank you very much!

PS. I have version 7.2.2 on RH 8.0.

> > FATAL 2: PageIndexTupleDelete: corrupted page pointers: lower = 12289,
> > upper = 1, special = 60672
> >
> > Why do I have this error message and how can I fix it?
>
> This is a data-corruption problem. Have you had any system crashes or
> other odd behavior lately?
>
> As far as recovering, you can probably fix the broken index with
> REINDEX, if the corruption doesn't extend further than the index that's
> being complained of. Otherwise you might have to resort to a dump and
> reload.
>
> You should also think about updating your Postgres installation. You
> didn't say what version you're running (naughty naughty, that's required
> info in any trouble report) but I can tell from the spelling of the
> error message that it's not a current release.
>
> regards, tom lane
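For reference, the REINDEX route suggested in the quoted reply looks like this (index and table names here are placeholders, not from the thread):

```sql
-- Rebuild just the suspect index:
REINDEX INDEX my_corrupt_index;

-- Or rebuild every index on a table at once:
REINDEX TABLE my_table;
```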
Re: [SQL] Expressional Indexes
"Randolf Richardson, DevNet SysOp 29" <[EMAIL PROTECTED]> writes: > For the application I will need to develop in the future, it would be > okay for this index to hold data more than 30 days old, and then be > redefined on a monthly basis along with regular database vacuuming (and > other maintenance). Okay ... > Could this be done with a "partial index" as follows? > CREATE INDEX my_index on my_table (create_date) > WHERE (create_date > age(timestamp '30 days')); No, because the above represents a moving cutoff; it will (and should) be rejected as a non-immutable predicate condition. You could do something like CREATE INDEX my_Nov_03_index on my_table (create_date) WHERE (create_date >= date '2003-11-01'); and then a month from now replace this with CREATE INDEX my_Dec_03_index on my_table (create_date) WHERE (create_date >= date '2003-12-01'); bearing in mind that this index can be used with queries that contain WHERE conditions like "create_date >= some-date-constant". The planner must be able to convince itself that the right-hand side of the WHERE condition is >= the cutoff in the index's predicate condition. Since the planner is not very bright, both items had better be simple DATE constants, or it won't be able to figure it out ... > So an "Expression Index" could, for example, be used to sort alpha- > numeric data in a case-insensitive manner? I just want to make sure I'm > understanding this correctly. It won't do anything you could not have done in prior releases using a custom-built function as the named function of a functional index. In that sense it's just a notational improvement rather than a real advance in capability. (Actually I guess there is an advance in capability: 7.4 lets you build a multi-column index on multiple expressions, whereas in prior releases a functional index could only have one index column. Not sure how important this is in practice though.) 
regards, tom lane
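The case-insensitive sorting mentioned in the exchange above could be written in 7.4's expression-index syntax roughly like so (table and column names are assumptions for illustration):

```sql
-- Expression index on lower(name) for case-insensitive lookups.
CREATE INDEX my_table_name_lower_idx ON my_table (lower(name));

-- The query must use the same expression for the index to be usable:
SELECT * FROM my_table WHERE lower(name) = lower('SomeValue');
```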
