Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Matthew Schumacher
Okay,

Here is the status of the SA updates and a question:

Michael got SA changed to pass an array of tokens to the proc so right
there we gained a ton of performance due to connections and transactions
being grouped into one per email instead of one per token.

Now I am working on making the proc even faster.  Since we have all of
the tokens coming in as an array, it should be possible to get this down
to just a couple of queries.

I have the proc using IN and NOT IN statements to update everything at
once from a temp table, but it progressively gets slower because the
temp table is growing between vacuums.  At this point it's slightly
slower than the old update or else insert on every token.

What I really want to do is have the token array available as a record
so that I can query against it, but not have it take up the resources of
a real table.  If I could copy from an array into a record then I can
even get rid of the loop.  Anyone have any thoughts on how to do this?


CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
 intokenary BYTEA[],
 inspam_count INTEGER,
 inham_count INTEGER,
 inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
BEGIN

  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
_token := intokenary[i];
INSERT INTO bayes_token_tmp VALUES (_token);
  END LOOP;

  UPDATE
bayes_token
  SET
spam_count = greatest_int(spam_count + inspam_count, 0),
ham_count = greatest_int(ham_count + inham_count , 0),
atime = greatest_int(atime, 1000)
  WHERE
id = inuserid
  AND
(token) IN (SELECT intoken FROM bayes_token_tmp);

  UPDATE
bayes_vars
  SET
token_count = token_count + (SELECT count(intoken) FROM
bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)),
newest_token_age = greatest_int(newest_token_age, inatime),
oldest_token_age = least_int(oldest_token_age, inatime)
  WHERE
id = inuserid;

  INSERT INTO
bayes_token
SELECT
  inuserid,
  intoken,
  inspam_count,
  inham_count,
  inatime
FROM
  bayes_token_tmp
WHERE
  (inspam_count  0 OR inham_count  0)
AND
  (intoken) NOT IN (SELECT token FROM bayes_token);

  delete from bayes_token_tmp;

  RETURN;
END;
' LANGUAGE 'plpgsql';

CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1  $2 THEN $2 ELSE $1 END;'
 LANGUAGE SQL;

CREATE OR REPLACE FUNCTION least_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1  $2 THEN $1 ELSE $2 END;'
 LANGUAGE SQL;

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[PERFORM] Indexed views.

2005-08-04 Thread prasanna s
Does postgres support indexed views/materialised views that some of the other databases support?

Thanks
Prasanna S


Re: [PERFORM] Indexed views.

2005-08-04 Thread Christopher Kings-Lynne

No, unless you use some custom triggers.

prasanna s wrote:
Does postgres support indexed views/materialised views that some of the 
other databases support?


Thanks
Prasanna S



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread PFC



What I really want to do is have the token array available as a record
so that I can query against it, but not have it take up the resources of
a real table.  If I could copy from an array into a record then I can
even get rid of the loop.  Anyone have any thoughts on how to do this?


	You could make a set-returning-function (about 3 lines) which RETURNs  
NEXT every element in the array ; then you can use this SRF just like a  
table and SELECT from it, join it with your other tables, etc.


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Indexed views.

2005-08-04 Thread Laszlo Hornyak

prasanna s wrote:

Does postgres support indexed views/materialised views that some of 
the other databases support?


Thanks
Prasanna S


Hi!

It is not supported, but perhaps this will help you: 
http://www.varlena.com/varlena/GeneralBits/Tidbits/matviews.html





---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote:
 Okay,

 Here is the status of the SA updates and a question:

 Michael got SA changed to pass an array of tokens to the proc so right
 there we gained a ton of performance due to connections and transactions
 being grouped into one per email instead of one per token.

 Now I am working on making the proc even faster.  Since we have all of
 the tokens coming in as an array, it should be possible to get this down
 to just a couple of queries.

 I have the proc using IN and NOT IN statements to update everything at
 once from a temp table, but it progressively gets slower because the
 temp table is growing between vacuums.  At this point it's slightly
 slower than the old update or else insert on every token.

I recommend that you drop and re-create the temp table. There is no
reason to have it around, considering you delete and re-add everything.
That means you never have to vacuum it, since it always only contains
the latest rows.


 What I really want to do is have the token array available as a record
 so that I can query against it, but not have it take up the resources of
 a real table.  If I could copy from an array into a record then I can
 even get rid of the loop.  Anyone have any thoughts on how to do this?


My one question here, is the inspam_count and inham_count *always* the
same for all tokens? I would have thought each token has it's own count.
Anyway, there are a few lines I would change:


 CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
  intokenary BYTEA[],
  inspam_count INTEGER,
  inham_count INTEGER,
  inatime INTEGER)
 RETURNS VOID AS '
 DECLARE
   _token BYTEA;
 BEGIN


-- create the table at the start of the procedure
CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
-- You might also add primary key if you are going to be adding
-- *lots* of entries, but it sounds like you are going to have
-- less than 1 page, so it doesn't matter

   for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
   LOOP
 _token := intokenary[i];
 INSERT INTO bayes_token_tmp VALUES (_token);
   END LOOP;

   UPDATE
 bayes_token
   SET
 spam_count = greatest_int(spam_count + inspam_count, 0),
 ham_count = greatest_int(ham_count + inham_count , 0),
 atime = greatest_int(atime, 1000)
   WHERE
 id = inuserid
   AND

--(token) IN (SELECT intoken FROM bayes_token_tmp);
  EXISTS (SELECT token FROM bayes_token_tmp
   WHERE intoken=token LIMIT 1);

-- I would also avoid your intoken (NOT) IN (SELECT token FROM
-- bayes_token) There are a few possibilities, but to me
-- as your bayes_token table becomes big, this will start
-- to be the slow point

-- Rather than doing 2 NOT IN queries, it *might* be faster to do
   DELETE FROM bayes_token_tmp
WHERE NOT EXISTS (SELECT token FROM bayes_token
   WHERE token=intoken);



   UPDATE
 bayes_vars
   SET

-- token_count = token_count + (SELECT count(intoken) FROM
-- bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)),
   token_count = token_count + (SELECT count(intoken)
  FROM bayes_token_tmp)

-- You don't need the where NOT IN, since we already removed those rows

 newest_token_age = greatest_int(newest_token_age, inatime),
 oldest_token_age = least_int(oldest_token_age, inatime)
   WHERE
 id = inuserid;

   INSERT INTO
 bayes_token
 SELECT
   inuserid,
   intoken,
   inspam_count,
   inham_count,
   inatime
 FROM
   bayes_token_tmp
 WHERE
   (inspam_count  0 OR inham_count  0)

-- AND
--   (intoken) NOT IN (SELECT token FROM bayes_token);

-- You don't need either of those lines, again because we already
-- filtered

--   delete from bayes_token_tmp;
--   And rather than deleting all of the entries just
 DROP TABLE bayes_token_tmp;


   RETURN;
 END;
 ' LANGUAGE 'plpgsql';

 CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
  RETURNS INTEGER
  IMMUTABLE STRICT
  AS 'SELECT CASE WHEN $1  $2 THEN $2 ELSE $1 END;'
  LANGUAGE SQL;

 CREATE OR REPLACE FUNCTION least_int (integer, integer)
  RETURNS INTEGER
  IMMUTABLE STRICT
  AS 'SELECT CASE WHEN $1  $2 THEN $1 ELSE $2 END;'
  LANGUAGE SQL;


So to clarify, here is my finished function:

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
 intokenary BYTEA[],
 inspam_count INTEGER,
 inham_count INTEGER,
 inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
BEGIN

  CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
_token := 

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Tom Lane
Matthew Schumacher [EMAIL PROTECTED] writes:
   for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
   LOOP
 _token := intokenary[i];
 INSERT INTO bayes_token_tmp VALUES (_token);
   END LOOP;

   UPDATE
 bayes_token
   SET
 spam_count = greatest_int(spam_count + inspam_count, 0),
 ham_count = greatest_int(ham_count + inham_count , 0),
 atime = greatest_int(atime, 1000)
   WHERE
 id = inuserid
   AND
 (token) IN (SELECT intoken FROM bayes_token_tmp);

I don't really see why you think that this path is going to lead to
better performance than where you were before.  Manipulation of the
temp table is never going to be free, and IN (sub-select) is always
inherently not fast, and NOT IN (sub-select) is always inherently
awful.  Throwing a pile of simple queries at the problem is not
necessarily the wrong way ... especially when you are doing it in
plpgsql, because you've already eliminated the overhead of network
round trips and repeated planning of the queries.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Tom Lane wrote:
 Matthew Schumacher [EMAIL PROTECTED] writes:

  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
_token := intokenary[i];
INSERT INTO bayes_token_tmp VALUES (_token);
  END LOOP;


  UPDATE
bayes_token
  SET
spam_count = greatest_int(spam_count + inspam_count, 0),
ham_count = greatest_int(ham_count + inham_count , 0),
atime = greatest_int(atime, 1000)
  WHERE
id = inuserid
  AND
(token) IN (SELECT intoken FROM bayes_token_tmp);


 I don't really see why you think that this path is going to lead to
 better performance than where you were before.  Manipulation of the
 temp table is never going to be free, and IN (sub-select) is always
 inherently not fast, and NOT IN (sub-select) is always inherently
 awful.  Throwing a pile of simple queries at the problem is not
 necessarily the wrong way ... especially when you are doing it in
 plpgsql, because you've already eliminated the overhead of network
 round trips and repeated planning of the queries.

So for an IN (sub-select), does it actually pull all of the rows from
the other table, or is the planner smart enough to stop once it finds
something?

Is IN (sub-select) about the same as EXISTS (sub-select WHERE x=y)?

What about NOT IN (sub-select) versus NOT EXISTS (sub-select WHERE x=y)

I would guess that the EXISTS/NOT EXISTS would be faster, though it
probably would necessitate using a nested loop (at least that seems to
be the way the query is written).

I did some tests on a database with 800k rows, versus a temp table with
2k rows. I did one sequential test (1-2000, with 66 rows missing), and
one sparse test (1-200, 10-100200, 20-200200, ... with 658 rows
missing).

If found that NOT IN did indeed have to load the whole table. IN was
smart enough to do a nested loop.
EXISTS and NOT EXISTS did a sequential scan on my temp table, with a
SubPlan filter (which looks a whole lot like a Nested Loop).

What I found was that IN performed about the same as EXISTS (since they
are both effectively doing a nested loop), but that NOT IN took 4,000ms
while NOT EXISTS was the same speed as EXISTS at around 166ms.

Anyway, so it does seem like NOT IN is not a good choice, but IN seems
to be equivalent to EXISTS, and NOT EXISTS is also very fast.

Is this generally true, or did I just get lucky on my data?

John
=:-




   regards, tom lane

 ---(end of broadcast)---
 TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq


mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id IN (SELECT id FROM 
object_t);
   QUERY PLAN
-
 Nested Loop IN Join  (cost=0.00..9851.68 rows=2140 width=4) (actual 
time=0.085..183.889 rows=1351 loops=1)
   -  Seq Scan on ids  (cost=0.00..31.40 rows=2140 width=4) (actual 
time=0.014..24.032 rows=2009 loops=1)
   -  Index Scan using object_t_pkey on object_t  (cost=0.00..4.58 rows=1 
width=4) (actual time=0.071..0.071 rows=1 loops=2009)
 Index Cond: (outer.id = object_t.id)
 Total runtime: 184.823 ms
(5 rows)

Time: 186.931 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE EXISTS (SELECT id FROM 
object_t o WHERE o.id = ids.id);
 QUERY PLAN
-
 Seq Scan on ids  (cost=0.00..9824.93 rows=1070 width=4) (actual 
time=0.086..165.053 rows=1351 loops=1)
   Filter: (subplan)
   SubPlan
 -  Index Scan using object_t_pkey on object_t o  (cost=0.00..4.58 rows=1 
width=4) (actual time=0.025..0.025 rows=1 loops=2009)
   Index Cond: (id = $0)
 Total runtime: 165.995 ms
(6 rows)

Time: 167.795 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id NOT IN (SELECT id FROM 
object_t);
QUERY PLAN
---
 Seq Scan on ids  (cost=36410.51..36447.26 rows=1070 width=4) (actual 
time=4168.247..4172.080 rows=658 loops=1)
   Filter: (NOT (hashed subplan))
   SubPlan
 -  Seq Scan on object_t  (cost=0.00..34381.81 rows=811481 width=4) 
(actual time=0.044..2464.296 rows=811481 loops=1)
 Total runtime: 4210.784 ms
(5 rows)

Time: 4212.276 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE NOT EXISTS (SELECT id FROM 
object_t o WHERE o.id = ids.id);
 QUERY PLAN
-
 Seq Scan on ids  (cost=0.00..9824.93 rows=1070 width=4) (actual 

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Matthew Schumacher
John A Meinel wrote:
 Matthew Schumacher wrote:
 
  I recommend that you drop and re-create the temp table. There is no
 reason to have it around, considering you delete and re-add everything.
 That means you never have to vacuum it, since it always only contains
 the latest rows.

Whenever I have a create temp and drop statement I get these errors:

select put_tokens(1, '{000}', 1, 1, 1000);
ERROR:  relation with OID 582248 does not exist
CONTEXT:  SQL statement INSERT INTO bayes_token_tmp VALUES ( $1 )
PL/pgSQL function put_tokens line 12 at SQL statement


 
 
 
 My one question here, is the inspam_count and inham_count *always* the
 same for all tokens? I would have thought each token has it's own count.
 Anyway, there are a few lines I would change:

No, we get the userid, inspam, inham, and atime, and they are the same
for each token.  If we have a different user we call the proc again.

 -- create the table at the start of the procedure
 CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
 -- You might also add primary key if you are going to be adding
 -- *lots* of entries, but it sounds like you are going to have
 -- less than 1 page, so it doesn't matter

This causes errors, see above

 --(token) IN (SELECT intoken FROM bayes_token_tmp);
   EXISTS (SELECT token FROM bayes_token_tmp
  WHERE intoken=token LIMIT 1);
 
 -- I would also avoid your intoken (NOT) IN (SELECT token FROM
 -- bayes_token) There are a few possibilities, but to me
 -- as your bayes_token table becomes big, this will start
 -- to be the slow point
 
 -- Rather than doing 2 NOT IN queries, it *might* be faster to do
DELETE FROM bayes_token_tmp
   WHERE NOT EXISTS (SELECT token FROM bayes_token
  WHERE token=intoken);
 


I'll look into this.


thanks,

schu

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Tom Lane
John A Meinel [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I don't really see why you think that this path is going to lead to
 better performance than where you were before.

 So for an IN (sub-select), does it actually pull all of the rows from
 the other table, or is the planner smart enough to stop once it finds
 something?

It stops when it finds something --- but it's still a join operation
in essence.  I don't see that putting the values one by one into a table
and then joining is going to be a win compared to just processing the
values one at a time against the main table.

 Is IN (sub-select) about the same as EXISTS (sub-select WHERE x=y)?
 What about NOT IN (sub-select) versus NOT EXISTS (sub-select WHERE x=y)

The EXISTS variants are actually worse, because we've not spent as much
time teaching the planner how to optimize them.  There's effectively
only one decent plan for an EXISTS, which is that the subselect's x is
indexed and we do an indexscan probe using the outer y for each outer
row.  IN and NOT IN can do that, or several alternative plans that might
be better depending on data statistics.

However, that's cold comfort for Matthew's application -- the only way
he'd get any benefit from all those planner smarts is if he ANALYZEs
the temp table after loading it and then EXECUTEs the main query (so
that it gets re-planned every time).  Plus, at least some of those
alternative plans would require an index on the temp table, which is
unlikely to be worth the cost of setting up.  And finally, this
formulation requires separate IN and NOT IN tests that are necessarily
going to do a lot of redundant work.

There's enough overhead here that I find it highly doubtful that it'll
be a win compared to the original approach of retail queries against the
main table.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Matthew Schumacher
Tom Lane wrote:

 I don't really see why you think that this path is going to lead to
 better performance than where you were before.  Manipulation of the
 temp table is never going to be free, and IN (sub-select) is always
 inherently not fast, and NOT IN (sub-select) is always inherently
 awful.  Throwing a pile of simple queries at the problem is not
 necessarily the wrong way ... especially when you are doing it in
 plpgsql, because you've already eliminated the overhead of network
 round trips and repeated planning of the queries.
 
   regards, tom lane

The reason why I think this may be faster is because I would avoid
running an update on data that needs to be inserted which saves
searching though the table for a matching token.

Perhaps I should do the insert first, then drop those tokens from the
temp table, then do my updates in a loop.

I'll have to do some benchmarking...

schu

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] nice/low priority Query

2005-08-04 Thread Neil Conway

Jim C. Nasby wrote:

Actually, from what I've read 4.2BSD actually took priority into account
when scheduling I/O.


FWIW, you can set I/O priority in recent versions of the Linux kernel 
using ionice, which is part of RML's schedutils package (which was 
recently merged into util-linux).


-Neil

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Matthew Schumacher
Matthew Schumacher wrote:
 Tom Lane wrote:
 
 
I don't really see why you think that this path is going to lead to
better performance than where you were before.  Manipulation of the
temp table is never going to be free, and IN (sub-select) is always
inherently not fast, and NOT IN (sub-select) is always inherently
awful.  Throwing a pile of simple queries at the problem is not
necessarily the wrong way ... especially when you are doing it in
plpgsql, because you've already eliminated the overhead of network
round trips and repeated planning of the queries.

  regards, tom lane
 
 
 The reason why I think this may be faster is because I would avoid
 running an update on data that needs to be inserted which saves
 searching though the table for a matching token.
 
 Perhaps I should do the insert first, then drop those tokens from the
 temp table, then do my updates in a loop.
 
 I'll have to do some benchmarking...
 
 schu

Tom, I think your right, whenever I do a NOT IN it does a full table
scan against bayes_token and since that table is going to get very big
doing the simple query in a loop that uses an index seems a bit faster.

John, thanks for your help, it was worth a try, but it looks like the
looping is just faster.

Here is what I have so far in case anyone else has ideas before I
abandon it:

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
 intokenary BYTEA[],
 inspam_count INTEGER,
 inham_count INTEGER,
 inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
BEGIN

  UPDATE
bayes_token
  SET
spam_count = greatest_int(spam_count + inspam_count, 0),
ham_count = greatest_int(ham_count + inham_count , 0),
atime = greatest_int(atime, inatime)
  WHERE
id = inuserid
  AND
(token) IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary));

  UPDATE
bayes_vars
  SET
token_count = token_count + (
  SELECT
count(bayes_token_tmp)
  FROM
bayes_token_tmp(intokenary)
  WHERE
bayes_token_tmp NOT IN (SELECT token FROM bayes_token)),
newest_token_age = greatest_int(newest_token_age, inatime),
oldest_token_age = least_int(oldest_token_age, inatime)
  WHERE
id = inuserid;

  INSERT INTO
bayes_token
SELECT
  inuserid,
  bayes_token_tmp,
  inspam_count,
  inham_count,
  inatime
FROM
  bayes_token_tmp(intokenary)
WHERE
  (inspam_count  0 OR inham_count  0)
AND
  (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);

  RETURN;
END;
' LANGUAGE 'plpgsql';

CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS
SETOF bytea AS
'
BEGIN
  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
return next intokenary[i];
  END LOOP;
  RETURN;
end
'
language 'plpgsql';

CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1  $2 THEN $2 ELSE $1 END;'
 LANGUAGE SQL;

CREATE OR REPLACE FUNCTION least_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1  $2 THEN $1 ELSE $2 END;'
 LANGUAGE SQL;

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote:
 Matthew Schumacher wrote:

Tom Lane wrote:



I don't really see why you think that this path is going to lead to
better performance than where you were before.  Manipulation of the
temp table is never going to be free, and IN (sub-select) is always
inherently not fast, and NOT IN (sub-select) is always inherently
awful.  Throwing a pile of simple queries at the problem is not
necessarily the wrong way ... especially when you are doing it in
plpgsql, because you've already eliminated the overhead of network
round trips and repeated planning of the queries.

 regards, tom lane


The reason why I think this may be faster is because I would avoid
running an update on data that needs to be inserted which saves
searching though the table for a matching token.

Perhaps I should do the insert first, then drop those tokens from the
temp table, then do my updates in a loop.

I'll have to do some benchmarking...

schu


 Tom, I think your right, whenever I do a NOT IN it does a full table
 scan against bayes_token and since that table is going to get very big
 doing the simple query in a loop that uses an index seems a bit faster.

 John, thanks for your help, it was worth a try, but it looks like the
 looping is just faster.

 Here is what I have so far in case anyone else has ideas before I
 abandon it:

Surely this isn't what you have. You have *no* loop here, and you have
stuff like:
  AND
(bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);

I'm guessing this isn't your last version of the function.

As far as putting the CREATE TEMP TABLE inside the function, I think the
problem is that the first time it runs, it compiles the function, and
when it gets to the UPDATE/INSERT with the temporary table name, at
compile time it hard-codes that table id.

I tried getting around it by using EXECUTE which worked, but it made
the function horribly slow. So I don't recommend it.

Anyway, if you want us to evaluate it, you really need to send us the
real final function.

John
=:-



 CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
  intokenary BYTEA[],
  inspam_count INTEGER,
  inham_count INTEGER,
  inatime INTEGER)
 RETURNS VOID AS '
 DECLARE
   _token BYTEA;
 BEGIN

   UPDATE
 bayes_token
   SET
 spam_count = greatest_int(spam_count + inspam_count, 0),
 ham_count = greatest_int(ham_count + inham_count , 0),
 atime = greatest_int(atime, inatime)
   WHERE
 id = inuserid
   AND
 (token) IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary));

   UPDATE
 bayes_vars
   SET
 token_count = token_count + (
   SELECT
 count(bayes_token_tmp)
   FROM
 bayes_token_tmp(intokenary)
   WHERE
 bayes_token_tmp NOT IN (SELECT token FROM bayes_token)),
 newest_token_age = greatest_int(newest_token_age, inatime),
 oldest_token_age = least_int(oldest_token_age, inatime)
   WHERE
 id = inuserid;

   INSERT INTO
 bayes_token
 SELECT
   inuserid,
   bayes_token_tmp,
   inspam_count,
   inham_count,
   inatime
 FROM
   bayes_token_tmp(intokenary)
 WHERE
   (inspam_count  0 OR inham_count  0)
 AND
   (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);

   RETURN;
 END;
 ' LANGUAGE 'plpgsql';

 CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS
 SETOF bytea AS
 '
 BEGIN
   for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
   LOOP
 return next intokenary[i];
   END LOOP;
   RETURN;
 end
 '
 language 'plpgsql';

 CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
  RETURNS INTEGER
  IMMUTABLE STRICT
  AS 'SELECT CASE WHEN $1  $2 THEN $2 ELSE $1 END;'
  LANGUAGE SQL;

 CREATE OR REPLACE FUNCTION least_int (integer, integer)
  RETURNS INTEGER
  IMMUTABLE STRICT
  AS 'SELECT CASE WHEN $1  $2 THEN $1 ELSE $2 END;'
  LANGUAGE SQL;

 ---(end of broadcast)---
 TIP 4: Have you searched our list archives?

http://archives.postgresql.org




signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread Matthew Schumacher
John A Meinel wrote:

 Surely this isn't what you have. You have *no* loop here, and you have
 stuff like:
   AND
 (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);
 
 I'm guessing this isn't your last version of the function.
 
 As far as putting the CREATE TEMP TABLE inside the function, I think the
 problem is that the first time it runs, it compiles the function, and
 when it gets to the UPDATE/INSERT with the temporary table name, at
 compile time it hard-codes that table id.
 
 I tried getting around it by using EXECUTE which worked, but it made
 the function horribly slow. So I don't recommend it.
 
 Anyway, if you want us to evaluate it, you really need to send us the
 real final function.
 
 John
 =:-

It is the final function.  It doesn't need a loop because of the
bayes_token_tmp function I added.  The array is passed to it and it
returns a record set so I can work off of it like it's a table.  So the
function works the same way it before, but instead of using SELECT
intoken from TEMPTABLE, you use SELECT bayes_token_tmp from
bayes_token_tmp(intokenary).

I think this is more efficient than the create table overhead,
especially because the incoming record set won't be to big.

Thanks,

schu


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote:
 John A Meinel wrote:


Surely this isn't what you have. You have *no* loop here, and you have
stuff like:
  AND
(bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);

I'm guessing this isn't your last version of the function.

As far as putting the CREATE TEMP TABLE inside the function, I think the
problem is that the first time it runs, it compiles the function, and
when it gets to the UPDATE/INSERT with the temporary table name, at
compile time it hard-codes that table id.

I tried getting around it by using EXECUTE which worked, but it made
the function horribly slow. So I don't recommend it.

Anyway, if you want us to evaluate it, you really need to send us the
real final function.

John
=:-


 It is the final function.  It doesn't need a loop because of the
 bayes_token_tmp function I added.  The array is passed to it and it
 returns a record set so I can work off of it like it's a table.  So the
 function works the same way it before, but instead of using SELECT
 intoken from TEMPTABLE, you use SELECT bayes_token_tmp from
 bayes_token_tmp(intokenary).

 I think this is more efficient than the create table overhead,
 especially because the incoming record set won't be to big.

 Thanks,

 schu



Well, I would at least recommend that you change the WHERE
bayes_token_tmp NOT IN (SELECT token FROM bayes_token)
with a
WHERE NOT EXISTS (SELECT toke FROM bayes_token WHERE
token=bayes_token_tmp)

You might try experimenting with the differences, but on my system the
NOT IN has to do a full sequential scan on bayes_token and load all
entries into a list, while NOT EXISTS can do effectively a nested loop.

The nested loop requires that there is an index on bayes_token(token),
but I'm pretty sure there is anyway.

Again, in my testing, it was a difference of 4200ms versus 180ms. (800k
rows in my big table, 2k in the temp one)

John
=:-



signature.asc
Description: OpenPGP digital signature