Okay,
Here is the status of the SA updates and a question:
Michael got SA changed to pass an array of tokens to the proc so right
there we gained a ton of performance due to connections and transactions
being grouped into one per email instead of one per token.
Now I am working on making the
What I really want to do is have the token array available as a record
so that I can query against it, but not have it take up the resources of
a real table. If I could copy from an array into a record then I can
even get rid of the loop. Anyone have any thoughts on how to do this?
You
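One way to get that effect, as a sketch only (reusing the intokenary parameter name from the proc), is to expand the array into a row source directly, so no temp table or explicit loop is needed. Note unnest() only exists in PostgreSQL 8.4 and later; on older releases generate_series() over the array bounds does the same job:

```sql
-- PostgreSQL 8.4+: expand the array into rows directly
SELECT t.token FROM unnest(intokenary) AS t(token);

-- Older releases: index into the array with generate_series()
SELECT intokenary[i] AS token
FROM generate_series(array_lower(intokenary, 1),
                     array_upper(intokenary, 1)) AS i;
```

Either form can be used as a sub-select anywhere the temp table is queried now.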
Matthew Schumacher [EMAIL PROTECTED] writes:
for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) LOOP
    _token := intokenary[i];
    INSERT INTO bayes_token_tmp VALUES (_token);
END LOOP;

UPDATE bayes_token
SET spam_count = greatest_int(spam_count +
John A Meinel wrote:
Matthew Schumacher wrote:
I recommend that you drop and re-create the temp table. There is no
reason to have it around, considering you delete and re-add everything.
That means you never have to vacuum it, since it always only contains
the latest rows.
Whenever I have
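A sketch of that idea, assuming the whole per-email batch runs in one transaction: ON COMMIT DROP makes the table vanish by itself at the end of the transaction, so there is never anything to vacuum:

```sql
BEGIN;
-- recreated fresh for every email; dropped automatically at COMMIT
CREATE TEMP TABLE bayes_token_tmp (token bytea) ON COMMIT DROP;
-- ... load the batch, join against bayes_token, update counts ...
COMMIT;
```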
John A Meinel [EMAIL PROTECTED] writes:
Tom Lane wrote:
I don't really see why you think that this path is going to lead to
better performance than where you were before.
So for an IN (sub-select), does it actually pull all of the rows from
the other table, or is the planner smart enough to
Tom Lane wrote:
I don't really see why you think that this path is going to lead to
better performance than where you were before. Manipulation of the
temp table is never going to be free, and IN (sub-select) is always
inherently not fast, and NOT IN (sub-select) is always inherently
awful.
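A common way around the NOT IN (sub-select) cost is an anti-join; a sketch using the table names from this thread:

```sql
-- instead of: ... WHERE token NOT IN (SELECT token FROM bayes_token)
SELECT t.token
FROM bayes_token_tmp t
LEFT JOIN bayes_token b ON b.token = t.token
WHERE b.token IS NULL;   -- keeps only tokens with no match
```

The planner can usually run the outer-join form as a single hash join rather than re-evaluating a sub-select per row.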
John A Meinel wrote:
Surely this isn't what you have. You have *no* loop here, and you have
stuff like:
AND
(bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);
I'm guessing this isn't your last version of the function.
As far as putting the CREATE TEMP TABLE inside the
Jim C. Nasby wrote:
I'm not sure who's responsible for DBD::Pg (Josh?), but would it make
sense to add better support for bytea to DBD::Pg? ISTM there should be a
better way of doing this than adding gobs of \'s.
It has support for binding a bytea parameter, but in this case we're
trying to
John Arbash Meinel wrote:
Matthew Schumacher wrote:
All it's doing is trying the update before the insert to get around the
problem of not knowing which is needed. With only 2-3 of the queries
implemented I'm already back to running about the same speed as the
original SA proc that is going
Ok, here is the current plan.
Change the spamassassin API to pass a hash of tokens into the storage
module, pass the tokens to the proc as an array, start a transaction,
load the tokens into a temp table using copy, select the tokens distinct
into the token table for new tokens, update the token
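Roughly, in SQL (a sketch only; the real bayes_token has more columns, and the COPY data would be streamed from the client):

```sql
BEGIN;
CREATE TEMP TABLE bayes_token_tmp (token bytea) ON COMMIT DROP;
COPY bayes_token_tmp (token) FROM STDIN;   -- bulk-load the batch

-- add tokens not seen before
INSERT INTO bayes_token (token, spam_count, ham_count)
SELECT DISTINCT t.token, 0, 0
FROM bayes_token_tmp t
LEFT JOIN bayes_token b ON b.token = t.token
WHERE b.token IS NULL;

-- bump the counts for every token in this email
UPDATE bayes_token
SET spam_count = spam_count + 1
WHERE token IN (SELECT token FROM bayes_token_tmp);
COMMIT;
```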
Michael Parker [EMAIL PROTECTED] writes:
sub bytea_esc {
  my ($str) = @_;
  my $buf = "";
  foreach my $char (split(//, $str)) {
    if    (ord($char) == 0)  { $buf .= '\\000'; }
    elsif (ord($char) == 39) { $buf .= '\\047'; }
    elsif (ord($char) == 92) { $buf .= '\\134'; }
    else                     { $buf .= $char; }
  }
  return $buf;
}
Matthew Schumacher [EMAIL PROTECTED] writes:
One thing that is still very puzzling to me is why this runs so much
slower when I put the data.sql in a transaction. Obviously transactions
are acting different when you call a proc a zillion times vs an insert
query.
I looked into this a bit.
Tom Lane wrote:
I looked into this a bit. It seems that the problem when you wrap the
entire insertion series into one transaction is associated with the fact
that the test does so many successive updates of the single row in
bayes_vars. (VACUUM VERBOSE at the end of the test shows it
Matthew Schumacher wrote:
All it's doing is trying the update before the insert to get around the
problem of not knowing which is needed. With only 2-3 of the queries
implemented I'm already back to running about the same speed as the
original SA proc that is going to ship with SA 3.1.0.
All of
On Thu, 28 Jul 2005, Matthew Schumacher wrote:
Gavin Sherry wrote:
I had a look at your data -- thanks.
I have a question though: put_token() is invoked 120596 times in your
benchmark... for 616 messages. That's nearly 200 queries (not even
counting the 1-8 (??) inside the function
On Wed, 27 Jul 2005, Matthew Schumacher wrote:
Then they do this to insert the token:
INSERT INTO bayes_token (
    id, token, spam_count, ham_count, atime
) VALUES (
    ?, ?, ?, ?, ?
) ON DUPLICATE KEY UPDATE
    spam_count = GREATEST(spam_count + ?, 0),
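For reference, PostgreSQL did not get a direct counterpart to MySQL's ON DUPLICATE KEY UPDATE until much later (INSERT ... ON CONFLICT, 9.5+). A sketch, assuming the primary key covers (id, token):

```sql
INSERT INTO bayes_token (id, token, spam_count, ham_count, atime)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (id, token) DO UPDATE
SET spam_count = GREATEST(bayes_token.spam_count + EXCLUDED.spam_count, 0);
```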
Dennis,
EXCEPTION WHEN unique_violation THEN
I seem to remember that catching an exception in a PL/pgSQL procedure was a
large performance cost. It'd be better to do UPDATE ... IF NOT FOUND.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
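The UPDATE ... IF NOT FOUND pattern in PL/pgSQL looks like this (a sketch, using the parameter names from the update_token signature elsewhere in the thread):

```sql
UPDATE bayes_token
SET spam_count = spam_count + _spam_count
WHERE id = _id AND token = _token;

IF NOT FOUND THEN
    INSERT INTO bayes_token (id, token, spam_count, ham_count, atime)
    VALUES (_id, _token, _spam_count, _ham_count, _atime);
END IF;
```

Note the race: two sessions can both miss the UPDATE and collide on the INSERT, which is why the exception-catching version exists in the first place.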
Andrew McMillan wrote:
On Thu, 2005-07-28 at 16:13 -0800, Matthew Schumacher wrote:
Ok, I finally got some test data together so that others can test
without installing SA.
The schema and test dataset is over at
http://www.aptalaska.net/~matt.s/bayes/bayesBenchmark.tar.gz
I have a pretty fast
Josh Berkus wrote:
Dennis,
EXCEPTION WHEN unique_violation THEN
I seem to remember that catching an exception in a PL/pgSQL procedure was a
large performance cost. It'd be better to do UPDATE ... IF NOT FOUND.
Actually, he was doing an implicit UPDATE IF NOT FOUND in
Also, this test goes a bit faster with fsync turned off; if MySQL isn't
using fsync, that would be why it's so much faster. Anyone know what the
default for MySQL is?
For InnoDB I think it's like Postgres (only slower) ; for MyISAM it's no
fsync, no transactions, no crash tolerance of any
On Fri, 2005-07-29 at 09:37 -0800, Matthew Schumacher wrote:
On my laptop this takes:
real    1m33.758s
user    0m4.285s
sys     0m1.181s
One interesting effect is the data in bayes_vars has a huge number of
updates and needs vacuum _frequently_. After the run a vacuum full
Andrew McMillan [EMAIL PROTECTED] writes:
On Fri, 2005-07-29 at 09:37 -0800, Matthew Schumacher wrote:
How often should this table be vacuumed, every 5 minutes?
I would be tempted to vacuum after each e-mail, in this case.
Perhaps the bulk of the transient states should be done in a temp
Ok, here is where I'm at, I reduced the proc down to this:
CREATE FUNCTION update_token (_id INTEGER,
_token BYTEA,
_spam_count INTEGER,
_ham_count INTEGER,
_atime INTEGER)
Matthew Schumacher [EMAIL PROTECTED] writes:
After playing with various indexes and what not I simply am unable to
make this procedure perform any better. Perhaps someone on the list can
spot the bottleneck and reveal why this procedure isn't performing that
well or ways to make it better.
On Wed, 2005-07-27 at 14:35 -0800, Matthew Schumacher wrote:
I put the rest of the schema up at
http://www.aptalaska.net/~matt.s/bayes/bayes_pg.sql in case someone
needs to see it too.
Do you have sample data too?
--
Karim Nassar
Collaborative Computing Lab of NAU
Office: (928) 523 5868 -=-
I'm not sure how much this has been discussed on the list, but I wasn't
able to find anything relevant in the archives.
The new SpamAssassin is due out pretty soon. They are currently testing
3.1.0pre4. One of the things I hope to get out of this release is bayes
word stats moved to a real
Gavin Sherry wrote:
I had a look at your data -- thanks.
I have a question though: put_token() is invoked 120596 times in your
benchmark... for 616 messages. That's nearly 200 queries (not even
counting the 1-8 (??) inside the function itself) per message. Something
doesn't seem right
Matt,
After playing with various indexes and what not I simply am unable to
make this procedure perform any better. Perhaps someone on the list can
spot the bottleneck and reveal why this procedure isn't performing that
well or ways to make it better.
Well, my first thought is that this is
Matt,
UPDATE bayes_vars SET
$token_count_update
newest_token_age = GREATEST(newest_token_age, ?),
oldest_token_age = LEAST(oldest_token_age, ?)
WHERE id = ?
I think the reason why the procedure was written for postgres was
because of the greatest and least statements performing
Josh Berkus wrote:
Matt,
Well, it might be because we don't have a built-in GREATEST or LEAST prior to
8.1. However, it's pretty darned easy to construct one.
I was more talking about min() and max() but yea, I think you knew where
I was going with it...
Well, there's the general