Re: Bayes column 'token'

2006-11-21 Thread Jonas Eckerman

CREATE TABLE bayes_token (
  PRIMARY KEY (id, token),
  INDEX bayes_token_idx1 (token),
  INDEX bayes_token_idx2 (id, atime)
) TYPE=MyISAM;



PRIMARY for `id` and `token` should not have INDEX for `id` and `token`
added, too.


Why not?

IIRC the three indexes above makes perfect sense. Like this:

WHERE id=xxx AND token=xxx will use the primary index.

WHERE token=xxx will use the bayes_token_idx1 index.

WHERE id=xxx AND atime=xxx will use the bayes_token_idx2 index.

Again IIRC, the clause WHERE token=xxx should be faster with the existance of the 
bayes_token_idx1 index than without it.

Or is it simply that the MySQL bayse store module never queries with token as 
the first column in a WHERE clause?

I might of course completely misremenber this, so rather than trust me I'd 
suggest reading about index optimization or something like that in the current 
MySQL documentation. :-)

Regards
/Jonas
--
Jonas Eckerman, FSDB  Fruktträdet
http://whatever.frukt.org/
http://www.fsdb.org/
http://www.frukt.org/



Re: Bayes column 'token'

2006-11-21 Thread Jonas Eckerman

Michael Alan Dorman wrote:


CREATE TABLE bayes_token (
  PRIMARY KEY (id, token),
  INDEX bayes_token_idx1 (token),
  INDEX bayes_token_idx2 (id, atime)
) TYPE=MyISAM;



If the primary key was changed to (token, id), it should be able to be
used in the second sort of query as well as the first, no?


Ah. Yes. I missed that.

Yes, if the key was (token, id), there might be no need at all for the 
bayes_token_idx1 index above.

Sorry for my confusion.

I'm no SQL or MySQL guru, so maybe I'm missing something though.

I have no idea how one best optimizes a MySQL table for using WHERE xxx IN 
(yyy) for example, and the module does have
WHERE id = ? AND token IN (...) ...
in a UPDATE statement. Could that be the reason for the separate (token) index?

I'm supposing that whoever wrote the MySQL bayes storage module had a reason 
for doing what they did and that I simply am not good enough at SQL to realize 
what the reason is. :-)

Regards
/Jonas
--
Jonas Eckerman, FSDB  Fruktträdet
http://whatever.frukt.org/
http://www.fsdb.org/
http://www.frukt.org/



Re: Bayes column 'token'

2006-11-21 Thread Michael Alan Dorman
On Tue, 21 Nov 2006 13:42:09 +0100
Jonas Eckerman [EMAIL PROTECTED] wrote:

  CREATE TABLE bayes_token (
PRIMARY KEY (id, token),
INDEX bayes_token_idx1 (token),
INDEX bayes_token_idx2 (id, atime)
  ) TYPE=MyISAM;
 
  PRIMARY for `id` and `token` should not have INDEX for `id` and
  `token` added, too.
 
 Why not?
 
 IIRC the three indexes above makes perfect sense. Like this:
 
 WHERE id=xxx AND token=xxx will use the primary index.
 
 WHERE token=xxx will use the bayes_token_idx1 index.
 
 WHERE id=xxx AND atime=xxx will use the bayes_token_idx2 index.
 
 Again IIRC, the clause WHERE token=xxx should be faster with the
 existance of the bayes_token_idx1 index than without it.

If the primary key was changed to (token, id), it should be able to be
used in the second sort of query as well as the first, no? Or is MySQL
not smart enough to recognize that it's got an index it could match on
a prefix basis?

 Or is it simply that the MySQL bayse store module never queries with
 token as the first column in a WHERE clause?

The position of a column in the WHERE clause shouldn't make a
difference whether an index is used; the nature of SQL is such that
WHERE clauses should be reorderable.  I'm a PostgreSQL guy myself, but
I would still be surprised if MySQL were limited in this way.

Mike.


RE: Bayes column 'token'

2006-11-17 Thread Mark
 -Original Message-
 From: Mark [mailto:[EMAIL PROTECTED] 
 Sent: woensdag 15 november 2006 18:15
 To: 'users@spamassassin.apache.org'
 Subject: RE: Bayes column 'token'
 
   Well, bayes_mysql.sql does not specify collation; so, like
   you said, the collation will be your MySQL server-set default. And
   searches in MySQL are case-insensitive by default. Might indeed
   perhaps be a good idea to convert to latin1_bin or some such.
 
  There will be any problem if I convert the current data to the new
  collation?
 
 I see no indication (or reason) in the code that tokens are 
 to be handled in an case-insensitive manner. The opposite, ere.
 So, I'm inclined to say that latin1_bin collation is better.
 I don't wanna be responsible for messing up your database, though. :)
 So I will test this a bit on my Vmware box.

Did the testing; and it works very smooth with latin1_bin.

   PRIMARY for `id` and `token` should not have INDEX for `id`
   and `token` added, too.
 
  I don't understand what you mean.
  The couple (id, token) is PRIMARY, not INDEX...
  Where exactly is the problem?
 
 PRIMARY, like UNIQUE, always implies INDEX, too. So, adding 
 an extra INDEX for `id` and `token` basically gives you a double
 INDEX for them.

There's a double INDEX for `atime` too. So, I'd say, in
bayes_mysql.sql, replace this:

CREATE TABLE bayes_token (
  id int(11) NOT NULL default '0',
  token char(5) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY  (id, token),
  INDEX bayes_token_idx1 (token),
  INDEX bayes_token_idx2 (id, atime)
) TYPE=MyISAM;

With:

CREATE TABLE bayes_token (
  id int(11) NOT NULL default '0',
  token char(5) COLLATE latin1_bin NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY (id, token),
  INDEX bayes_token_idx1 (atime)
) TYPE=MyISAM;

- Mark



Re: Bayes column 'token'

2006-11-17 Thread Stuart Johnston

Mark wrote:

-Original Message-
From: Mark [mailto:[EMAIL PROTECTED] 
Sent: woensdag 15 november 2006 18:15

To: 'users@spamassassin.apache.org'
Subject: RE: Bayes column 'token'


Well, bayes_mysql.sql does not specify collation; so, like
you said, the collation will be your MySQL server-set default. And
searches in MySQL are case-insensitive by default. Might indeed
perhaps be a good idea to convert to latin1_bin or some such.

There will be any problem if I convert the current data to the new
collation?
I see no indication (or reason) in the code that tokens are 
to be handled in an case-insensitive manner. The opposite, ere.

So, I'm inclined to say that latin1_bin collation is better.
I don't wanna be responsible for messing up your database, though. :)
So I will test this a bit on my Vmware box.


Did the testing; and it works very smooth with latin1_bin.


PRIMARY for `id` and `token` should not have INDEX for `id`
and `token` added, too.

I don't understand what you mean.
The couple (id, token) is PRIMARY, not INDEX...
Where exactly is the problem?
PRIMARY, like UNIQUE, always implies INDEX, too. So, adding 
an extra INDEX for `id` and `token` basically gives you a double

INDEX for them.


There's a double INDEX for `atime` too. So, I'd say, in
bayes_mysql.sql, replace this:

CREATE TABLE bayes_token (
  id int(11) NOT NULL default '0',
  token char(5) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY  (id, token),
  INDEX bayes_token_idx1 (token),
  INDEX bayes_token_idx2 (id, atime)
) TYPE=MyISAM;

With:

CREATE TABLE bayes_token (
  id int(11) NOT NULL default '0',
  token char(5) COLLATE latin1_bin NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY (id, token),
  INDEX bayes_token_idx1 (atime)
) TYPE=MyISAM;


Those are multi-column indexes not duplicates.

INDEX bayes_token_idx1 (id, atime)

is NOT the same as:

INDEX bayes_token_idx1 (id)
INDEX bayes_token_idx2 (atime)

Unless you've verified that the SQL used by the Bayes modules doesn't need these indexes, you 
probably shouldn't change these.


(sorry I didn't notice this earlier in the thread)


RE: Bayes column 'token'

2006-11-17 Thread Mark
 -Original Message-
 From: Stuart Johnston [mailto:[EMAIL PROTECTED] 
 Sent: vrijdag 17 november 2006 23:30
 To: users@spamassassin.apache.org
 Subject: Re: Bayes column 'token'
 
  CREATE TABLE bayes_token (
id int(11) NOT NULL default '0',
token char(5) COLLATE latin1_bin NOT NULL default '',
spam_count int(11) NOT NULL default '0',
ham_count int(11) NOT NULL default '0',
atime int(11) NOT NULL default '0',
PRIMARY KEY (id, token),
INDEX bayes_token_idx1 (atime)
  ) TYPE=MyISAM;
 
 Those are multi-column indexes not duplicates.
 
 INDEX bayes_token_idx1 (id, atime)
 
 is NOT the same as:
 
 INDEX bayes_token_idx1 (id)
 INDEX bayes_token_idx2 (atime)

The multi-column INDEX actually *does* appear to be used.
My bad. Though it still doesn't justify the existence of:

INDEX bayes_token_idx1 (token)

- Mark



Re: Bayes column 'token'

2006-11-17 Thread Stuart Johnston

Mark wrote:

-Original Message-
From: Stuart Johnston [mailto:[EMAIL PROTECTED] 
Sent: vrijdag 17 november 2006 23:30

To: users@spamassassin.apache.org
Subject: Re: Bayes column 'token'


CREATE TABLE bayes_token (
  id int(11) NOT NULL default '0',
  token char(5) COLLATE latin1_bin NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY (id, token),
  INDEX bayes_token_idx1 (atime)
) TYPE=MyISAM;

Those are multi-column indexes not duplicates.

INDEX bayes_token_idx1 (id, atime)

is NOT the same as:

INDEX bayes_token_idx1 (id)
INDEX bayes_token_idx2 (atime)


The multi-column INDEX actually *does* appear to be used.
My bad. Though it still doesn't justify the existence of:

INDEX bayes_token_idx1 (token)


I haven't looked at the SQL but does it not do something like: WHERE token = ?


Bayes column 'token'

2006-11-15 Thread Federico Giannici

Last week we migrated our bayes DB from DBM to MySQL.
Now we have upgraded our MySQL server from version 4.0 to 4.1.

Today I found a couple of duplicate index values in the token column 
of bayes_token table.


This field is defined as char(5) with default collation (that is 
latin1_swedish_ci). Is it the correct one?


It seems to me that the token content is some kind of binary data, so 
this column type and collation don't seem correct to me. It is even Case 
Insensitive!


So, please confirm me that they are the correct one.


Thanks.

--
___
__
   |-  [EMAIL PROTECTED]
   |ederico Giannici  http://www.neomedia.it
___


RE: Bayes column 'token'

2006-11-15 Thread Mark
 -Original Message-
 From: Federico Giannici [mailto:[EMAIL PROTECTED] 
 Sent: woensdag 15 november 2006 10:31
 To: users@spamassassin.apache.org
 Subject: Bayes column 'token'
 
 
 Last week we migrated our bayes DB from DBM to MySQL.
 Now we have upgraded our MySQL server from version 4.0 to 4.1.
 
 Today I found a couple of duplicate index values in the 
 token column of bayes_token table.
 
 This field is defined as char(5) with default collation
 (that is latin1_swedish_ci). Is it the correct one?

Well, bayes_mysql.sql does not specify collation; so, like you said, the
collation will be your MySQL server-set default. And searches in MySQL
are case-insensitive by default. Might indeed perhaps be a good idea
to convert to latin1_bin or some such.

There is, btw, now that I look at it, a small bug in:

CREATE TABLE bayes_token (
  id int(11) NOT NULL default '0',
  token char(5) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY (id, token),
  INDEX bayes_token_idx1 (token),
  INDEX bayes_token_idx2 (id, atime)
) TYPE=MyISAM;

PRIMARY for `id` and `token` should not have INDEX for `id` and `token`
added, too.

- Mark



Re: Bayes column 'token'

2006-11-15 Thread Federico Giannici

Mark wrote:

-Original Message-
From: Federico Giannici [mailto:[EMAIL PROTECTED] 
Sent: woensdag 15 november 2006 10:31

To: users@spamassassin.apache.org
Subject: Bayes column 'token'


Last week we migrated our bayes DB from DBM to MySQL.
Now we have upgraded our MySQL server from version 4.0 to 4.1.

Today I found a couple of duplicate index values in the 
token column of bayes_token table.


This field is defined as char(5) with default collation
(that is latin1_swedish_ci). Is it the correct one?


Well, bayes_mysql.sql does not specify collation; so, like you said, the
collation will be your MySQL server-set default. And searches in MySQL
are case-insensitive by default. Might indeed perhaps be a good idea
to convert to latin1_bin or some such.


There will be any problem if I convert the current data to the new 
collation?




There is, btw, now that I look at it, a small bug in:

CREATE TABLE bayes_token (
  id int(11) NOT NULL default '0',
  token char(5) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY (id, token),
  INDEX bayes_token_idx1 (token),
  INDEX bayes_token_idx2 (id, atime)
) TYPE=MyISAM;

PRIMARY for `id` and `token` should not have INDEX for `id` and `token`
added, too.


I don't understand what you mean.
The couple (id, token) is PRIMARY, not INDEX...
Where exactly is the problem?


Thanks.

--
___
__
   |-  [EMAIL PROTECTED]
   |ederico Giannici  http://www.neomedia.it
___


RE: Bayes column 'token'

2006-11-15 Thread Mark
 -Original Message-
 From: Federico Giannici [mailto:[EMAIL PROTECTED] 
 Sent: woensdag 15 november 2006 17:42
 To: Mark
 Cc: users@spamassassin.apache.org
 Subject: Re: Bayes column 'token'
 
 
  Well, bayes_mysql.sql does not specify collation; so, like
  you said, the collation will be your MySQL server-set default. And
  searches in MySQL are case-insensitive by default. Might indeed
  perhaps be a good idea to convert to latin1_bin or some such.

 There will be any problem if I convert the current data to the new
 collation?

I see no indication (or reason) in the code that tokens are to be handled
in an case-insensitive manner. The opposite, ere. So, I'm inclined to say
that latin1_bin collation is better. I don't wanna be responsible for
messing up your database, though. :) So I will test this a bit on my
Vmware box.

  PRIMARY for `id` and `token` should not have INDEX for `id`
  and `token` added, too.

 I don't understand what you mean.
 The couple (id, token) is PRIMARY, not INDEX...
 Where exactly is the problem?

PRIMARY, like UNIQUE, always implies INDEX, too. So, adding an extra INDEX
for `id` and `token` basically gives you a double INDEX for them.

- Mark