Michael Parker wrote:
> Kelsey will have to correct me if I'm wrong, but he's not seeing the
> high I/O with the MySQL bayes storage; he's seeing it with the DB_File
> solution.

Oh, I read back over his earlier email. I was confusing DB_File and DBI, thinking we were talking about overhead imposed by using the DBI abstraction layer to get to MySQL. Whoops... Yes, his I/O problems are with DB_File, and he hasn't retested with MySQL yet.


If all we need to fix the I/O problem is switching to MySQL, I'll be ecstatic. But I'm worried...

> The index would have to be loaded into memory, not the entire table.
> The index on bayes_token should be fairly efficient. Given enough
> memory, MySQL can hold a good bit of the index in memory. Practically
> every single one of my queries is served out of memory. Recently I've
> been peaking somewhere around 200 queries per second, with performance
> about equal to, if not slightly better than, DB_File on the same hardware.
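(For concreteness, here's a minimal sketch in Python with MySQLdb of the kind of table and index we're talking about. The column names and types are my illustrative assumptions, not the actual SpamAssassin SQL schema.)

    # Sketch of a bayes_token table keyed on username (Python + MySQLdb).
    # The schema is an illustrative assumption, not SpamAssassin's actual one.
    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="sa",
                           passwd="secret", db="spamassassin")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS bayes_token (
            username    VARCHAR(100) NOT NULL,
            token       VARCHAR(100) NOT NULL,
            spam_count  INT NOT NULL DEFAULT 0,
            ham_count   INT NOT NULL DEFAULT 0,
            atime       INT NOT NULL DEFAULT 0,
            PRIMARY KEY (username, token)  -- the index that gets cached
        )
    """)

    # The per-message hot path: one point lookup per token in the mail.
    # When the (username, token) index is resident in memory, each of
    # these is answered without touching the disk.
    cur.execute("SELECT spam_count, ham_count FROM bayes_token"
                " WHERE username = %s AND token = %s",
                ("kelsey", "sometoken"))
    row = cur.fetchone()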

My concern is what happens when you scale up the number of users. You are serving every one of your queries out of memory, but:


1. How many megabytes are being loaded into memory to get that? With many users, each message that is processed will be for a different user, so that initial load of index data into memory will happen once per message, not just once as in your tests. Is the index file for the bayes_token table something like 4 or 5 megabytes of data for one user?

2. Your entire bayes database may end up cached in memory by the time you have processed one message; everything you do after that would be from memory. With many users, though, the bayes_token records for any one user will be scattered all over the table, so each query will result in a new seek and the read of an entire block. How many queries are there per message? If it's 1,000, that's something like another 4 megabytes read per message.

Reducing the size of the index helps #1. Finding a way to get some locality of reference for a single user's bayes data, as well as reducing the amount of data, will help #2.

I think performance will not be acceptable if each message requires reading on the order of 7 or 8 megabytes from disk.
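(The arithmetic, spelled out; every constant here is an assumption from the discussion above, not a measurement:)

    # Rough per-message disk I/O in the many-user case; all constants
    # are assumptions taken from the discussion above.
    INDEX_BYTES_PER_USER = 4 * 1024 * 1024   # ~4-5 MB of index per user (#1)
    QUERIES_PER_MESSAGE  = 1000              # token lookups per message (#2)
    BLOCK_SIZE           = 4 * 1024          # one block read per scattered row

    data_bytes  = QUERIES_PER_MESSAGE * BLOCK_SIZE   # ~4 MB of table reads
    total_bytes = INDEX_BYTES_PER_USER + data_bytes  # ~8 MB per message

    print("per-message read: %.1f MB" % (total_bytes / 1048576.0))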

> I think you are trying to solve a problem that doesn't need
> to be solved.

I hope so. I was mistaken in thinking Kelsey was running MySQL, so maybe switching is all he needs to fix things. But based on what I said above, I'm worried.


> An integer uid would mean you'd have to
> join to the userpref table, which I don't think
> you'd want to do unless you had to.

No, not a join: you just look up the user's uid once at the start of processing and then use it instead of the username for everything.
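(A sketch of what I mean, again in Python with MySQLdb; the userpref columns here are hypothetical, and the real schema may differ:)

    # Resolve the uid once at the start of processing -- no join needed
    # on the per-token hot path. Table and column names are hypothetical.
    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="sa",
                           passwd="secret", db="spamassassin")
    cur = conn.cursor()

    cur.execute("SELECT id FROM userpref WHERE username = %s", ("kelsey",))
    (uid,) = cur.fetchone()

    # Every subsequent query keys on the integer uid. A primary key of
    # (id, token) makes the index entries smaller (helps #1), and, with
    # a clustered layout like InnoDB's, keeps one user's rows together
    # on disk (helps #2's locality problem).
    cur.execute("SELECT spam_count, ham_count FROM bayes_token"
                " WHERE id = %s AND token = %s",
                (uid, "sometoken"))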


 -- sidney


