I hope I'm doing this right. This is my first time using gmane.
Pretty slick stuff overall. I like it!



Jason R. Mastaler wrote:

> Jesse Guardiani <[EMAIL PROTECTED]> writes:
> 
>> I see the potential in the future for companies using CDB whitelists
>> to need this functionality when the CDB whitelists begin taking too
>> long to rebuild.
> 
> How large are we talking here?  CDB scales better than you might
> think.

LARGE. Unfortunately, I don't want to get into details, as my local
competition may be subscribed to this list.

But the size of the CDB isn't the whole equation. The thing that concerns
me is how often it'll get updated. Sure, it may only take a fraction of
a second to rebuild, but what happens when 10 programs are trying to rebuild
it at once? And 30 are trying to read it?

Even so, after a lot of thinking, my employer and I came up with a
solution that will allow us to continue using CDBs rather than MySQL
almost indefinitely.

But I DO still think that MySQL support is a good idea.

It gives all of the advantages of CDB (quick lookups, small size),
and none of the disadvantages (slow to update, and many concurrent
updates create an always-open condition in which the CDB is no longer
readable).
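
To illustrate the difference being argued for here, this is a rough sketch of a database-backed whitelist where adding one address is a single-row insert rather than a full rebuild. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL so the example is self-contained; the table name, column name, and helper functions are hypothetical and are not part of TMDA.

```python
import sqlite3

# In-memory database as a stand-in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")
# The PRIMARY KEY gives an index, so lookups do not scan the whole table.
conn.execute("CREATE TABLE whitelist (address TEXT PRIMARY KEY)")


def add_address(conn, address):
    """Add one address without rebuilding anything else."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO whitelist (address) VALUES (?)",
            (address.lower(),),
        )


def is_whitelisted(conn, address):
    """Return True if the address is present in the whitelist table."""
    row = conn.execute(
        "SELECT 1 FROM whitelist WHERE address = ?",
        (address.lower(),),
    ).fetchone()
    return row is not None
```

Whether this actually beats a CDB in practice depends on the update rate and list size, which is the open question in this thread.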




> 
> First, cdbmake ensures that the .cdb is updated atomically, so
> programs reading the .cdb never have to wait for cdbmake to finish
> rebuilding.  cdbmake is also extremely fast.  So, there really isn't
> any penalty for a rebuild until you get into some really massive
> whitelists.
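
For anyone following along, the atomic update described above comes from writing the new database to a temporary file and then renaming it over the old one. Below is a minimal Python sketch of that general pattern. It is not cdbmake's actual code, and `atomic_write` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path so readers see the old or new contents, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target
    # for the rename to be atomic, so create it in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

On POSIX systems, a reader that already has the old file open keeps reading the old contents until it reopens the path.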
> 
> I wanted some empirical evidence to support this claim though, so I
> did some measurements of how long it takes TMDA to rebuild .cdb files
> of various sizes.
> 
> I also realized that you will save time if you rebuild the .cdb
> yourself using faster tools, and then use 'from-cdb' instead of
> 'from-file -autocdb'. The question is, at what point is it necessary
> to give up the convenience of -autocdb? So, I benchmarked this too.
> 
> I compared TMDA's Util.build_cdb() to a C program I wrote called
> 'cdbrecords'. cdbrecords simply reads in the list file, converts each
> line to a cdbmake compatible format, and then pipes the results to the
> cdbmake utility (from djb's cdb distribution). e.g,
> 
> $ time python2.2 -c "import Util;Util.build_cdb('1M')"
> 
>   vs.
> 
> $ time ./cdbrecords 1M | cdbmake 1M.cdb 1M.cdb.tmp
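
For reference, cdbmake reads records of the form `+klen,dlen:key->data`, one per line, with an empty line marking the end of input. The conversion step described above can be sketched in Python roughly as follows. This is not the cdbrecords source (which is C and not shown in this thread); the function name is hypothetical, and storing an empty data value for each address is an assumption.

```python
def to_cdbmake_records(lines):
    """Yield cdbmake input records (as bytes) for each usable line of a list file."""
    for line in lines:
        entry = line.strip()
        # Skip blank lines and '#' comments.
        if not entry or entry.startswith("#"):
            continue
        key = entry.encode("utf-8")
        data = b""  # assumption: the address is the key and the value is empty
        # Lengths are byte counts of the key and the data.
        yield b"+%d,%d:%s->%s\n" % (len(key), len(data), key, data)
    # A final empty line tells cdbmake that the input is complete.
    yield b"\n"
```

The output would be written to the standard input of cdbmake, in the same way the shell pipeline above does.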
> 
> The list files I tested contained randomly generated, unique e-mail
> addresses ranging in number from 1,000 to 10,000,000. Here are the
> results:
> 
> # of uniq addresses  TMDA/Python     C/cdbmake
> ===================  ===========     =========
> 1K                   0m0.602s        0m0.030s
> 10K                  0m0.470s        0m0.059s
> 100K                 0m3.794s        0m0.322s
> 1M                   0m39.039s       0m3.062s
> 10M                  3m11.144s       0m39.246s
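
To compare the two columns more directly, the times above can be converted to seconds and divided. A small Python sketch using only the figures from the table (the helper names are made up for this example):

```python
def parse_time(text):
    """Convert a shell 'time' value such as '3m11.144s' into seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def speedup(python_time, c_time):
    """Return how many times faster the C/cdbmake run was than TMDA/Python."""
    return parse_time(python_time) / parse_time(c_time)


# (size, TMDA/Python, C/cdbmake) rows copied from the table above.
RESULTS = [
    ("1K", "0m0.602s", "0m0.030s"),
    ("10K", "0m0.470s", "0m0.059s"),
    ("100K", "0m3.794s", "0m0.322s"),
    ("1M", "0m39.039s", "0m3.062s"),
    ("10M", "3m11.144s", "0m39.246s"),
]

for size, py_time, c_time in RESULTS:
    print(f"{size:>4}: C/cdbmake is {speedup(py_time, c_time):.1f}x faster")
```

By these numbers the gap is roughly 12x to 13x at 100K and 1M addresses and narrows to a little under 5x at 10M.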

You didn't list your machine specs or your machine load while doing this
test.

I'll admit that this is MUCH more impressive than I originally thought.

However, if you had 1 million users in a whitelist, chances are that it
would be updated AND accessed quite a bit. Eventually, you may run into
problems with file descriptors being always open.

That's the only reason why I'd like to see MySQL support.

And no, I don't have any direct experience with this sort of problem. I'm
basing my argument purely on the paper written by Matt Simerson regarding 
frequent updates to a CDB file by vpopmail. You can read it here:

http://matt.simerson.net/computing/mail/qmail/qmail.toaster.open-smtp_writeup.txt

That may never happen with ANY TMDA system, but I don't want to bet on it.

Thanks!


> 
> As expected, the C/cdbmake combo is much faster. With TMDA, there is
> Python overhead to pay when large list files are being read and
> processed before the cdbmake.  The C code is faster of course, but you
> can also tailor it to your needs (e.g, you may not need to strip blank
> lines or look for '#' comments to skip like TMDA does).
> 
> Which route you take is a function of the size of your lists and the
> frequency with which they must be rebuilt, but hopefully these numbers
> will help.

-- 
Jesse Guardiani, Systems Administrator
WingNET Internet Services,
P.O. Box 2605 // Cleveland, TN 37320-2605
423-559-LINK (v)  423-559-5145 (f)
http://www.wingnet.net

We are actively looking for companies that do a lot of long
distance faxing and want to cut their long distance bill by
up to 50%.  Contact [EMAIL PROTECTED] for more info.

