Jesse Guardiani <[EMAIL PROTECTED]> writes:

> I see the potential in the future for companies using CDB whitelists
> to need this functionality when the CDB whitelists begin taking too
> long to rebuild.

How large are we talking here?  CDB scales better than you might
think.  

First, cdbmake ensures that the .cdb is updated atomically, so
programs reading the .cdb never have to wait for cdbmake to finish
rebuilding.  cdbmake is also extremely fast.  So, there really isn't
any penalty for a rebuild until you get into some really massive
whitelists.
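
For anyone curious how the atomic update works: cdbmake writes the new
database to the temporary file you name on its command line and then
moves it over the target. The same pattern can be sketched in Python
as below. This is only an illustration, not code from TMDA or cdb; the
name atomic_write and the ".tmp" suffix are made up for the example,
and the temporary file is assumed to live on the same filesystem as
the target so that the rename is atomic.

```python
import os
import tempfile


def atomic_write(target, data, tmp=None):
    """Write data to a temp file, then rename it over target.

    Hypothetical helper for illustration only. Readers that open
    `target` see either the old file or the new one, never a
    partially written file.
    """
    tmp = tmp or target + ".tmp"  # assumed naming convention
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming
    os.replace(tmp, target)  # atomic when tmp and target share a filesystem


if __name__ == "__main__":
    directory = tempfile.mkdtemp()
    target = os.path.join(directory, "whitelist.cdb")
    atomic_write(target, b"old contents")
    atomic_write(target, b"new contents")
    with open(target, "rb") as f:
        print(f.read())
```

A reader that already has the old file open keeps reading the old
data until it reopens the file, which is why lookups never block on a
rebuild.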

I wanted some empirical evidence to support this claim though, so I
did some measurements of how long it takes TMDA to rebuild .cdb files
of various sizes.

I also realized that you will save time if you rebuild the .cdb
yourself using faster tools, and then use 'from-cdb' instead of
'from-file -autocdb'. The question is, at what point is it necessary
to give up the convenience of -autocdb? So, I benchmarked this too.

I compared TMDA's Util.build_cdb() to a C program I wrote called
'cdbrecords'. cdbrecords simply reads in the list file, converts each
line to a cdbmake compatible format, and then pipes the results to the
cdbmake utility (from djb's cdb distribution), e.g.:

$ time python2.2 -c "import Util;Util.build_cdb('1M')"

  vs.

$ time ./cdbrecords 1M | cdbmake 1M.cdb 1M.cdb.tmp
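
To make the conversion step concrete, here is a rough Python sketch of
what a cdbrecords-style filter does. It is not the C program itself.
cdbmake reads records of the form "+klen,dlen:key->data" followed by a
newline, and a blank line marks the end of input. The choices of
lowercasing the whole line as the key, storing an empty value, and
skipping blank and '#' lines are assumptions made for the example.

```python
import sys


def cdbmake_records(lines):
    """Yield cdbmake input records for each usable line (bytes in, bytes out)."""
    for line in lines:
        line = line.strip()
        # Assumption: skip blank lines and '#' comment lines.
        if not line or line.startswith(b"#"):
            continue
        key = line.lower()  # assumption: case-insensitive address keys
        data = b""          # assumption: the key's presence is all that matters
        yield b"+%d,%d:%s->%s\n" % (len(key), len(data), key, data)
    yield b"\n"  # a blank line terminates cdbmake's input


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], "rb") as f:
        sys.stdout.buffer.writelines(cdbmake_records(f))
```

Saved under a hypothetical name such as records.py, it would be used
the same way as the C version:

$ python3 records.py 1M | cdbmake 1M.cdb 1M.cdb.tmp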

The list files I tested contained randomly generated, unique e-mail
addresses ranging in number from 1,000 to 10,000,000. Here are the
results:

# of uniq addresses  TMDA/Python     C/cdbmake
===================  ===========     =========
1K                   0m0.602s        0m0.030s
10K                  0m0.470s        0m0.059s
100K                 0m3.794s        0m0.322s
1M                   0m39.039s       0m3.062s
10M                  3m11.144s       0m39.246s

As expected, the C/cdbmake combo is much faster, by roughly 5x to 20x
across these sizes. With TMDA, there is Python overhead to pay when
large list files are read and processed before the cdbmake step. The
C code is faster of course, but you can also tailor it to your needs
(e.g., you may not need to strip blank lines or skip '#' comments
like TMDA does).

Which route you take is a function of the size of your lists and the
frequency with which they must be rebuilt, but hopefully these numbers
will help.
_________________________________________________
tmda-workers mailing list ([EMAIL PROTECTED])
http://tmda.net/lists/listinfo/tmda-workers