Re: [spamdyke-users] Databases revisited

Sam Clippinger Thu, 22 Oct 2009 11:22:16 -0700

Personally, I like the second option (adding options with "-cdb" for CDB 
files) rather than the first one (requiring a specific naming scheme).

I've already implemented CDB support in the code for the next version, 
so spamdyke can read some of qmail's control files for recipient 
validation.  Adding CDB support to other options wouldn't take much 
extra effort.  The big question, of course, is whether it's worth it.

I know DJB says CDB files are the bee's knees but I must say (after 
reading his docs, his source code and writing my own code for spamdyke) 
that I'm not impressed.  I'm sure they're more efficient than text files 
for large amounts of data (hundreds of thousands of entries).  But for 
small data sets (hundreds of entries) I don't believe they're any more 
efficient and for tiny data sets (ten entries) they are hugely 
wasteful.  When you consider the additional headache of having to keep 
the CDB file in sync with the ASCII source, I really don't see the point.

Of course I haven't benchmarked anything, so I could be way off base.  
DJB has a PhD and teaches computer science; I don't.  He probably 
analyzed his hash functions to minimize collisions and compared 
operational complexities and so forth... academics do that kind of stuff 
for fun.  In a nutshell, here's how a CDB file is accessed:
    Calculate hash
    Seek to position within CDB, read 8 bytes of data (primary hash table)
    A few more calculations
    Seek to another position within CDB, read another 8 bytes of data
      (secondary hash table)
    A few more calculations
    Seek to a third position within the CDB, read another 8 bytes of
      data (header entry)
    Compare the header entry to the desired data
    If it matches, seek to a fourth position within the CDB, read the
      data record
    If it does not match, go back to the secondary hash table and look
      in the next "slot" for your data. Repeat until your data is found
      or an empty slot shows it isn't there.
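
To make those steps concrete, here is a minimal Python sketch of a CDB lookup. It is not spamdyke's code. It follows DJB's published cdb layout as I understand it (a 2048-byte header of 256 pointers, 8-byte hash table slots, records prefixed by two 32-bit lengths), and it works on an in-memory byte string, so each "seek/read" is just an offset into the buffer. The names cdb_hash, cdb_make and cdb_get are made up for this example; cdb_make exists only so the sketch can build its own test data.

```python
import struct

def cdb_hash(key: bytes) -> int:
    """DJB's cdb hash: h = (h * 33) XOR byte, starting from 5381."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h

def cdb_make(items):
    """Build a CDB image in memory from (key, value) byte pairs."""
    records = bytearray()
    buckets = [[] for _ in range(256)]
    pos = 2048  # records start right after the 2048-byte header
    for key, value in items:
        records += struct.pack("<II", len(key), len(value)) + key + value
        h = cdb_hash(key)
        buckets[h & 0xFF].append((h, pos))
        pos += 8 + len(key) + len(value)
    header = bytearray()
    tables = bytearray()
    for bucket in buckets:
        nslots = 2 * len(bucket)  # half-empty tables keep probing short
        header += struct.pack("<II", pos, nslots)
        slots = [(0, 0)] * nslots
        for h, rec_pos in bucket:
            i = (h >> 8) % nslots
            while slots[i][1] != 0:  # linear probing on collision
                i = (i + 1) % nslots
            slots[i] = (h, rec_pos)
        for h, rec_pos in slots:
            tables += struct.pack("<II", h, rec_pos)
        pos += 8 * nslots
    return bytes(header + records + tables)

def cdb_get(image: bytes, key: bytes):
    """Return the value stored for key, or None if it is absent."""
    h = cdb_hash(key)
    # Read 1: the 8-byte pointer in the primary table (the header).
    table_pos, nslots = struct.unpack_from("<II", image, (h & 0xFF) * 8)
    if nslots == 0:
        return None
    start = (h >> 8) % nslots
    for n in range(nslots):
        # Read 2 (repeated on collisions): an 8-byte secondary table slot.
        slot_pos = table_pos + 8 * ((start + n) % nslots)
        slot_hash, rec_pos = struct.unpack_from("<II", image, slot_pos)
        if rec_pos == 0:
            return None  # empty slot: the key is not in the file
        if slot_hash != h:
            continue  # another key landed here, try the next slot
        # Read 3: the 8-byte record header (key length, data length).
        klen, dlen = struct.unpack_from("<II", image, rec_pos)
        key_start = rec_pos + 8
        if image[key_start:key_start + klen] == key:
            # Read 4: the data itself, stored right after the key.
            return image[key_start + klen:key_start + klen + dlen]
    return None
```

For example, cdb_get(cdb_make([(b"alice@example.com", b"1")]), b"alice@example.com") walks the header, one slot and one record before returning b"1". A real reader would issue each of those reads against the file on disk.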

Except for the secondary hash table, which I don't see a need for, this 
describes a textbook hash table from freshman computer science classes.  
The seek/read operations are the most expensive operations (the math 
takes no time at all) because they require the program to wait for 
access to a spinning disk.  If everything goes well and there are no 
hash collisions, reading a single entry from a CDB file requires 4 
separate seek/read operations within the file.  If things go badly and 
there are hash collisions, reading an entry from a CDB file may take 
many more seek/read operations (theoretically it could read the entire 
file).  By comparison, when spamdyke reads a text file, it loads 64 KB 
at a time (if possible) and parses the lines in memory.  This is a win 
when the file is small or the entry is near the beginning.  It's a huge 
win when the file is tiny (like most /etc/tcp.smtp files).
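
For comparison, the text file approach looks roughly like the Python sketch below. This only illustrates the idea (read a big chunk, split it into lines in memory, compare); it is not spamdyke's actual parser, and the function name file_contains_line and its exact-match rule are assumptions made for the example.

```python
def file_contains_line(path, wanted, chunk_size=64 * 1024):
    """Return True if any line of the file equals wanted (bytes)."""
    leftover = b""  # partial line carried over from the previous chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # one read covers many entries
            if not chunk:
                break
            lines = (leftover + chunk).split(b"\n")
            leftover = lines.pop()  # last piece may be an incomplete line
            for line in lines:
                if line.strip() == wanted:
                    return True
    # The file may not end with a newline, so check the final piece too.
    return leftover.strip() == wanted
```

A file under 64 KB is handled with a single read, and a match near the top of a larger file returns before the rest is touched.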

So I said all that to say this: I don't personally believe CDB files 
live up to the hype, nor do I believe they solve any real-world problems 
(they're still binary formats, they can't be shared between servers, 
etc) but if people want them I can support them.

-- Sam Clippinger

[email protected] wrote:
> Dear all,
>
> I have been reading up on the discussions on this list as well as the
> concerns about databases in the FAQ. Whilst I concur with most of the
> points wrt. a fully fledged SQL database, I think that CDBs are
> ideally suited for the purposes of spamdyke. Sam states in the FAQ
> that speed, memory, concurrency, portability and availability are not
> a concern with CDBs and I agree, especially on the speed issue. After
> all, that was what the hash file format was designed for. 
>
> That leaves accessibility and safety for CDBs. It is true that the
> database itself is in binary form (that is where the speed comes
> from), which means that they cannot be easily viewed and checked for
> errors. At the same time, they are read only and are usually generated
> from a plain text file as input. There is no reason to not have that
> text file sitting next to the actual database file, which means we
> have all the advantages of a plain text file plus the speed benefit of
> CDBs, which can be substantial for a lot of entries. The only
> additional step required (by the admin) would be to convert the text
> file into the CDB. We could also have the best of both worlds like
> this. Suppose we have this entry in the configuration file:
>
> recipient-blacklist-file=/etc/spamdyke/recipient-blacklist
>
>
> First, we look for a file with the name
> /etc/spamdyke/recipient-blacklist.cdb. If it exists, we assume it is a
> CDB version of /etc/spamdyke/recipient-blacklist and look up whatever
> we need there. If recipient-blacklist.cdb has an earlier modification
> time than recipient-blacklist (we get that for free anyway with a
> stat() on both files), we could help the admin by printing a warning
> that the CDB is probably out of date and read from recipient-blacklist
> instead. If recipient-blacklist.cdb does not exist, we use
> recipient-blacklist in ASCII format like before.
>
>
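The fallback rule proposed in this first option could be sketched in Python as follows. This is only an illustration of the proposal, not anything spamdyke does today; the function name pick_list_file and the wording of the warning are made up for the example.

```python
import os

def pick_list_file(text_path):
    """Choose between text_path and text_path + '.cdb'.

    Returns (path_to_read, warning_or_None).
    """
    cdb_path = text_path + ".cdb"
    try:
        cdb_mtime = os.stat(cdb_path).st_mtime
    except FileNotFoundError:
        # No CDB version exists: use the ASCII file like before.
        return text_path, None
    if cdb_mtime < os.stat(text_path).st_mtime:
        # The CDB is older than its source, so it is probably stale.
        warning = "%s is older than %s; reading the text file instead" % (
            cdb_path, text_path)
        return text_path, warning
    return cdb_path, None
```

With recipient-blacklist-file=/etc/spamdyke/recipient-blacklist, this would be called as pick_list_file("/etc/spamdyke/recipient-blacklist") and costs at most two stat() calls.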
> Another version of this would be to have lots of new configuration
> options like:
>
> recipient-blacklist-file-cdb=/etc/spamdyke/recipient-blacklist.cdb
>
> That makes it possible to name the database file arbitrarily. If we
> want the safety checks like in the example above we could make it
> mandatory to name the ASCII input file for the CDB database file:
>
> recipient-blacklist-file=/etc/spamdyke/recipient-blacklist
> recipient-blacklist-file-cdb=/etc/spamdyke/recipient-blacklist.cdb
>
> That way all the fallbacks to ASCII plus warnings can be implemented at
> the cost of more configuration entries.
>
>
> What do you think?
>
>   
_______________________________________________
spamdyke-users mailing list
[email protected]
http://www.spamdyke.org/mailman/listinfo/spamdyke-users
