Personally, I like the second option (adding options with "-cdb" for CDB
files) rather than the first one (requiring a specific naming scheme).
I've already implemented CDB support in the code for the next version,
so spamdyke can read some of qmail's control files for recipient
validation. Adding CDB support to other options wouldn't take much
extra effort.

The big question, of course, is whether it's worth it. I know DJB says
CDB files are the bee's knees, but I must say (after reading his docs
and his source code and writing my own code for spamdyke) that I'm not
impressed. I'm sure they're more efficient than text files for large
amounts of data (hundreds of thousands of entries). But for small data
sets (hundreds of entries) I don't believe they're any more efficient,
and for tiny data sets (ten entries) they are hugely wasteful. When you
consider the additional headache of having to keep the CDB file in sync
with the ASCII source, I really don't see the point.

Of course I haven't benchmarked anything, so I could be way off base.
DJB has a PhD and teaches computer science; I don't. He probably
analyzed his hash functions to minimize collisions and compared
operational complexities and so forth... academics do that kind of stuff
for fun.

In a nutshell, here's how a CDB file is accessed:
1. Calculate hash
2. Seek to position within CDB, read 8 bytes (64 bits) of data
   (primary hash table)
3. A few more calculations
4. Seek to another position within CDB, read another 8 bytes of data
   (secondary hash table)
5. A few more calculations
6. Seek to a third position within the CDB, read another 8 bytes of
   data (header entry)
7. Compare the header entry to the desired data
8. If it matches, seek to a fourth position within the CDB, read the
   data record
9. If it does not match, go back to the secondary hash table and look
   in the next "slot" for your data. Repeat until your data is found
   (or an empty slot shows it isn't there).
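
To make the steps above concrete, here is a rough Python sketch of a
lookup that counts its seek/read pairs. It is not spamdyke code: the
names cdb_hash, cdb_build and cdb_lookup are made up for illustration,
and cdb_build is only a stand-in for cdbmake so the example can run on
its own. The layout follows DJB's published cdb description as I
understand it (a 2048-byte table of 256 eight-byte pointers, eight-byte
slots, an eight-byte length header before each record). Reading the key
and the data in a single read is a simplification on my part.

```python
import io
import struct

def cdb_hash(key: bytes) -> int:
    """DJB's cdb hash: start at 5381, then h = (h * 33) XOR byte, 32 bits."""
    h = 5381
    for byte in key:
        h = (((h << 5) + h) ^ byte) & 0xFFFFFFFF
    return h

def cdb_build(items: dict) -> bytes:
    """Build a CDB image in memory (illustrative stand-in for cdbmake)."""
    buckets = [[] for _ in range(256)]
    body = bytearray()
    pos = 2048  # records start right after the 256 * 8 byte pointer table
    for key, value in items.items():
        body += struct.pack("<II", len(key), len(value)) + key + value
        h = cdb_hash(key)
        buckets[h & 255].append((h, pos))
        pos += 8 + len(key) + len(value)
    header = bytearray()
    tables = bytearray()
    for entries in buckets:
        nslots = 2 * len(entries)  # tables are kept half empty
        header += struct.pack("<II", pos, nslots)
        slots = [(0, 0)] * nslots
        for h, rec_pos in entries:
            i = (h >> 8) % nslots
            while slots[i][1] != 0:  # linear probing on collision
                i = (i + 1) % nslots
            slots[i] = (h, rec_pos)
        for slot in slots:
            tables += struct.pack("<II", *slot)
        pos += 8 * nslots
    return bytes(header + body + tables)

def cdb_lookup(f, key: bytes):
    """Return (value or None, number of seek/read pairs used)."""
    reads = 0

    def read_at(offset, size):
        nonlocal reads
        reads += 1
        f.seek(offset)
        return f.read(size)

    h = cdb_hash(key)
    # Primary table: where the secondary table lives and how big it is.
    table_pos, nslots = struct.unpack("<II", read_at((h & 255) * 8, 8))
    if nslots == 0:
        return None, reads
    start = (h >> 8) % nslots
    for probe in range(nslots):
        slot = (start + probe) % nslots
        # Secondary table: one slot holding (hash, record position).
        slot_hash, rec_pos = struct.unpack(
            "<II", read_at(table_pos + slot * 8, 8))
        if rec_pos == 0:
            return None, reads  # empty slot: the key is not in the file
        if slot_hash != h:
            continue  # collision within the table, try the next slot
        # Record header: key length and data length.
        klen, dlen = struct.unpack("<II", read_at(rec_pos, 8))
        record = read_at(rec_pos + 8, klen + dlen)
        if record[:klen] == key:
            return record[klen:], reads
    return None, reads
```

With a single record in the file there can be no collision, so
cdb_lookup(io.BytesIO(cdb_build({b"alice": b"1"})), b"alice") should
cost exactly four seek/read pairs. Every extra slot probed adds at
least one more.
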
Except for the secondary hash table, which I don't see a need for, this
describes a textbook hash table from freshman computer science classes.
The seek/read operations are the most expensive operations (the math
takes no time at all) because they require the program to wait for
access to a spinning disk. If everything goes well and there are no
hash collisions, reading a single entry from a CDB file requires 4
separate seek/read operations within the file. If things go badly and
there are hash collisions, reading an entry from a CDB file may take
many more seek/read operations (theoretically it could read the entire
file). By comparison, when spamdyke reads a text file, it loads 64 KB
at a time (if possible) and parses the lines in memory. This is a win
when the file is small or the entry is near the beginning. It's a huge
win when the file is tiny (like most /etc/tcp.smtp files).
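
For comparison, a minimal sketch of the text-file approach: read up to
64 KB per call and scan the lines in memory. This is not spamdyke's
actual parser. The name text_file_contains is made up, and matching a
whole line exactly is an assumption to keep the example short.

```python
import io

CHUNK_SIZE = 64 * 1024  # 64 KB per read, as described above

def text_file_contains(f, wanted: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (found, number of reads). Assumes wanted is not empty."""
    reads = 0
    leftover = b""  # partial line carried over from the previous chunk
    while True:
        chunk = f.read(chunk_size)
        reads += 1
        if not chunk:
            # End of file: leftover is a last line with no newline.
            return leftover.strip() == wanted, reads
        lines = (leftover + chunk).split(b"\n")
        leftover = lines.pop()  # may be an incomplete line
        for line in lines:
            if line.strip() == wanted:
                return True, reads
```

For a file of a few lines, a hit costs one read and a miss costs two
(the second read just reports end of file).
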
So I said all that to say this: I don't personally believe CDB files
live up to the hype, nor do I believe they solve any real-world problems
(they're still binary formats, they can't be shared between servers,
etc.), but if people want them I can support them.
-- Sam Clippinger
[email protected] wrote:
> Dear all,
>
> I have been reading up on the discussions on this list as well as the
> concerns about databases in the FAQ. Whilst I concur with most of the
> points with respect to a fully fledged SQL database, I think that CDBs are
> ideally suited for the purposes of spamdyke. Sam states in the FAQ
> that speed, memory, concurrency, portability and availability are not
> a concern with CDBs and I agree, especially on the speed issue. After
> all, that was what the hash file format was designed for.
>
> That leaves accessibility and safety for CDBs. It is true that the
> database itself is in binary form (that is where the speed comes
> from), which means that they cannot be easily viewed and checked for
> errors. At the same time, they are read only and are usually generated
> from a plain text file as input. There is no reason to not have that
> text file sitting next to the actual database file, which means we
> have all the advantages of a plain text file plus the speed benefit of
> CDBs, which can be substantial for a lot of entries. The only
> additional step required (by the admin) would be to convert the text
> file into the CDB. We could also have the best of both worlds like
> this. Suppose we have this entry in the configuration file:
>
> recipient-blacklist-file=/etc/spamdyke/recipient-blacklist
>
>
> First, we look for a file with the name
> /etc/spamdyke/recipient-blacklist.cdb. If it exists, we assume it is a
> CDB version of /etc/spamdyke/recipient-blacklist and look up whatever
> we need there. If recipient-blacklist.cdb has an earlier modification
> time than recipient-blacklist (we get that for free anyway with a
> stat() on both files), we could help the admin by printing a warning
> that the CDB is probably out of date and read from recipient-blacklist
> instead. If recipient-blacklist.cdb does not exist, we use
> recipient-blacklist in ASCII format like before.
>
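
A rough sketch of the fallback logic proposed above, for illustration
only. choose_list_file is a made-up name, not an existing spamdyke
function, and the warning text is invented. The only assumption beyond
the proposal is that a CDB with no text file next to it is used as-is.

```python
import os

def choose_list_file(text_path: str, warn=print) -> str:
    """Pick the .cdb file if it exists and is not older than the text."""
    cdb_path = text_path + ".cdb"
    if not os.path.exists(cdb_path):
        return text_path  # no CDB: read the ASCII file like before
    if not os.path.exists(text_path):
        return cdb_path  # assumption: a CDB without its source is used
    if os.stat(cdb_path).st_mtime < os.stat(text_path).st_mtime:
        warn("%s is older than %s, reading the text file instead"
             % (cdb_path, text_path))
        return text_path
    return cdb_path
```
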
>
> Another version of this would be to have lots of new configuration
> options like:
>
> recipient-blacklist-file-cdb=/etc/spamdyke/recipient-blacklist.cdb
>
> That makes it possible to name the database file arbitrarily. If we
> want the safety checks like in the example above we could make it
> mandatory to name the ASCII input file for the CDB database file:
>
> recipient-blacklist-file=/etc/spamdyke/recipient-blacklist
> recipient-blacklist-file-cdb=/etc/spamdyke/recipient-blacklist.cdb
>
> That way all the fallbacks to ASCII plus warnings can be implemented at
> the cost of more configuration entries.
>
>
> What do you think?
>
>
_______________________________________________
spamdyke-users mailing list
[email protected]
http://www.spamdyke.org/mailman/listinfo/spamdyke-users