Re: [spamdyke-users] Databases revisited

Michael Colvin Thu, 22 Oct 2009 13:04:27 -0700

After looking into QMT, which has recipient validation built in, I'm not
sure Spamdyke really needs it...  The implementation in QMT allows for
VPOPmail and non-VPOPmail qmail servers to easily validate recipients.  If
Spamdyke implemented a version based on cdb files, then on VPOPmail servers
something would have to be put in place to build those cdb files from the
VPOPmail database.


Spamdyke is fantastic at what it does.  I'm not sure that it needs to be
complicated.  Of course, as long as the validation is easy enough to
disable, then I guess it wouldn't matter, and non-VPOPmail users could
enable it and use the cdb files...  If Spamdyke included the ability to
validate against the VPOPmail database, I'm not sure it would be any more or
less efficient than the patch that's included in QMT.  Eric?

Of course, adding it to Spamdyke would let people add that functionality
to a completely stock qmail server (as opposed to a toaster) without
applying ANY patches, which might be useful in some instances, I suppose.


 
Michael J. Colvin
NorCal Internet Services
www.norcalisp.com
 



> -----Original Message-----
> From: [email protected] [mailto:spamdyke-users-
> [email protected]] On Behalf Of Eric Shubert
> Sent: Thursday, October 22, 2009 11:41 AM
> To: [email protected]
> Subject: Re: [spamdyke-users] Databases revisited
> 
> Nice piece, Sam.
> 
> In addition, the OS will likely have cached spamdyke's config file(s)
> anyhow, so I expect any real performance gain would be negligible.
> 
> The bottom line to me is that there are a host of other inefficiencies
> (pardon the pun) that would bring a mail server to its knees long before
> optimization of spamdyke's config files could provide any relief.
> 
> Sam Clippinger wrote:
> > Personally, I like the second option (adding options with "-cdb" for CDB
> > files) rather than the first one (requiring a specific naming scheme).
> >
> > I've already implemented CDB support in the code for the next version,
> > so spamdyke can read some of qmail's control files for recipient
> > validation.  Adding CDB support to other options wouldn't take much
> > extra effort.  The big question, of course, is whether it's worth it.
> >
> > I know DJB says CDB files are the bee's knees but I must say (after
> > reading his docs, his source code and writing my own code for spamdyke)
> > that I'm not impressed.  I'm sure they're more efficient than text files
> > for large amounts of data (hundreds of thousands of entries).  But for
> > small data sets (hundreds of entries) I don't believe they're any more
> > efficient and for tiny data sets (ten entries) they are hugely
> > wasteful.  When you consider the additional headache of having to keep
> > the CDB file in sync with the ASCII source, I really don't see the point.
> >
> > Of course I haven't benchmarked anything, so I could be way off base.
> > DJB has a PhD and teaches computer science, I don't.  He probably
> > analyzed his hash functions to minimize collisions and compared
> > operational complexities and so forth... academics do that kind of stuff
> > for fun.  In a nutshell, here's how a CDB file is accessed:
> >     Calculate hash
> >     Seek to position within CDB, read 64 bytes of data
> > (primary hash table)
> >     A few more calculations
> >     Seek to another position within CDB, read another 64 bytes of data
> > (secondary hash table)
> >     A few more calculations
> >     Seek to a third position within the CDB, read another 64 bytes of
> > data (header entry)
> >     Compare the header entry to the desired data
> >     If it matches, seek to a fourth position within the CDB, read the
> > data record
> >     If it does not match, go back to the secondary hash table and look
> > in the next "slot" for your data. Repeat until your data is found.
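
The lookup steps above can be sketched in Python. This is a minimal sketch, not spamdyke's code: it assumes the publicly documented cdb layout (a 2048-byte header of 256 position/length pairs, then records, then slot tables of 8-byte hash/position pairs), and the names cdb_hash, cdb_write and cdb_lookup are hypothetical. The small writer exists only so the example can run on its own. Under that layout each table read is 8 bytes; the point here is the number of seeks, not the read sizes.

```python
import struct


def cdb_hash(key: bytes) -> int:
    """Hash used by the cdb format: h = (h * 33) ^ c, starting at 5381."""
    h = 5381
    for c in key:
        h = ((h * 33) ^ c) & 0xFFFFFFFF
    return h


def cdb_write(path, items):
    """Write (key, value) byte pairs to a cdb-style file (illustration only)."""
    buckets = [[] for _ in range(256)]
    with open(path, "wb") as f:
        f.write(b"\0" * 2048)  # placeholder for the header table
        for key, value in items:
            pos = f.tell()
            f.write(struct.pack("<II", len(key), len(value)) + key + value)
            h = cdb_hash(key)
            buckets[h & 255].append((h, pos))
        header = b""
        for bucket in buckets:
            nslots = 2 * len(bucket)  # tables are kept half empty
            slots = [(0, 0)] * nslots
            for h, pos in bucket:
                i = (h >> 8) % nslots
                while slots[i][1] != 0:  # linear probing on collision
                    i = (i + 1) % nslots
                slots[i] = (h, pos)
            header += struct.pack("<II", f.tell(), nslots)
            for h, pos in slots:
                f.write(struct.pack("<II", h, pos))
        f.seek(0)
        f.write(header)


def cdb_lookup(path, key):
    """Return the value stored for key, or None if it is absent."""
    h = cdb_hash(key)
    with open(path, "rb") as f:
        # Seek 1: header entry for this hash (the "primary" table).
        f.seek((h & 255) * 8)
        table_pos, nslots = struct.unpack("<II", f.read(8))
        if nslots == 0:
            return None
        start = (h >> 8) % nslots
        for n in range(nslots):
            # Seek 2 (repeated on collisions): a slot in the "secondary" table.
            f.seek(table_pos + ((start + n) % nslots) * 8)
            slot_hash, rec_pos = struct.unpack("<II", f.read(8))
            if rec_pos == 0:
                return None  # empty slot: the key is not in the file
            if slot_hash != h:
                continue
            # Seek 3: the record itself (lengths, then key, then data).
            f.seek(rec_pos)
            klen, dlen = struct.unpack("<II", f.read(8))
            if f.read(klen) == key:
                return f.read(dlen)
    return None
```

With no collisions, cdb_lookup performs three seeks and reads the key and data sequentially after the third; each collision adds another slot read and possibly another record read.
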
> >
> > Except for the secondary hash table, which I don't see a need for, this
> > describes a textbook hash table from freshman computer science classes.
> > The seek/read operations are the most expensive operations (the math
> > takes no time at all) because they require the program to wait for
> > access to a spinning disk.  If everything goes well and there are no
> > hash collisions, reading a single entry from a CDB file requires 4
> > separate seek/read operations within the file.  If things go badly and
> > there are hash collisions, reading an entry from a CDB file may take
> > many more read/seek operations (theoretically it could read the entire
> > file).  By comparison, when spamdyke reads a text file, it loads 64 KB
> > at a time (if possible) and parses the lines in memory.  This is a win
> > when the file is small or the entry is near the beginning.  It's a huge
> > win when the file is tiny (like most /etc/tcp.smtp files).
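
For comparison, the text-file approach described above might look roughly like this in Python. It is a sketch under assumptions, not spamdyke's actual implementation: the function name text_file_contains is hypothetical, and exact whole-line matching is assumed for simplicity.

```python
CHUNK_SIZE = 64 * 1024  # read up to 64 KB per read() call


def text_file_contains(path, wanted):
    """Return True if any line of the file equals wanted (bytes)."""
    leftover = b""  # partial line carried over from the previous chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            lines = (leftover + chunk).split(b"\n")
            leftover = lines.pop()  # last piece may be an incomplete line
            for line in lines:
                if line.strip() == wanted:
                    return True  # stop early: no further reads needed
    # The file may not end with a newline, so check the final piece too.
    return leftover.strip() == wanted
```

A file smaller than 64 KB is searched after a single read, and a match near the start of a larger file ends the search before the rest is read.
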
> >
> > So I said all that to say this: I don't personally believe CDB files
> > live up to the hype, nor do I believe they solve any real-world problems
> > (they're still binary formats, they can't be shared between servers,
> > etc) but if people want them I can support them.
> >
> > -- Sam Clippinger
> >
> > [email protected] wrote:
> >> Dear all,
> >>
> >> I have been reading up on the discussions on this list as well as the
> >> concerns about databases in the FAQ. Whilst I concur with most of the
> >> points with respect to a fully fledged SQL database, I think that CDBs
> >> are ideally suited for the purposes of spamdyke. Sam states in the FAQ
> >> that speed, memory, concurrency, portability and availability are not
> >> a concern with CDBs and I agree, especially on the speed issue. After
> >> all, that was what the hash file format was designed for.
> >>
> >> That leaves accessibility and safety for CDBs. It is true that the
> >> database itself is in binary form (that is where the speed comes
> >> from), which means that they cannot be easily viewed and checked for
> >> errors. At the same time, they are read only and are usually generated
> >> from a plain text file as input. There is no reason to not have that
> >> text file sitting next to the actual database file, which means we
> >> have all the advantages of a plain text file plus the speed benefit of
> >> CDBs, which can be substantial for a lot of entries. The only
> >> additional step required (by the admin) would be to convert the text
> >> file into the CDB. We could also have the best of both worlds like
> >> this. Suppose we have this entry in the configuration file:
> >>
> >> recipient-blacklist-file=/etc/spamdyke/recipient-blacklist
> >>
> >>
> >> First, we look for a file with the name
> >> /etc/spamdyke/recipient-blacklist.cdb. If it exists, we assume it is a
> >> CDB version of /etc/spamdyke/recipient-blacklist and look up whatever
> >> we need there. If recipient-blacklist.cdb has an earlier modification
> >> time than recipient-blacklist (we get that for free anyway with a
> >> stat() on both files), we could help the admin by printing a warning
> >> that the CDB is probably out of date and read from recipient-blacklist
> >> instead. If recipient-blacklist.cdb does not exist, we use
> >> recipient-blacklist in ASCII format like before.
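
The fallback logic proposed above can be sketched as follows. This is only an illustration of the idea in Python, not existing spamdyke behavior: the function name choose_list_file and its return shape are hypothetical, and the ".cdb" suffix convention is the one proposed in the message.

```python
import os


def choose_list_file(text_path):
    """Pick which file to read for a configured list.

    Returns (path, is_cdb, warning), where warning is None or a message
    for the admin.
    """
    cdb_path = text_path + ".cdb"
    try:
        cdb_mtime = os.stat(cdb_path).st_mtime
    except FileNotFoundError:
        # No CDB version: read the ASCII file as before.
        return text_path, False, None
    try:
        text_mtime = os.stat(text_path).st_mtime
    except FileNotFoundError:
        # Only the CDB exists, so there is nothing to compare against.
        return cdb_path, True, None
    if cdb_mtime < text_mtime:
        # The text file changed after the CDB was built: warn and fall back.
        warning = "%s is older than %s and is probably out of date" % (
            cdb_path, text_path)
        return text_path, False, warning
    return cdb_path, True, None
```

For example, choose_list_file("/etc/spamdyke/recipient-blacklist") would select recipient-blacklist.cdb only when it exists and is at least as new as the text file.
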
> >>
> >>
> >> Another version of this would be to have lots of new configuration
> >> options like:
> >>
> >> recipient-blacklist-file-cdb=/etc/spamdyke/recipient-blacklist.cdb
> >>
> >> That makes it possible to name the database file arbitrarily. If we
> >> want the safety checks like in the example above we could make it
> >> mandatory to name the ASCII input file for the CDB database file:
> >>
> >> recipient-blacklist-file=/etc/spamdyke/recipient-blacklist
> >> recipient-blacklist-file-cdb=/etc/spamdyke/recipient-blacklist.cdb
> >>
> >> That way all the fallbacks to ASCII plus warnings can be implemented at
> >> the cost of more configuration entries.
> >>
> >>
> >> What do you think?
> >>
> >>
> 
> 
> --
> -Eric 'shubes'
> 
> _______________________________________________
> spamdyke-users mailing list
> [email protected]
> http://www.spamdyke.org/mailman/listinfo/spamdyke-users
