I've been using Matt Cowles's x-lookup-ip extension with some success recently to reveal the real IP addresses behind spammers' hostnames. For example, the following hostnames are mentioned in pharma come-ons:
    % host www.astlehover.com
    www.astlehover.com has address 211.144.68.87
    % host www.tornetseen.com
    www.tornetseen.com has address 211.144.68.87
    % host www.erlikuvera.com
    www.erlikuvera.com has address 211.144.68.87
    % host www.oplimazexu.com
    www.oplimazexu.com has address 211.144.68.87

The rest of the message content is pretty well disguised (very little content, random common text boilerplate, etc.), so without IP lookup these messages tend to plop into my unsure mailbox, and they sometimes score low enough to land in my regular inbox. Matt's extension solves that by looking up the IP addresses for the hosts it encounters and generating a number of new tokens:

    % spamcounts -r :211
    token,nspam,nham,spam prob
    url-ip:211.144.68.87/32,1,0,0.844827586207
    url-ip:211.144.68/24,1,0,0.844827586207
    url-ip:211/8,4,0,0.949438202247
    url-ip:211.20.189/24,1,0,0.844827586207
    url-ip:211.189.18/24,1,0,0.844827586207
    url-ip:211.144/16,1,0,0.844827586207
    received:211.95.72.130,1,0,0.844827586207
    url-ip:211.189.18.186/32,1,0,0.844827586207
    url-ip:211.22.166.116/32,1,0,0.844827586207
    received:211.96,1,0,0.844827586207
    received:211.95,1,0,0.844827586207
    url-ip:211.22.166/24,1,0,0.844827586207
    received:211.95.72,1,0,0.844827586207
    url-ip:211.20/16,1,0,0.844827586207
    url-ip:211.20.189.50/32,1,0,0.844827586207
    received:211.96.42,1,0,0.844827586207
    url-ip:211.22/16,1,0,0.844827586207
    received:211,2,0,0.908163265306
    received:211.96.42.103,1,0,0.844827586207
    url-ip:211.189/16,1,0,0.844827586207

Unfortunately, it doesn't cache IP addresses across sessions. My train-to-exhaustion scheme scores my entire training database, so the first round of scoring is very time-consuming. I decided to fix that shortcoming by adding "dbm" and "zodb" support to Matt's dnscache module, since those are probably the two most prevalent storage schemes in SpamBayes (the current default and the former default).

I've been testing the zodb scheme but having trouble with it. If I start with no ~/.dnscache* files, it correctly creates a new one.
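Before getting to the caching problem, a quick aside on those token shapes: the multi-granularity url-ip tokens can be derived from a single resolved address with a few lines of Python. This is just my sketch of the idea (ip_tokens is a hypothetical helper name, not Matt's actual code):

```python
def ip_tokens(ip):
    """Expand a dotted-quad address into url-ip tokens at /32, /24,
    /16 and /8 granularity, mirroring the spamcounts output above."""
    octets = ip.split(".")
    tokens = []
    for bits in (32, 24, 16, 8):
        # Keep the first bits/8 octets for a prefix of that width.
        tokens.append("url-ip:%s/%d" % (".".join(octets[:bits // 8]), bits))
    return tokens

print(ip_tokens("211.144.68.87"))
# ['url-ip:211.144.68.87/32', 'url-ip:211.144.68/24',
#  'url-ip:211.144/16', 'url-ip:211/8']
```

The point of emitting all four widths is that a spammer can hop between addresses within the same netblock while the /24, /16, and /8 tokens keep accumulating spam counts, as the url-ip:211/8 row above shows.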
If I have an existing database already, it doesn't update the database file, though the timestamps on the .index and .tmp files are updated. I asked on zodb-dev and got some partial help (I was relying on __del__ to close() the FileStorage object), but even with that fixed it's not working properly. My recent pleas for help have gone unanswered, so I'm turning to this list. My zodb code was cribbed from the support in SpamBayes itself, so maybe the author of that code will see what I've done wrong.

I set up the cache in tokenizer.py like so:

    try:
        import dnscache
        cache = dnscache.cache(cachefile=os.path.expanduser("~/.dnscache"))
        cache.printStatsAtEnd = True
    except (IOError, ImportError):
        cache = None
    else:
        import atexit
        atexit.register(cache.close)

In the cache class's __init__ I open the cachefile if given:

    if cachefile:
        self.open_cachefile(cachefile)
    else:
        self.caches = {"A": {}, "PTR": {}}

    def open_cachefile(self, cachefile):
        filetype = options["Storage", "persistent_use_database"]
        cachefile = os.path.expanduser(cachefile)
        if filetype == "dbm":
            if os.path.exists(cachefile):
                self.caches = shelve.open(cachefile)
            else:
                self.caches = shelve.open(cachefile)
                self.caches["A"] = {}
                self.caches["PTR"] = {}
        elif filetype == "zodb":
            from ZODB import DB
            from ZODB.FileStorage import FileStorage
            self._zodb_storage = FileStorage(cachefile, read_only=False)
            self._DB = DB(self._zodb_storage, cache_size=10000)
            self._conn = self._DB.open()
            root = self._conn.root()
            self.caches = root.get("dnscache")
            if self.caches is None:
                # There is no cache yet, so create one.
                from BTrees.OOBTree import OOBTree
                self.caches = root["dnscache"] = OOBTree()
                self.caches["A"] = {}
                self.caches["PTR"] = {}
                print "opened new cache"
            else:
                print "opened existing cache with", len(self.caches["A"]), "A records",
                print "and", len(self.caches["PTR"]), "PTR records"

and when it's closed, this code executes:

    def close(self):
        filetype = options["Storage", "persistent_use_database"]
        if filetype == "dbm":
            self.caches.close()
        elif filetype == "zodb":
            self._zodb_close()

    def _zodb_store(self):
        import transaction
        from ZODB.POSException import ConflictError
        from ZODB.POSException import TransactionFailedError
        try:
            transaction.commit()
        except ConflictError, msg:
            # We'll save it next time, or on close.  It'll be lost if we
            # hard-crash, but that's unlikely, and not a particularly big
            # deal.
            if options["globals", "verbose"]:
                print >> sys.stderr, "Conflict on commit.", msg
            transaction.abort()
        except TransactionFailedError, msg:
            # Saving isn't working.  Try to abort, but chances are that
            # restarting is needed.
            if options["globals", "verbose"]:
                print >> sys.stderr, "Store failed.  Need to restart.", msg
            transaction.abort()

    def _zodb_close(self):
        # Ensure that the db is saved before closing.  Alternatively, we
        # could abort any waiting transaction.  We need to do *something*
        # with it, though, or it will be still around after the db is
        # closed and cause problems.  For now, saving seems to make sense
        # (and we can always add abort methods if they are ever needed).
        self._zodb_store()
        # Do the closing.
        self._DB.close()
        # We don't make any use of the 'undo' capabilities of the
        # FileStorage at the moment, so might as well pack the database
        # each time it is closed, to save as much disk space as possible.
        # Pack it up to where it was 'yesterday'.
        # XXX What is the 'referencesf' parameter for pack()?  It doesn't
        # XXX seem to do anything according to the source.
        ## self._zodb_storage.pack(time.time()-60*60*24, None)
        self._zodb_storage.close()
        self._zodb_closed = True
        if options["globals", "verbose"]:
            print >> sys.stderr, 'Closed dnscache database'

When run, it correctly announces that it's either creating a new cache or that it opened an existing one, e.g.:

    opened existing cache with 479 A records and 0 PTR records

No errors appear on stdout or stderr during the run, and at completion it tells me "Closed dnscache database". I can see that the database isn't getting updated because (a) its timestamp doesn't change, and (b) running strings(1) over the file and grepping for newly encountered names doesn't find them:

    % # this one exists...
    % strings -a ~/.dnscache* | egrep -i timsblogger
    www.timsbloggers.comq
    % # this one is new...
    % strings -a ~/.dnscache* | egrep -i tradelink
    % # bummer...

Does anyone have any suggestions about getting this beast to work properly?

Thx,

Skip

_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
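P.S. For comparison, the round-trip behavior I expect from the cache is easy to demonstrate with the dbm/shelve flavor in isolation. This is just a sketch with stand-in names and a TEST-NET documentation address, not the real dnscache code:

```python
import os
import shelve
import tempfile

# Round-trip a toy A-record cache through shelve, the way the "dbm"
# flavor stores it: a top-level "A" mapping of hostname -> addresses.
# The hostname and 192.0.2.1 (a TEST-NET-1 address) are placeholders.
path = os.path.join(tempfile.mkdtemp(), "dnscache-test")

db = shelve.open(path)
db["A"] = {"www.example.com": ["192.0.2.1"]}
# Note: assigning db["A"] wholesale is persisted; mutating the nested
# dict in place would need shelve.open(..., writeback=True) to stick.
db.close()

# Reopen and verify the entry survived -- the persistence my zodb
# flavor isn't exhibiting.
db = shelve.open(path)
cached = db["A"]
db.close()

print(cached)  # {'www.example.com': ['192.0.2.1']}
```

That shelve caveat about in-place mutation of nested plain dicts may be relevant to the zodb case too, but I haven't confirmed it.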