Revision: 3156
          http://spambayes.svn.sourceforge.net/spambayes/?rev=3156&view=rev
Author:   montanaro
Date:     2007-07-25 06:51:11 -0700 (Wed, 25 Jul 2007)

Log Message:
-----------
Read the file name incorrectly as a misspelling: "prefs changelog" instead of
"pre sf changelog"!

Added Paths:
-----------
    trunk/website/presfchangelog.ht

Removed Paths:
-------------
    trunk/website/prefschangelog.ht

Deleted: trunk/website/prefschangelog.ht
===================================================================
--- trunk/website/prefschangelog.ht     2007-07-25 13:49:42 UTC (rev 3155)
+++ trunk/website/prefschangelog.ht     2007-07-25 13:51:11 UTC (rev 3156)
@@ -1,905 +0,0 @@
-<h2>Pre-Sourceforge ChangeLog</h2>
-<p>This changelog lists the commits on the spambayes project before the
-   separate project was set up. See also the 
-<a href="http://spambayes.cvs.sourceforge.net/python/python/nondist/sandbox/spambayes/?hideattic=0">old CVS repository</a>, but don't forget that it's now out of date, and you probably want to be looking at <a href="http://spambayes.cvs.sourceforge.net/spambayes/spambayes/">the current CVS</a>.
-</p>
-<pre>
-2002-09-06 02:27  tim_one
-
-       * GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
-       cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
-       (1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
-
-       This code has been moved to a new SourceForge project (spambayes).
-       
-2002-09-05 15:37  tim_one
-
-       * classifier.py (1.11):
-
-       Added note about MINCOUNT oddities.
-       
-2002-09-05 14:32  tim_one
-
-       * timtest.py (1.17):
-
-       Added note about word length.
-       
-2002-09-05 13:48  tim_one
-
-       * timtest.py (1.16):
-
-       tokenize_word():  Oops!  This was awfully permissive in what it
-       took as being "an email address".  Tightened that, and also
-       avoided 5-gram'ing of email addresses w/ high-bit characters.
-       
-       false positive percentages
-           0.000  0.000  tied
-           0.000  0.000  tied
-           0.050  0.050  tied
-           0.000  0.000  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.050  0.050  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.025  0.050  lost
-           0.075  0.075  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.000  0.000  tied
-           0.025  0.025  tied
-           0.050  0.050  tied
-       
-       won   0 times
-       tied 19 times
-       lost  1 times
-       
-       total unique fp went from 7 to 8
-       
-       false negative percentages
-           0.764  0.691  won
-           0.691  0.655  won
-           0.981  0.945  won
-           1.309  1.309  tied
-           1.418  1.164  won
-           0.873  0.800  won
-           0.800  0.763  won
-           1.163  1.163  tied
-           1.491  1.345  won
-           1.200  1.127  won
-           1.381  1.345  won
-           1.454  1.490  lost
-           1.164  0.909  won
-           0.655  0.582  won
-           0.655  0.691  lost
-           1.163  1.163  tied
-           1.200  1.018  won
-           0.982  0.873  won
-           0.982  0.909  won
-           1.236  1.127  won
-       
-       won  15 times
-       tied  3 times
-       lost  2 times
-       
-       total unique fn went from 260 to 249
-       
-       Note:  Each of the two losses there consists of just 1 msg difference.
-       The wins are bigger as well as being more common, and 260-249 = 11
-       spams no longer sneak by any run (which is more than 4% of the 260
-       spams that used to sneak thru!).
-       
-2002-09-05 11:51  tim_one
-
-       * classifier.py (1.10):
-
-       Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
-       really matter; leaving it alone.
-       
-2002-09-05 10:02  tim_one
-
-       * classifier.py (1.9):
-
-       A now-rare pure win, changing spamprob() to work harder to find more
-       evidence when competing 0.01 and 0.99 clues appear.  Before in the left
-       column, after in the right:
-       
-       false positive percentages
-           0.000  0.000  tied
-           0.000  0.000  tied
-           0.050  0.050  tied
-           0.000  0.000  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.050  0.050  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.075  0.075  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.075  0.025  won
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.000  0.000  tied
-           0.025  0.025  tied
-           0.050  0.050  tied
-       
-       won   1 times
-       tied 19 times
-       lost  0 times
-       
-       total unique fp went from 9 to 7
-       
-       false negative percentages
-           0.909  0.764  won
-           0.800  0.691  won
-           1.091  0.981  won
-           1.381  1.309  won
-           1.491  1.418  won
-           1.055  0.873  won
-           0.945  0.800  won
-           1.236  1.163  won
-           1.564  1.491  won
-           1.200  1.200  tied
-           1.454  1.381  won
-           1.599  1.454  won
-           1.236  1.164  won
-           0.800  0.655  won
-           0.836  0.655  won
-           1.236  1.163  won
-           1.236  1.200  won
-           1.055  0.982  won
-           1.127  0.982  won
-           1.381  1.236  won
-       
-       won  19 times
-       tied  1 times
-       lost  0 times
-       
-       total unique fn went from 284 to 260
-       
-2002-09-04 11:21  tim_one
-
-       * timtest.py (1.15):
-
-       Augmented the spam callback to display spams with low probability.
-       
-2002-09-04 09:53  tim_one
-
-       * Tester.py (1.3), timtest.py (1.14):
-
-       Added support for simple histograms of the probability distributions for
-       ham and spam.
-       
-2002-09-03 12:13  tim_one
-
-       * timtest.py (1.13):
-
-       A reluctant "on principle" change no matter what it does to the stats:
-       take a stab at removing HTML decorations from plain text msgs.  See
-       comments for why it's *only* in plain text msgs.  This puts an end to
-       false positives due to text msgs talking *about* HTML.  Surprisingly, it
-       also gets rid of some false negatives.  Not surprisingly, it introduced
-       another small class of false positives due to the dumbass regexp trick
-       used to approximate HTML tag removal removing pieces of text that had
-       nothing to do with HTML tags (e.g., this happened in the middle of a
-       uuencoded .py file in such a way that it just happened to leave behind
-       a string that "looked like" a spam phrase; but before this it looked
-       like a pile of "too long" lines that didn't generate any tokens --
-       it's a nonsense outcome either way).
-       
-       false positive percentages
-           0.000  0.000  tied
-           0.000  0.000  tied
-           0.050  0.050  tied
-           0.000  0.000  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.050  0.050  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.000  0.025  lost
-           0.075  0.075  tied
-           0.050  0.025  won
-           0.025  0.025  tied
-           0.000  0.025  lost
-           0.050  0.075  lost
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.000  0.000  tied
-           0.025  0.025  tied
-           0.050  0.050  tied
-       
-       won   1 times
-       tied 16 times
-       lost  3 times
-       
-       total unique fp went from 8 to 9
-       
-       false negative percentages
-           0.945  0.909  won
-           0.836  0.800  won
-           1.200  1.091  won
-           1.418  1.381  won
-           1.455  1.491  lost
-           1.091  1.055  won
-           1.091  0.945  won
-           1.236  1.236  tied
-           1.564  1.564  tied
-           1.236  1.200  won
-           1.563  1.454  won
-           1.563  1.599  lost
-           1.236  1.236  tied
-           0.836  0.800  won
-           0.873  0.836  won
-           1.236  1.236  tied
-           1.273  1.236  won
-           1.018  1.055  lost
-           1.091  1.127  lost
-           1.490  1.381  won
-       
-       won  12 times
-       tied  4 times
-       lost  4 times
-       
-       total unique fn went from 292 to 284
-       
-2002-09-03 06:57  tim_one
-
-       * classifier.py (1.8):
-
-       Added a new xspamprob() method, which computes the combined probability
-       "correctly", and a long comment block explaining what happened when I
-       tried it.  There's something worth pursuing here (it greatly improves
-       the false negative rate), but this change alone pushes too many marginal
-       hams into the spam camp
-       
-2002-09-03 05:23  tim_one
-
-       * timtest.py (1.12):
-
-       Made "skip:" tokens shorter.
-       
-       Added a surprising treatment of Organization headers, with a tiny f-n
-       benefit for a tiny cost.  No change in f-p stats.
-       
-       false negative percentages
-           1.091  0.945  won
-           0.945  0.836  won
-           1.236  1.200  won
-           1.454  1.418  won
-           1.491  1.455  won
-           1.091  1.091  tied
-           1.127  1.091  won
-           1.236  1.236  tied
-           1.636  1.564  won
-           1.345  1.236  won
-           1.672  1.563  won
-           1.599  1.563  won
-           1.236  1.236  tied
-           0.836  0.836  tied
-           1.018  0.873  won
-           1.236  1.236  tied
-           1.273  1.273  tied
-           1.055  1.018  won
-           1.091  1.091  tied
-           1.527  1.490  won
-       
-       won  13 times
-       tied  7 times
-       lost  0 times
-       
-       total unique fn went from 302 to 292
-       
-2002-09-03 02:18  tim_one
-
-       * timtest.py (1.11):
-
-       tokenize_word():  dropped the prefix from the signature; it's faster
-       to let the caller do it, and this also repaired a bug in one place it
-       was being used (well, a *conceptual* bug anyway, in that the code didn't
-       do what I intended there).  This changes the stats in an insignificant
-       way.  The f-p stats didn't change.  The f-n stats shifted by one message
-       in a few cases:
-       
-       false negative percentages
-           1.091  1.091  tied
-           0.945  0.945  tied
-           1.200  1.236  lost
-           1.454  1.454  tied
-           1.491  1.491  tied
-           1.091  1.091  tied
-           1.091  1.127  lost
-           1.236  1.236  tied
-           1.636  1.636  tied
-           1.382  1.345  won
-           1.636  1.672  lost
-           1.599  1.599  tied
-           1.236  1.236  tied
-           0.836  0.836  tied
-           1.018  1.018  tied
-           1.236  1.236  tied
-           1.273  1.273  tied
-           1.055  1.055  tied
-           1.091  1.091  tied
-           1.527  1.527  tied
-       
-       won   1 times
-       tied 16 times
-       lost  3 times
-       
-       total unique unchanged
-       
-2002-09-02 19:30  tim_one
-
-       * timtest.py (1.10):
-
-       Don't ask me why this helps -- I don't really know!  When skipping "long
-       words", generating a token with a brief hint about what and how much got
-       skipped makes a definite improvement in the f-n rate, and doesn't affect
-       the f-p rate at all.  Since experiment said it's a winner, I'm checking
-       it in.  Before (left column) and after (right column):
-       
-       false positive percentages
-           0.000  0.000  tied
-           0.000  0.000  tied
-           0.050  0.050  tied
-           0.000  0.000  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.050  0.050  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.000  0.000  tied
-           0.075  0.075  tied
-           0.050  0.050  tied
-           0.025  0.025  tied
-           0.000  0.000  tied
-           0.050  0.050  tied
-           0.025  0.025  tied
-           0.025  0.025  tied
-           0.000  0.000  tied
-           0.025  0.025  tied
-           0.050  0.050  tied
-       
-       won   0 times
-       tied 20 times
-       lost  0 times
-       
-       total unique fp went from 8 to 8
-       
-       false negative percentages
-           1.236  1.091  won
-           1.164  0.945  won
-           1.454  1.200  won
-           1.599  1.454  won
-           1.527  1.491  won
-           1.236  1.091  won
-           1.163  1.091  won
-           1.309  1.236  won
-           1.891  1.636  won
-           1.418  1.382  won
-           1.745  1.636  won
-           1.708  1.599  won
-           1.491  1.236  won
-           0.836  0.836  tied
-           1.091  1.018  won
-           1.309  1.236  won
-           1.491  1.273  won
-           1.127  1.055  won
-           1.309  1.091  won
-           1.636  1.527  won
-       
-       won  19 times
-       tied  1 times
-       lost  0 times
-       
-       total unique fn went from 336 to 302
-       
-2002-09-02 17:55  tim_one
-
-       * timtest.py (1.9):
-
-       Some comment changes and nesting reduction.
-       
-2002-09-02 11:18  tim_one
-
-       * timtest.py (1.8):
-
-       Fixed some out-of-date comments.
-       
-       Made URL clumping lumpier:  now distinguishes among just "first field",
-       "second field", and "everything else".
-       
-       Changed tag names for email address fields (semantically neutral).
-       
-       Added "From:" line tagging.
-       
-       These add up to an almost pure win.  Before-and-after f-n rates across 20
-       runs:
-       
-       1.418   1.236
-       1.309   1.164
-       1.636   1.454
-       1.854   1.599
-       1.745   1.527
-       1.418   1.236
-       1.381   1.163
-       1.418   1.309
-       2.109   1.891
-       1.491   1.418
-       1.854   1.745
-       1.890   1.708
-       1.818   1.491
-       1.055   0.836
-       1.164   1.091
-       1.599   1.309
-       1.600   1.491
-       1.127   1.127
-       1.164   1.309
-       1.781   1.636
-       
-       It only increased in one run.  The variance appears to have been reduced
-       too (I didn't bother to compute that, though).
-       
-       Before-and-after f-p rates across 20 runs:
-       
-       0.000   0.000
-       0.000   0.000
-       0.075   0.050
-       0.000   0.000
-       0.025   0.025
-       0.050   0.025
-       0.075   0.050
-       0.025   0.025
-       0.025   0.025
-       0.025   0.000
-       0.100   0.075
-       0.050   0.050
-       0.025   0.025
-       0.000   0.000
-       0.075   0.050
-       0.025   0.025
-       0.025   0.025
-       0.000   0.000
-       0.075   0.025
-       0.100   0.050
-       
-       Note that 0.025% is a single message; it's really impossible to *measure*
-       an improvement in the f-p rate anymore with 4000-msg ham sets.
-       
-       Across all 20 runs,
-       
-       the total # of unique f-n fell from 353 to 336
-       the total # of unique f-p fell from 13 to 8
-       
-2002-09-02 10:06  tim_one
-
-       * timtest.py (1.7):
-
-       A number of changes.  The most significant is paying attention to the
-       Subject line (I was wrong before when I said my c.l.py ham corpus was
-       unusable for this due to Mailman-injected decorations).  In all, across
-       my 20 test runs,
-       
-       the total # of unique false positives fell from 23 to 13
-       the total # of unique false negatives rose from 337 to 353
-       
-       Neither result is statistically significant, although I bet the first
-       one would be if I pissed away a few days trying to come up with a more
-       realistic model for what "stat. sig." means here <wink>.
-       
-2002-09-01 17:22  tim_one
-
-       * classifier.py (1.7):
-
-       Added a comment block about HAMBIAS experiments.  There's no clearer
-       example of trading off precision against recall, and you can favor either
-       at the expense of the other to any degree you like by fiddling this knob.
-       
-2002-09-01 14:42  tim_one
-
-       * timtest.py (1.6):
-
-       Long new comment block summarizing all my experiments with character
-       n-grams.  Bottom line is that they have nothing going for them, and a
-       lot going against them, under Graham's scheme.  I believe there may
-       still be a place for them in *part* of a word-based tokenizer, though.
-       
-2002-09-01 10:05  tim_one
-
-       * classifier.py (1.6):
-
-       spamprob():  Never count unique words more than once anymore.  Counting
-       up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
-       that's now a small drag instead.
-       
-2002-09-01 07:33  tim_one
-
-       * rebal.py (1.3), timtest.py (1.5):
-
-       Folding case is here to stay.  Read the new comments for why.  This may
-       be a bad idea for other languages, though.
-       
-       Refined the embedded-URL tagging scheme.  Curious:  as a protocol,
-       http is spam-neutral, but https is a strong spam indicator.  That
-       surprised me.
-       
-2002-09-01 06:47  tim_one
-
-       * classifier.py (1.5):
-
-       spamprob():  Removed useless check that wordstream isn't empty.  For one
-       thing, it didn't work, since wordstream is often an iterator.  Even if
-       it did work, it isn't needed -- the probability of an empty wordstream
-       gets computed as 0.5 based on the total absence of evidence.
-       
-2002-09-01 05:37  tim_one
-
-       * timtest.py (1.4):
-
-       textparts():  Worm around what feels like a bug in msg.walk() (Barry has
-       details).
-       
-2002-09-01 05:09  tim_one
-
-       * rebal.py (1.2):
-
-       Aha!  Staring at the checkin msg revealed a logic bug that explains why
-       my ham directories sometimes remained unbalanced after running this --
-       if the randomly selected reservoir msg turned out to be spam, it wasn't
-       pushing the too-small directory on the stack again.
-       
-2002-09-01 04:56  tim_one
-
-       * timtest.py (1.3):
-
-       textparts():  This was failing to weed out redundant HTML in cases like
-       this:
-       
-           multipart/alternative
-               text/plain
-               multipart/related
-                   text/html
-       
-       The tokenizer here also transforms everything to lowercase, but that's
-       an accident due simply to the fact that I'm testing that now.  Can't say for
-       sure until the test runs end, but so far it looks like a bad idea for
-       the false positive rate.
-       
-2002-09-01 04:52  tim_one
-
-       * rebal.py (1.1):
-
-       A little script I use to rebalance the ham corpora after deleting what
-       turns out to be spam.  I have another Ham/reservoir directory with a
-       few thousand randomly selected msgs from the presumably-good archive.
-       These aren't used in scoring or training.  This script marches over all
-       the ham corpora directories that are used, and if any have gotten too
-       big (this never happens anymore) deletes msgs at random from them, and
-       if any have gotten too small plugs the holes by moving in random
-       msgs from the reservoir.
-       
-2002-09-01 03:25  tim_one
-
-       * classifier.py (1.4), timtest.py (1.2):
-
-       Boost UNKNOWN_SPAMPROB.
-       # The spam probability assigned to words never seen before.  Graham used
-       # 0.2 here.  Neil Schemenauer reported that 0.5 seemed to work better.  In
-       # Tim's content-only tests (no headers), boosting to 0.5 cut the false
-       # negative rate by over 1/3.  The f-p rate increased, but there were so few
-       # f-ps that the increase wasn't statistically significant.  It also caught
-       # 13 more spams erroneously classified as ham.  By eyeball (and common
-       # sense <wink>), this has most effect on very short messages, where there
-       # simply aren't many high-value words.  A word with prob 0.5 is (in effect)
-       # completely ignored by spamprob(), in favor of *any* word with *any* prob
-       # differing from 0.5.  At 0.2, an unknown word favors ham at the expense
-       # of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
-       # on the face of it.
-       
-2002-08-31 16:50  tim_one
-
-       * timtest.py (1.1):
-
-       This is a driver I've been using for test runs.  It's specific to my
-       corpus directories, but has useful stuff in it all the same.
-       
-2002-08-31 16:49  tim_one
-
-       * classifier.py (1.3):
-
-       The explanation for these changes was on Python-Dev.  You'll find out
-       why if the moderator approves the msg <wink>.
-       
-2002-08-29 07:04  tim_one
-
-       * Tester.py (1.2), classifier.py (1.2):
-
-       Tester.py:  Repaired a comment.  The false_{positive,negative})_rate()
-       functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
-       too hard to get motivated to reduce 0.01 <0.1 wink>).
-       
-       GrahamBayes.spamprob:  New optional bool argument; when true, a list of
-       the 15 strongest (word, probability) pairs is returned as well as the
-       overall probability (this is how to find out why a message scored as it
-       did).
-       
-2002-08-28 13:45  montanaro
-
-       * GBayes.py (1.15):
-
-       ehh - it actually didn't work all that well.  the spurious report that it
-       did well was pilot error.  besides, tim's report suggests that a simple
-       str.split() may be the best tokenizer anyway.
-       
-2002-08-28 10:45  montanaro
-
-       * setup.py (1.1):
-
-       trivial little setup.py file - i don't expect most people will be interested
-       in this, but it makes it a tad simpler to work with now that there are two
-       files
-       
-2002-08-28 10:43  montanaro
-
-       * GBayes.py (1.14):
-
-       add simple trigram tokenizer - this seems to yield the best results I've
-       seen so far (but has not been extensively tested)
-       
-2002-08-28 08:10  tim_one
-
-       * Tester.py (1.1):
-
-       A start at a testing class.  There isn't a lot here, but it automates
-       much of the tedium, and as the doctest shows it can already do
-       useful things, like remembering which inputs were misclassified.
-       
-2002-08-27 06:45  tim_one
-
-       * mboxcount.py (1.5):
-
-       Updated stats to what Barry and I both get now.  Fiddled output.
-       
-2002-08-27 05:09  bwarsaw
-
-       * split.py (1.5), splitn.py (1.2):
-
-       _factory(): Return the empty string instead of None in the except
-       clauses, so that for-loops won't break prematurely.  mailbox.py's base
-       class defines an __iter__() that raises a StopIteration on None
-       return.
-       
-2002-08-27 04:55  tim_one
-
-       * GBayes.py (1.13), mboxcount.py (1.4):
-
-       Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
-       
-2002-08-27 04:40  bwarsaw
-
-       * mboxcount.py (1.3):
-
-       Some stats after splitting b/w good messages and unparseable messages
-       
-2002-08-27 04:23  bwarsaw
-
-       * mboxcount.py (1.2):
-
-       _factory(): Use a marker object to distinguish between good messages and
-       unparseable messages.  For some reason, returning None from the except
-       clause in _factory() caused Python 2.2.1 to exit early out of the for
-       loop.
-       
-       main(): Print statistics about both the number of good messages and
-       the number of unparseable messages.
-       
-2002-08-27 03:06  tim_one
-
-       * cleanarch (1.2):
-
-       "From " is a header more than a separator, so don't bump the msg count
-       at the end.
-       
-2002-08-24 01:42  tim_one
-
-       * GBayes.py (1.12), classifier.py (1.1):
-
-       Moved all the interesting code that was in the *original* GBayes.py into
-       a new classifier.py.  It was designed to have a very clean interface,
-       and there's no reason to keep slamming everything into one file.  The
-       ever-growing tokenizer stuff should probably also be split out, leaving
-       GBayes.py a pure driver.
-       
-       Also repaired _test() (Skip's checkin left it without a binding for
-       the tokenize function).
-       
-2002-08-24 01:17  tim_one
-
-       * splitn.py (1.1):
-
-       Utility to split an mbox into N random pieces in one gulp.  This gives
-       a convenient way to break a giant corpus into multiple files that can
-       then be used independently across multiple training and testing runs.
-       It's important to do multiple runs on different random samples to avoid
-       drawing conclusions based on accidents in a single random training corpus;
-       if the algorithm is robust, it should have similar performance across
-       all runs.
-       
-2002-08-24 00:25  montanaro
-
-       * GBayes.py (1.11):
-
-       Allow command line specification of tokenize functions
-           run w/ -t flag to override default tokenize function
-           run w/ -H flag to see list of tokenize functions
-       
-       When adding a new tokenizer, make docstring a short description and add a
-       key/value pair to the tokenizers dict.  The key is what the user specifies.
-       The value is a tokenize function.
-       
-       Added two new tokenizers - tokenize_wordpairs_foldcase and
-       tokenize_words_and_pairs.  It's not obvious that either is better than any
-       of the preexisting functions.
-       
-       Should probably add info to the pickle which indicates the tokenizing
-       function used to build it.  This could then be the default for spam
-       detection runs.
-       
-       Next step is to drive this with spam/non-spam corpora, selecting each of the
-       various tokenizer functions, and presenting the results in tabular form.
-       
-2002-08-23 13:10  tim_one
-
-       * GBayes.py (1.10):
-
-       spamprob():  Commented some subtleties.
-       
-       clearjunk():  Undid Guido's attempt to space-optimize this.  The problem
-       is that you can't delete entries from a dict that's being crawled over
-       by .iteritems(), which is why I (I suddenly recall) materialized a
-       list of words to be deleted the first time I wrote this.  It's a lot
-       better to materialize a list of to-be-deleted words than to materialize
-       the entire database in a dict.items() list.
-       
-2002-08-23 12:36  tim_one
-
-       * mboxcount.py (1.1):
-
-       Utility to count and display the # of msgs in (one or more) Unix mboxes.
-       
-2002-08-23 12:11  tim_one
-
-       * split.py (1.4):
-
-       Open files in binary mode.  Else, e.g., about 400MB of Barry's python-list
-       corpus vanishes on Windows.  Also use file.write() instead of print>>, as
-       the latter invents an extra newline.
-       
-2002-08-22 07:01  tim_one
-
-       * GBayes.py (1.9):
-
-       Renamed "modtime" to "atime", to better reflect its meaning, and added a
-       comment block to explain that better.
-       
-2002-08-21 08:07  bwarsaw
-
-       * split.py (1.3):
-
-       Guido suggests a different order for the positional args.
-       
-2002-08-21 07:37  bwarsaw
-
-       * split.py (1.2):
-
-       Get rid of the -1 and -2 arguments and make them positional.
-       
-2002-08-21 07:18  bwarsaw
-
-       * split.py (1.1):
-
-       A simple mailbox splitter
-       
-2002-08-21 06:42  tim_one
-
-       * GBayes.py (1.8):
-
-       Added a bunch of simple tokenizers.  The originals are renamed to
-       tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
-       New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
-       tokenize_5gram, tokenize_10gram, and tokenize_15gram.  I don't expect
-       any of these to be the last word.  When Barry has the test corpus
-       set up it should be easy to let the data tell us which "pure" strategy
-       works best.  Straight character n-grams are very appealing because
-       they're the simplest and most language-neutral; I didn't have any luck
-       with them over the weekend, but the size of my training data was
-       trivial.
-       
-2002-08-21 05:08  bwarsaw
-
-       * cleanarch (1.1):
-
-       An archive cleaner, adapted from the Mailman 2.1b3 version, but
-       de-Mailman-ified.
-       
-2002-08-21 04:44  gvanrossum
-
-       * GBayes.py (1.7):
-
-       Indent repair in clearjunk().
-       
-2002-08-21 04:22  gvanrossum
-
-       * GBayes.py (1.6):
-
-       Some minor cleanup:
-       
-       - Move the identifying comment to the top, clarify it a bit, and add
-         author info.
-       
-       - There's no reason for _time and _heapreplace to be hidden names;
-         change these back to time and heapreplace.
-       
-       - Rename main1() to _test() and main2() to main(); when main() sees
-         there are no options or arguments, it runs _test().
-       
-       - Get rid of a list comprehension from clearjunk().
-       
-       - Put wordinfo.get as a local variable in _add_msg().
-       
-2002-08-20 15:16  tim_one
-
-       * GBayes.py (1.5):
-
-       Neutral typo repairs, except that clearjunk() has a better chance of
-       not blowing up immediately now <wink -- I have yet to try it!>.
-       
-2002-08-20 13:49  montanaro
-
-       * GBayes.py (1.4):
-
-       help make it more easily executable... ;-)
-       
-2002-08-20 09:32  bwarsaw
-
-       * GBayes.py (1.3):
-
-       Lots of hacks great and small to the main() program, but I didn't
-       touch the guts of the algorithm.
-       
-       Added a module docstring/usage message.
-       
-       Added a bunch of switches to train the system on an mbox of known good
-       and known spam messages (using PortableUnixMailbox only for now).
-       Uses the email package but does no decoding of message bodies.  Also,
-       allows you to specify a file for pickling the training data, and for
-       setting a threshold, above which messages get an X-Bayes-Score
-       header.  Also output messages (marked and unmarked) to an output file
-       for retraining.
-       
-       Print some statistics at the end.
-       
-2002-08-20 05:43  tim_one
-
-       * GBayes.py (1.2):
-
-       Turned off debugging vrbl mistakenly checked in at True.
-       
-       unlearn():  Gave this an update_probabilities=True default arg, for
-       symmetry with learn().
-       
-2002-08-20 03:33  tim_one
-
-       * GBayes.py (1.1):
-
-       An implementation of Paul Graham's Bayes-like spam classifier.
-
-</pre>

Copied: trunk/website/presfchangelog.ht (from rev 3155, trunk/website/prefschangelog.ht)
===================================================================
--- trunk/website/presfchangelog.ht                             (rev 0)
+++ trunk/website/presfchangelog.ht     2007-07-25 13:51:11 UTC (rev 3156)
@@ -0,0 +1,905 @@
+<h2>Pre-Sourceforge ChangeLog</h2>
+<p>This changelog lists the commits on the spambayes project before the
+   separate project was set up. See also the 
+<a href="http://spambayes.cvs.sourceforge.net/python/python/nondist/sandbox/spambayes/?hideattic=0">old CVS repository</a>, but don't forget that it's now out of date, and you probably want to be looking at <a href="http://spambayes.cvs.sourceforge.net/spambayes/spambayes/">the current CVS</a>.
+</p>
+<pre>
+2002-09-06 02:27  tim_one
+
+       * GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
+       cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
+       (1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
+
+       This code has been moved to a new SourceForge project (spambayes).
+       
+2002-09-05 15:37  tim_one
+
+       * classifier.py (1.11):
+
+       Added note about MINCOUNT oddities.
+       
+2002-09-05 14:32  tim_one
+
+       * timtest.py (1.17):
+
+       Added note about word length.
+       
+2002-09-05 13:48  tim_one
+
+       * timtest.py (1.16):
+
+       tokenize_word():  Oops!  This was awfully permissive in what it
+       took as being "an email address".  Tightened that, and also
+       avoided 5-gram'ing of email addresses w/ high-bit characters.
+       
+       false positive percentages
+           0.000  0.000  tied
+           0.000  0.000  tied
+           0.050  0.050  tied
+           0.000  0.000  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.050  0.050  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.025  0.050  lost
+           0.075  0.075  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.000  0.000  tied
+           0.025  0.025  tied
+           0.050  0.050  tied
+       
+       won   0 times
+       tied 19 times
+       lost  1 times
+       
+       total unique fp went from 7 to 8
+       
+       false negative percentages
+           0.764  0.691  won
+           0.691  0.655  won
+           0.981  0.945  won
+           1.309  1.309  tied
+           1.418  1.164  won
+           0.873  0.800  won
+           0.800  0.763  won
+           1.163  1.163  tied
+           1.491  1.345  won
+           1.200  1.127  won
+           1.381  1.345  won
+           1.454  1.490  lost
+           1.164  0.909  won
+           0.655  0.582  won
+           0.655  0.691  lost
+           1.163  1.163  tied
+           1.200  1.018  won
+           0.982  0.873  won
+           0.982  0.909  won
+           1.236  1.127  won
+       
+       won  15 times
+       tied  3 times
+       lost  2 times
+       
+       total unique fn went from 260 to 249
+       
+       Note:  Each of the two losses there consists of just 1 msg difference.
+       The wins are bigger as well as being more common, and 260-249 = 11
+       spams no longer sneak by any run (which is more than 4% of the 260
+       spams that used to sneak thru!).
+       
+2002-09-05 11:51  tim_one
+
+       * classifier.py (1.10):
+
+       Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
+       really matter; leaving it alone.
+       
+2002-09-05 10:02  tim_one
+
+       * classifier.py (1.9):
+
+       A now-rare pure win, changing spamprob() to work harder to find more
+       evidence when competing 0.01 and 0.99 clues appear.  Before in the left
+       column, after in the right:
+       
+       false positive percentages
+           0.000  0.000  tied
+           0.000  0.000  tied
+           0.050  0.050  tied
+           0.000  0.000  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.050  0.050  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.075  0.075  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.075  0.025  won
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.000  0.000  tied
+           0.025  0.025  tied
+           0.050  0.050  tied
+       
+       won   1 times
+       tied 19 times
+       lost  0 times
+       
+       total unique fp went from 9 to 7
+       
+       false negative percentages
+           0.909  0.764  won
+           0.800  0.691  won
+           1.091  0.981  won
+           1.381  1.309  won
+           1.491  1.418  won
+           1.055  0.873  won
+           0.945  0.800  won
+           1.236  1.163  won
+           1.564  1.491  won
+           1.200  1.200  tied
+           1.454  1.381  won
+           1.599  1.454  won
+           1.236  1.164  won
+           0.800  0.655  won
+           0.836  0.655  won
+           1.236  1.163  won
+           1.236  1.200  won
+           1.055  0.982  won
+           1.127  0.982  won
+           1.381  1.236  won
+       
+       won  19 times
+       tied  1 times
+       lost  0 times
+       
+       total unique fn went from 284 to 260
+       
+2002-09-04 11:21  tim_one
+
+       * timtest.py (1.15):
+
+       Augmented the spam callback to display spams with low probability.
+       
+2002-09-04 09:53  tim_one
+
+       * Tester.py (1.3), timtest.py (1.14):
+
+       Added support for simple histograms of the probability distributions for
+       ham and spam.
+       
+2002-09-03 12:13  tim_one
+
+       * timtest.py (1.13):
+
+       A reluctant "on principle" change no matter what it does to the stats:
+       take a stab at removing HTML decorations from plain text msgs.  See
+       comments for why it's *only* in plain text msgs.  This puts an end to
+       false positives due to text msgs talking *about* HTML.  Surprisingly, it
+       also gets rid of some false negatives.  Not surprisingly, it introduced
+       another small class of false positives due to the dumbass regexp trick
+       used to approximate HTML tag removal removing pieces of text that had
+       nothing to do with HTML tags (e.g., this happened in the middle of a
+       uuencoded .py file in such a way that it just happened to leave behind
+       a string that "looked like" a spam phrase; but before this it looked
+       like a pile of "too long" lines that didn't generate any tokens --
+       it's a nonsense outcome either way).
+       
+       false positive percentages
+           0.000  0.000  tied
+           0.000  0.000  tied
+           0.050  0.050  tied
+           0.000  0.000  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.050  0.050  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.000  0.025  lost
+           0.075  0.075  tied
+           0.050  0.025  won
+           0.025  0.025  tied
+           0.000  0.025  lost
+           0.050  0.075  lost
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.000  0.000  tied
+           0.025  0.025  tied
+           0.050  0.050  tied
+       
+       won   1 times
+       tied 16 times
+       lost  3 times
+       
+       total unique fp went from 8 to 9
+       
+       false negative percentages
+           0.945  0.909  won
+           0.836  0.800  won
+           1.200  1.091  won
+           1.418  1.381  won
+           1.455  1.491  lost
+           1.091  1.055  won
+           1.091  0.945  won
+           1.236  1.236  tied
+           1.564  1.564  tied
+           1.236  1.200  won
+           1.563  1.454  won
+           1.563  1.599  lost
+           1.236  1.236  tied
+           0.836  0.800  won
+           0.873  0.836  won
+           1.236  1.236  tied
+           1.273  1.236  won
+           1.018  1.055  lost
+           1.091  1.127  lost
+           1.490  1.381  won
+       
+       won  12 times
+       tied  4 times
+       lost  4 times
+       
+       total unique fn went from 292 to 284
+       
+2002-09-03 06:57  tim_one
+
+       * classifier.py (1.8):
+
+       Added a new xspamprob() method, which computes the combined probability
+       "correctly", and a long comment block explaining what happened when I
+       tried it.  There's something worth pursuing here (it greatly improves
+       the false negative rate), but this change alone pushes too many marginal
+       hams into the spam camp
+       
+2002-09-03 05:23  tim_one
+
+       * timtest.py (1.12):
+
+       Made "skip:" tokens shorter.
+       
+       Added a surprising treatment of Organization headers, with a tiny f-n
+       benefit for a tiny cost.  No change in f-p stats.
+       
+       false negative percentages
+           1.091  0.945  won
+           0.945  0.836  won
+           1.236  1.200  won
+           1.454  1.418  won
+           1.491  1.455  won
+           1.091  1.091  tied
+           1.127  1.091  won
+           1.236  1.236  tied
+           1.636  1.564  won
+           1.345  1.236  won
+           1.672  1.563  won
+           1.599  1.563  won
+           1.236  1.236  tied
+           0.836  0.836  tied
+           1.018  0.873  won
+           1.236  1.236  tied
+           1.273  1.273  tied
+           1.055  1.018  won
+           1.091  1.091  tied
+           1.527  1.490  won
+       
+       won  13 times
+       tied  7 times
+       lost  0 times
+       
+       total unique fn went from 302 to 292
+       
+2002-09-03 02:18  tim_one
+
+       * timtest.py (1.11):
+
+       tokenize_word():  dropped the prefix from the signature; it's faster
+       to let the caller do it, and this also repaired a bug in one place it
+       was being used (well, a *conceptual* bug anyway, in that the code didn't
+       do what I intended there).  This changes the stats in an insignificant
+       way.  The f-p stats didn't change.  The f-n stats shifted by one message
+       in a few cases:
+       
+       false negative percentages
+           1.091  1.091  tied
+           0.945  0.945  tied
+           1.200  1.236  lost
+           1.454  1.454  tied
+           1.491  1.491  tied
+           1.091  1.091  tied
+           1.091  1.127  lost
+           1.236  1.236  tied
+           1.636  1.636  tied
+           1.382  1.345  won
+           1.636  1.672  lost
+           1.599  1.599  tied
+           1.236  1.236  tied
+           0.836  0.836  tied
+           1.018  1.018  tied
+           1.236  1.236  tied
+           1.273  1.273  tied
+           1.055  1.055  tied
+           1.091  1.091  tied
+           1.527  1.527  tied
+       
+       won   1 times
+       tied 16 times
+       lost  3 times
+       
+       total unique unchanged
+       
+2002-09-02 19:30  tim_one
+
+       * timtest.py (1.10):
+
+       Don't ask me why this helps -- I don't really know!  When skipping "long
+       words", generating a token with a brief hint about what and how much got
+       skipped makes a definite improvement in the f-n rate, and doesn't affect
+       the f-p rate at all.  Since experiment said it's a winner, I'm checking
+       it in.  Before (left column) and after (right column):
+       
+       false positive percentages
+           0.000  0.000  tied
+           0.000  0.000  tied
+           0.050  0.050  tied
+           0.000  0.000  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.050  0.050  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.000  0.000  tied
+           0.075  0.075  tied
+           0.050  0.050  tied
+           0.025  0.025  tied
+           0.000  0.000  tied
+           0.050  0.050  tied
+           0.025  0.025  tied
+           0.025  0.025  tied
+           0.000  0.000  tied
+           0.025  0.025  tied
+           0.050  0.050  tied
+       
+       won   0 times
+       tied 20 times
+       lost  0 times
+       
+       total unique fp went from 8 to 8
+       
+       false negative percentages
+           1.236  1.091  won
+           1.164  0.945  won
+           1.454  1.200  won
+           1.599  1.454  won
+           1.527  1.491  won
+           1.236  1.091  won
+           1.163  1.091  won
+           1.309  1.236  won
+           1.891  1.636  won
+           1.418  1.382  won
+           1.745  1.636  won
+           1.708  1.599  won
+           1.491  1.236  won
+           0.836  0.836  tied
+           1.091  1.018  won
+           1.309  1.236  won
+           1.491  1.273  won
+           1.127  1.055  won
+           1.309  1.091  won
+           1.636  1.527  won
+       
+       won  19 times
+       tied  1 times
+       lost  0 times
+       
+       total unique fn went from 336 to 302
+       
+2002-09-02 17:55  tim_one
+
+       * timtest.py (1.9):
+
+       Some comment changes and nesting reduction.
+       
+2002-09-02 11:18  tim_one
+
+       * timtest.py (1.8):
+
+       Fixed some out-of-date comments.
+       
+       Made URL clumping lumpier:  now distinguishes among just "first field",
+       "second field", and "everything else".
+       
+       Changed tag names for email address fields (semantically neutral).
+       
+       Added "From:" line tagging.
+       
+       These add up to an almost pure win.  Before-and-after f-n rates across 20
+       runs:
+       
+       1.418   1.236
+       1.309   1.164
+       1.636   1.454
+       1.854   1.599
+       1.745   1.527
+       1.418   1.236
+       1.381   1.163
+       1.418   1.309
+       2.109   1.891
+       1.491   1.418
+       1.854   1.745
+       1.890   1.708
+       1.818   1.491
+       1.055   0.836
+       1.164   1.091
+       1.599   1.309
+       1.600   1.491
+       1.127   1.127
+       1.164   1.309
+       1.781   1.636
+       
+       It only increased in one run.  The variance appears to have been reduced
+       too (I didn't bother to compute that, though).
+       
+       Before-and-after f-p rates across 20 runs:
+       
+       0.000   0.000
+       0.000   0.000
+       0.075   0.050
+       0.000   0.000
+       0.025   0.025
+       0.050   0.025
+       0.075   0.050
+       0.025   0.025
+       0.025   0.025
+       0.025   0.000
+       0.100   0.075
+       0.050   0.050
+       0.025   0.025
+       0.000   0.000
+       0.075   0.050
+       0.025   0.025
+       0.025   0.025
+       0.000   0.000
+       0.075   0.025
+       0.100   0.050
+       
+       Note that 0.025% is a single message; it's really impossible to *measure*
+       an improvement in the f-p rate anymore with 4000-msg ham sets.
+       
+       Across all 20 runs,
+       
+       the total # of unique f-n fell from 353 to 336
+       the total # of unique f-p fell from 13 to 8
+       
+2002-09-02 10:06  tim_one
+
+       * timtest.py (1.7):
+
+       A number of changes.  The most significant is paying attention to the
+       Subject line (I was wrong before when I said my c.l.py ham corpus was
+       unusable for this due to Mailman-injected decorations).  In all, across
+       my 20 test runs,
+       
+       the total # of unique false positives fell from 23 to 13
+       the total # of unique false negatives rose from 337 to 353
+       
+       Neither result is statistically significant, although I bet the first
+       one would be if I pissed away a few days trying to come up with a more
+       realistic model for what "stat. sig." means here <wink>.
+       
+2002-09-01 17:22  tim_one
+
+       * classifier.py (1.7):
+
+       Added a comment block about HAMBIAS experiments.  There's no clearer
+       example of trading off precision against recall, and you can favor either
+       at the expense of the other to any degree you like by fiddling this knob.
+       
+2002-09-01 14:42  tim_one
+
+       * timtest.py (1.6):
+
+       Long new comment block summarizing all my experiments with character
+       n-grams.  Bottom line is that they have nothing going for them, and a
+       lot going against them, under Graham's scheme.  I believe there may
+       still be a place for them in *part* of a word-based tokenizer, though.
+       
+2002-09-01 10:05  tim_one
+
+       * classifier.py (1.6):
+
+       spamprob():  Never count unique words more than once anymore.  Counting
+       up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
+       that's now a small drag instead.
+       
+2002-09-01 07:33  tim_one
+
+       * rebal.py (1.3), timtest.py (1.5):
+
+       Folding case is here to stay.  Read the new comments for why.  This may
+       be a bad idea for other languages, though.
+       
+       Refined the embedded-URL tagging scheme.  Curious:  as a protocol,
+       http is spam-neutral, but https is a strong spam indicator.  That
+       surprised me.
+       
+2002-09-01 06:47  tim_one
+
+       * classifier.py (1.5):
+
+       spamprob():  Removed useless check that wordstream isn't empty.  For one
+       thing, it didn't work, since wordstream is often an iterator.  Even if
+       it did work, it isn't needed -- the probability of an empty wordstream
+       gets computed as 0.5 based on the total absence of evidence.
+       
+2002-09-01 05:37  tim_one
+
+       * timtest.py (1.4):
+
+       textparts():  Worm around what feels like a bug in msg.walk() (Barry has
+       details).
+       
+2002-09-01 05:09  tim_one
+
+       * rebal.py (1.2):
+
+       Aha!  Staring at the checkin msg revealed a logic bug that explains why
+       my ham directories sometimes remained unbalanced after running this --
+       if the randomly selected reservoir msg turned out to be spam, it wasn't
+       pushing the too-small directory on the stack again.
+       
+2002-09-01 04:56  tim_one
+
+       * timtest.py (1.3):
+
+       textparts():  This was failing to weed out redundant HTML in cases like
+       this:
+       
+           multipart/alternative
+               text/plain
+               multipart/related
+                   text/html
+       
+       The tokenizer here also transforms everything to lowercase, but that's
+       an accident due simply to the fact that I'm testing that now.  Can't say for
+       sure until the test runs end, but so far it looks like a bad idea for
+       the false positive rate.
+       
+2002-09-01 04:52  tim_one
+
+       * rebal.py (1.1):
+
+       A little script I use to rebalance the ham corpora after deleting what
+       turns out to be spam.  I have another Ham/reservoir directory with a
+       few thousand randomly selected msgs from the presumably-good archive.
+       These aren't used in scoring or training.  This script marches over all
+       the ham corpora directories that are used, and if any have gotten too
+       big (this never happens anymore) deletes msgs at random from them, and
+       if any have gotten too small plugs the holes by moving in random
+       msgs from the reservoir.
+       
+2002-09-01 03:25  tim_one
+
+       * classifier.py (1.4), timtest.py (1.2):
+
+       Boost UNKNOWN_SPAMPROB.
+       # The spam probability assigned to words never seen before.  Graham used
+       # 0.2 here.  Neil Schemenauer reported that 0.5 seemed to work better.  In
+       # Tim's content-only tests (no headers), boosting to 0.5 cut the false
+       # negative rate by over 1/3.  The f-p rate increased, but there were so few
+       # f-ps that the increase wasn't statistically significant.  It also caught
+       # 13 more spams erroneously classified as ham.  By eyeball (and common
+       # sense <wink>), this has most effect on very short messages, where there
+       # simply aren't many high-value words.  A word with prob 0.5 is (in effect)
+       # completely ignored by spamprob(), in favor of *any* word with *any* prob
+       # differing from 0.5.  At 0.2, an unknown word favors ham at the expense
+       # of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
+       # on the face of it.
+       
+2002-08-31 16:50  tim_one
+
+       * timtest.py (1.1):
+
+       This is a driver I've been using for test runs.  It's specific to my
+       corpus directories, but has useful stuff in it all the same.
+       
+2002-08-31 16:49  tim_one
+
+       * classifier.py (1.3):
+
+       The explanation for these changes was on Python-Dev.  You'll find out
+       why if the moderator approves the msg <wink>.
+       
+2002-08-29 07:04  tim_one
+
+       * Tester.py (1.2), classifier.py (1.2):
+
+       Tester.py:  Repaired a comment.  The false_{positive,negative})_rate()
+       functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
+       too hard to get motivated to reduce 0.01 <0.1 wink>).
+       
+       GrahamBayes.spamprob:  New optional bool argument; when true, a list of
+       the 15 strongest (word, probability) pairs is returned as well as the
+       overall probability (this is how to find out why a message scored as it
+       did).
+       
+2002-08-28 13:45  montanaro
+
+       * GBayes.py (1.15):
+
+       ehh - it actually didn't work all that well.  the spurious report that it
+       did well was pilot error.  besides, tim's report suggests that a simple
+       str.split() may be the best tokenizer anyway.
+       
+2002-08-28 10:45  montanaro
+
+       * setup.py (1.1):
+
+       trivial little setup.py file - i don't expect most people will be interested
+       in this, but it makes it a tad simpler to work with now that there are two
+       files
+       
+2002-08-28 10:43  montanaro
+
+       * GBayes.py (1.14):
+
+       add simple trigram tokenizer - this seems to yield the best results I've
+       seen so far (but has not been extensively tested)
+       
+2002-08-28 08:10  tim_one
+
+       * Tester.py (1.1):
+
+       A start at a testing class.  There isn't a lot here, but it automates
+       much of the tedium, and as the doctest shows it can already do
+       useful things, like remembering which inputs were misclassified.
+       
+2002-08-27 06:45  tim_one
+
+       * mboxcount.py (1.5):
+
+       Updated stats to what Barry and I both get now.  Fiddled output.
+       
+2002-08-27 05:09  bwarsaw
+
+       * split.py (1.5), splitn.py (1.2):
+
+       _factory(): Return the empty string instead of None in the except
+       clauses, so that for-loops won't break prematurely.  mailbox.py's base
+       class defines an __iter__() that raises a StopIteration on None
+       return.
+       
+2002-08-27 04:55  tim_one
+
+       * GBayes.py (1.13), mboxcount.py (1.4):
+
+       Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
+       
+2002-08-27 04:40  bwarsaw
+
+       * mboxcount.py (1.3):
+
+       Some stats after splitting b/w good messages and unparseable messages
+       
+2002-08-27 04:23  bwarsaw
+
+       * mboxcount.py (1.2):
+
+       _factory(): Use a marker object to distinguish between good messages and
+       unparseable messages.  For some reason, returning None from the except
+       clause in _factory() caused Python 2.2.1 to exit early out of the for
+       loop.
+       
+       main(): Print statistics about both the number of good messages and
+       the number of unparseable messages.
+       
+2002-08-27 03:06  tim_one
+
+       * cleanarch (1.2):
+
+       "From " is a header more than a separator, so don't bump the msg count
+       at the end.
+       
+2002-08-24 01:42  tim_one
+
+       * GBayes.py (1.12), classifier.py (1.1):
+
+       Moved all the interesting code that was in the *original* GBayes.py into
+       a new classifier.py.  It was designed to have a very clean interface,
+       and there's no reason to keep slamming everything into one file.  The
+       ever-growing tokenizer stuff should probably also be split out, leaving
+       GBayes.py a pure driver.
+       
+       Also repaired _test() (Skip's checkin left it without a binding for
+       the tokenize function).
+       
+2002-08-24 01:17  tim_one
+
+       * splitn.py (1.1):
+
+       Utility to split an mbox into N random pieces in one gulp.  This gives
+       a convenient way to break a giant corpus into multiple files that can
+       then be used independently across multiple training and testing runs.
+       It's important to do multiple runs on different random samples to avoid
+       drawing conclusions based on accidents in a single random training corpus;
+       if the algorithm is robust, it should have similar performance across
+       all runs.
+       
+2002-08-24 00:25  montanaro
+
+       * GBayes.py (1.11):
+
+       Allow command line specification of tokenize functions
+           run w/ -t flag to override default tokenize function
+           run w/ -H flag to see list of tokenize functions
+       
+       When adding a new tokenizer, make docstring a short description and add a
+       key/value pair to the tokenizers dict.  The key is what the user specifies.
+       The value is a tokenize function.
+       
+       Added two new tokenizers - tokenize_wordpairs_foldcase and
+       tokenize_words_and_pairs.  It's not obvious that either is better than any
+       of the preexisting functions.
+       
+       Should probably add info to the pickle which indicates the tokenizing
+       function used to build it.  This could then be the default for spam
+       detection runs.
+       
+       Next step is to drive this with spam/non-spam corpora, selecting each of the
+       various tokenizer functions, and presenting the results in tabular form.
+       
+2002-08-23 13:10  tim_one
+
+       * GBayes.py (1.10):
+
+       spamprob():  Commented some subtleties.
+       
+       clearjunk():  Undid Guido's attempt to space-optimize this.  The problem
+       is that you can't delete entries from a dict that's being crawled over
+       by .iteritems(), which is why I (I suddenly recall) materialized a
+       list of words to be deleted the first time I wrote this.  It's a lot
+       better to materialize a list of to-be-deleted words than to materialize
+       the entire database in a dict.items() list.
+       
+2002-08-23 12:36  tim_one
+
+       * mboxcount.py (1.1):
+
+       Utility to count and display the # of msgs in (one or more) Unix mboxes.
+       
+2002-08-23 12:11  tim_one
+
+       * split.py (1.4):
+
+       Open files in binary mode.  Else, e.g., about 400MB of Barry's python-list
+       corpus vanishes on Windows.  Also use file.write() instead of print>>, as
+       the latter invents an extra newline.
+       
+2002-08-22 07:01  tim_one
+
+       * GBayes.py (1.9):
+
+       Renamed "modtime" to "atime", to better reflect its meaning, and added a
+       comment block to explain that better.
+       
+2002-08-21 08:07  bwarsaw
+
+       * split.py (1.3):
+
+       Guido suggests a different order for the positional args.
+       
+2002-08-21 07:37  bwarsaw
+
+       * split.py (1.2):
+
+       Get rid of the -1 and -2 arguments and make them positional.
+       
+2002-08-21 07:18  bwarsaw
+
+       * split.py (1.1):
+
+       A simple mailbox splitter
+       
+2002-08-21 06:42  tim_one
+
+       * GBayes.py (1.8):
+
+       Added a bunch of simple tokenizers.  The originals are renamed to
+       tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
+       New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
+       tokenize_5gram, tokenize_10gram, and tokenize_15gram.  I don't expect
+       any of these to be the last word.  When Barry has the test corpus
+       set up it should be easy to let the data tell us which "pure" strategy
+       works best.  Straight character n-grams are very appealing because
+       they're the simplest and most language-neutral; I didn't have any luck
+       with them over the weekend, but the size of my training data was
+       trivial.
+       
+2002-08-21 05:08  bwarsaw
+
+       * cleanarch (1.1):
+
+       An archive cleaner, adapted from the Mailman 2.1b3 version, but
+       de-Mailman-ified.
+       
+2002-08-21 04:44  gvanrossum
+
+       * GBayes.py (1.7):
+
+       Indent repair in clearjunk().
+       
+2002-08-21 04:22  gvanrossum
+
+       * GBayes.py (1.6):
+
+       Some minor cleanup:
+       
+       - Move the identifying comment to the top, clarify it a bit, and add
+         author info.
+       
+       - There's no reason for _time and _heapreplace to be hidden names;
+         change these back to time and heapreplace.
+       
+       - Rename main1() to _test() and main2() to main(); when main() sees
+         there are no options or arguments, it runs _test().
+       
+       - Get rid of a list comprehension from clearjunk().
+       
+       - Put wordinfo.get as a local variable in _add_msg().
+       
+2002-08-20 15:16  tim_one
+
+       * GBayes.py (1.5):
+
+       Neutral typo repairs, except that clearjunk() has a better chance of
+       not blowing up immediately now <wink -- I have yet to try it!>.
+       
+2002-08-20 13:49  montanaro
+
+       * GBayes.py (1.4):
+
+       help make it more easily executable... ;-)
+       
+2002-08-20 09:32  bwarsaw
+
+       * GBayes.py (1.3):
+
+       Lots of hacks great and small to the main() program, but I didn't
+       touch the guts of the algorithm.
+       
+       Added a module docstring/usage message.
+       
+       Added a bunch of switches to train the system on an mbox of known good
+       and known spam messages (using PortableUnixMailbox only for now).
+       Uses the email package but does no decoding of message bodies.  Also,
+       allows you to specify a file for pickling the training data, and for
+       setting a threshold, above which messages get an X-Bayes-Score
+       header.  Also output messages (marked and unmarked) to an output file
+       for retraining.
+       
+       Print some statistics at the end.
+       
+2002-08-20 05:43  tim_one
+
+       * GBayes.py (1.2):
+
+       Turned off debugging vrbl mistakenly checked in at True.
+       
+       unlearn():  Gave this an update_probabilities=True default arg, for
+       symmetry with learn().
+       
+2002-08-20 03:33  tim_one
+
+       * GBayes.py (1.1):
+
+       An implementation of Paul Graham's Bayes-like spam classifier.
+
+</pre>


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
_______________________________________________
Spambayes-checkins mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-checkins
