[Tony talking about the weighted cost calculation]
> That's another possibility, although it would
> probably be more difficult to compare against other
> spam filters (especially if anyone did adjust the
> weights).
Yes - if it was to be used to compare then the weights would have to be agreed
on in advance.
> John's main point in his "batting average" article was that a
> single accuracy score makes it difficult to see the difference
> between filters that reduce false positives by letting through
> a lot of spam vs. filters that kill almost all of the spam at
> the expense of increased false positives. By reporting the scores
> separately, the user can make the tradeoff based on what is more
> important to them.
The cost does this too, though, as long as the weights are correct for that
user. For example, if I *hated* fp's, didn't care at all about unsures, and
hardly cared about fn's, I could have weights of (e.g.) 100.0, 0.0 and 0.1
(respectively) and the score would reflect what was important to me. Kinda
like John's method of dividing the two numbers into each other, but better.
If the user (or reviewer, or whatever) is able to understand having two (or
four!) numbers, then that's better, though. Comparing filters is hard for many
other reasons, anyway (training regime, mail stream, etc.).
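
A minimal sketch of the weighted cost discussed above, assuming it is simply
a weighted sum of the three error counts. The function name and the message
counts for the two filters are made up for illustration; only the weights
(100.0 for fp, 0.0 for unsure, 0.1 for fn) come from the example above.

```python
def weighted_cost(fp, fn, unsure, fp_weight, fn_weight, unsure_weight):
    """Return a single cost figure from the three error counts.

    Assumption: the cost is a plain weighted sum. Lower is better.
    """
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


# Weights for a user who hates fp's, ignores unsures, and hardly
# cares about fn's (the values from the example in the message).
USER_WEIGHTS = dict(fp_weight=100.0, fn_weight=0.1, unsure_weight=0.0)

# Hypothetical results for two filters on the same mail stream.
filter_a = weighted_cost(fp=2, fn=50, unsure=30, **USER_WEIGHTS)
filter_b = weighted_cost(fp=0, fn=300, unsure=100, **USER_WEIGHTS)

# With these weights the filter with no false positives scores lower
# (better), even though it lets far more spam through.
print(filter_a, filter_b)
```

Changing the weights changes which filter wins, which is why they would have
to be agreed on in advance for any published comparison.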
[consolidating stats code]
> That would be good, but difficult currently because
> they take entirely different approaches. The Outlook
> addin totals up the stats as it goes,
> while sb_server recalculates them by iterating through
> the data in the messageinfo database.
I had forgotten about some of this, although I was thinking about a higher
level consolidation taking the raw counts, as you suggest.
> Maybe the changes you made to utilize the same
> messageinfo database for Outlook will allow us to
> calculate the Outlook stats the same way.
That's an interesting idea. It would save us having the separate database.
I've wondered (since I wrote the web interface method) whether it would get
really slow as the db increases in size, since it iterates through the whole
thing each time the stats are generated. I should have a play around and see
if that is going to be a problem or not (if so, maybe some sort of middle
ground between the methods can be found that both systems can use).
>> What do you think about the stats that are requested in the tracker?
> Are you referring to RFE #765924 regarding breaking down the stats by
> hour/day/week, etc? That seems like a lot of work for a questionable
> value, especially since we would probably have to store a bit more
> data in messageinfo to allow it.
Sorry, that was rather vague. Yes, I did mean that RFE. Those were my
thoughts too. Maybe a little script that just printed out the current stats
would be sufficient - if someone really wanted daily/whatever stats, they could
just set up some utility to call that script at the appropriate interval. The
number classified would say how much mail was received in that period, and you
could probably extract the rest from it. Without any more demand, though, I'm
inclined to leave it.
[Reset stats button]
> Should be easy enough, I'll take a look.
Thanks :)
> It would probably be nice to save the date when the
> statistics were last reset, as well.
Good idea.
> I haven't done much with pickles. Is that something
> that could be easily added to the stats file?
From memory (I don't have access to the code from where I am at the moment),
the pickle is just a dict that gets saved. So you could just add another
value ('stats["RESET_DATE"] = date' or something) and it would get saved.
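
A rough sketch of that idea, assuming the stats really are a plain dict that
is pickled to a file. The function names, the counter keys, the file name
and the date used in the demo are all hypothetical; only the "RESET_DATE"
key comes from the suggestion above.

```python
import os
import pickle
import tempfile
from datetime import datetime


def reset_stats(path, when=None):
    """Zero the counters and record when the reset happened."""
    stats = {
        "classified": 0,
        "fp": 0,
        "fn": 0,
        "unsure": 0,
        # The extra value is saved along with everything else in the dict.
        "RESET_DATE": when if when is not None else datetime.now(),
    }
    with open(path, "wb") as f:
        pickle.dump(stats, f)
    return stats


def load_stats(path):
    """Read the pickled stats dict back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def demo():
    """Round-trip a reset through a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stats.pik")  # hypothetical file name
        reset_stats(path, datetime(2004, 1, 1))  # arbitrary example date
        return load_stats(path)
```

Old pickles written before the key existed would not contain it, so any
reader would want something like stats.get("RESET_DATE") rather than a bare
lookup.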
However, I had forgotten until reading your message about the differences
between how the web interface and Outlook go about it. If it is now possible
for the plugin to use the messageinfo db, then maybe we don't need the stats
pickle any more. We could store a classified_date (and trained_date?) in the
messageinfo db easily enough, and then only pull the data we want (adding a
'current stats starting point' value too, I guess). I'll think about this and
have a look at how quickly the db is going to increase in size (it's already
going to be larger than the old version).
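
To make the alternative concrete, here is a sketch of calculating stats from
per-message records with a classified_date and a 'current stats starting
point'. The record layout (a dict with "classification" and
"classified_date" keys) and the function name are assumptions for
illustration, not the actual messageinfo db format.

```python
from datetime import datetime


def stats_since(records, start):
    """Count classifications made at or after the starting point.

    `records` is any iterable of dicts with a "classification" of
    "ham", "spam" or "unsure" and a "classified_date" datetime
    (hypothetical layout).
    """
    counts = {"classified": 0, "ham": 0, "spam": 0, "unsure": 0}
    for rec in records:
        when = rec.get("classified_date")
        if when is None or when < start:
            continue  # no date stored, or before the last reset
        counts["classified"] += 1
        counts[rec["classification"]] += 1
    return counts


# Made-up records standing in for the message database.
records = [
    {"classification": "spam", "classified_date": datetime(2004, 1, 1)},
    {"classification": "ham", "classified_date": datetime(2004, 1, 3)},
    {"classification": "unsure", "classified_date": datetime(2004, 1, 4)},
    {"classification": "spam"},  # older entry with no date stored
]
recent = stats_since(records, datetime(2004, 1, 2))
print(recent)
```

This still walks every record each time, so it does not answer the speed
question by itself; it only shows that a stored date plus a starting point
would be enough to replace the separate counters.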
=Tony.Meyer