message metadata (for Bayes etc.)

Justin Mason 19 Feb 2004 18:48:12 -0000

... is now checked in.

There's a new class (ha!) -- Mail::SpamAssassin::MsgMetadata.
I fully expect the name to change. ;)


This class is tasked with extracting "metadata" from messages for use as
Bayes tokens, fodder for eval tests, or other rules.  Metadata is
supplemental data inferred from the message -- some other classifier
systems also call it "features".

The motivation: we already extracted a lot of useful data from each message,
like originating untrusted relays, language, or country code of relay
address -- and these would make very good Bayes tokens.

However, because that extraction only happened in eval test code, sa-learn
had no way to get at the data without running eval test code itself.

To solve this, the code to extract these features has been moved *out* of
the eval-test/scanner objects and into a new class hung off the message
objects.  This way, both the scanner code and the learner code can call it
to extract metadata; the scanner can then use the results in eval rules as
before, and the learner can add them to the Bayes db.


HOW IT'S STORED

Metadata is held in two forms inside the message object:

1. as name-value pairs of strings, presented in mail header format.  For
example, "X-Language" => "en".  This is the general form for simple
metadata, such as the trusted-relay inference and language detection, that
is useful as Bayes tokens, can be added to marked-up messages using
"add_header", etc.  The MsgContainer object has three methods to manipulate
this simple type: get_metadata, put_metadata and get_all_metadata.

2. as more complex data structures on the $msg->{metadata} object.  This
is the form used for metadata like the HTML parse data, which is stored
there for access by eval rule code, and the rendered body text arrays.
Because it's not simple strings, it's not added as a Bayes token by
default (Bayes needs simple strings).

(Perhaps this class is not an appropriate place for #2 data... see
below.)
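
To make the two forms concrete, here is a minimal sketch.  ToyMsg is a
hypothetical stand-in, not the real MsgContainer; only the three method
names come from the description above, and their argument lists, the
{strings} key and the return format of get_all_metadata are assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the message root node; NOT the real MsgContainer.
package ToyMsg;

sub new {
    my ($class) = @_;
    # {strings} holds form 1 (header-style name/value pairs); any other key
    # on {metadata} holds form 2 (arbitrary data structures).
    return bless { metadata => { strings => {} } }, $class;
}

sub put_metadata {
    my ($self, $name, $value) = @_;
    $self->{metadata}{strings}{$name} = $value;
}

sub get_metadata {
    my ($self, $name) = @_;
    return $self->{metadata}{strings}{$name};
}

# Assumed behaviour: render every pair as one "Name: value" header line.
sub get_all_metadata {
    my ($self) = @_;
    my $strings = $self->{metadata}{strings};
    return join '', map { sprintf("%s: %s\n", $_, $strings->{$_}) }
                    sort keys %$strings;
}

package main;

my $msg = ToyMsg->new;
$msg->put_metadata('X-Language', 'en');
$msg->put_metadata('X-Relay-Country', 'IE');

# Form 2: a complex structure, kept apart from the string pairs so it never
# ends up as a Bayes token by accident.
$msg->{metadata}{html} = { tags_seen => { a => 3, img => 1 } };

print $msg->get_all_metadata;
```

Keeping the string pairs in their own hash means the learner can tokenize
everything in form 1 without having to know about the form 2 structures.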


WHAT FEATURES WE EXTRACT

The current metadata extracted:

- the Received headers are parsed here and added as metadata strings
  called "X-Spam-Relays-Trusted" and "X-Spam-Relays-Untrusted".

- language detection is run here (if "ok_languages" != "all") and the
  language token is added as a metadatum called "X-Language".  (TODO:
  this should be conditional, because language recognition is a slow
  process, but is ("ok_languages" != "all") the right way to enable it?)

- we call into plugins that have registered an "extract_metadata" method,
  and they can add whatever metadata they feel like adding.  For example,
  the new plugin class Mail::SpamAssassin::Plugin::RelayCountry will use
  IP::Country::Fast to resolve the IP addresses of untrusted relays to
  country codes, and add the result as metadata under the name
  "X-Relay-Country" (if I recall correctly).

  This is the preferred way to deal with expensive metadata extraction
  methods; if the admin wants to add that kind of metadata, they can load
  the plugin in question, and it will be added.

- In addition, the MsgMetadata class holds some parsing/rendering code; it
  calls the HTML renderer and holds the HTML features hash, and also now
  holds the functions that make the "decoded"/"rendered" text arrays.
  
  Note that there's an open question as to whether rendered data, and
  features discovered during that rendering, are really "metadata".  These
  may be more appropriate to put in another class, either in the root
  MsgContainer or another class off that.
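
As a rough illustration of the plugin hook described in the list above,
here is a self-contained sketch.  Everything in it is hypothetical except
the hook name "extract_metadata" and the "X-Relay-Country" name:
ToyRelayCountry, run_extract_metadata_hooks, the untrusted_relay_ips field
and the hook's argument list are made up for the example, and a hard-coded
table stands in for IP::Country::Fast.

```perl
use strict;
use warnings;

# Hypothetical plugin; the lookup table below stands in for a real resolver
# such as IP::Country::Fast.
package ToyRelayCountry;

my %COUNTRY_OF = ('192.0.2.10' => 'IE', '198.51.100.7' => 'US');

sub new {
    my ($class) = @_;
    return bless {}, $class;
}

# Called once per message; adds whatever string metadata it can derive.
# Unknown addresses are recorded as "XX" (an assumption of this sketch).
sub extract_metadata {
    my ($self, $msg) = @_;
    my @codes = map { $COUNTRY_OF{$_} // 'XX' } @{ $msg->{untrusted_relay_ips} };
    $msg->{metadata}{'X-Relay-Country'} = join ' ', @codes;
}

package main;

# Hypothetical dispatcher: call the hook on every loaded plugin that has it.
sub run_extract_metadata_hooks {
    my ($msg, @plugins) = @_;
    for my $plugin (@plugins) {
        $plugin->extract_metadata($msg) if $plugin->can('extract_metadata');
    }
}

my $msg = {
    untrusted_relay_ips => ['192.0.2.10', '203.0.113.5'],
    metadata            => {},
};
run_extract_metadata_hooks($msg, ToyRelayCountry->new);
print "X-Relay-Country: $msg->{metadata}{'X-Relay-Country'}\n";
```

The point of the design is that the core never pays for the lookup: if the
plugin is not loaded, the dispatcher has nothing to call and no
"X-Relay-Country" metadatum is produced.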


Finally, one more detail about how it's stored in our object model: the
metadata object is hung off the root node of the message, in an object ref
called $msg->{metadata}.

Only the message root node has a {metadata} member, because metadata is a
per-message thing, not a per-part thing. 

TODO: I still think a special subclass of MsgContainer for the message
root node is appropriate.  In the meantime, I added a set_is_root() method
so that parse() can inform the root node that it's the root.  But I would
prefer to make a subclass of MsgContainer instead, which would make that
method unnecessary...


LIFECYCLE

The metadata is extracted relatively lazily.   When a message is first
parse()d, the metadata object is created, but unpopulated.  If the message
is then passed into PerMsgStatus::check(), Bayes::learn(), or
Bayes::scan(), the method $msg->extract_message_metadata() is called.
This method will then perform all metadata extraction.

The method will only perform this action once per message; once the
metadata is extracted, any further calls to that method are ignored.

Finally, once the metadata is no longer needed -- when a scanner object
that scanned that message is deleted, for example -- the finish_metadata()
method is called to remove any metadata on the message, and return it
to being a simple representation of the message itself.

If the message object itself is destroyed using finish(), the metadata
it holds is also destroyed.
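
The extract-once behaviour can be sketched as below.  This is an
illustration only: ToyMsg and its {extracted} and {runs} fields are
hypothetical, the extraction step is a placeholder, and only the method
names extract_message_metadata and finish_metadata come from the
description above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parsed message; NOT the real message class.
package ToyMsg;

sub new {
    my ($class, $raw) = @_;
    # As after parse(): the metadata slot exists but is unpopulated.
    return bless { raw => $raw, metadata => {}, extracted => 0, runs => 0 },
                 $class;
}

sub extract_message_metadata {
    my ($self) = @_;
    return if $self->{extracted};            # further calls are ignored
    $self->{runs}++;                         # counts real extraction passes
    $self->{metadata}{'X-Language'} = 'en';  # placeholder for the real work
    $self->{extracted} = 1;
}

# Drop the metadata, leaving only the message itself.  Whether extraction
# may run again afterwards is not covered here, so the flag is left alone.
sub finish_metadata {
    my ($self) = @_;
    $self->{metadata} = {};
}

package main;

my $msg = ToyMsg->new("Subject: hi\n\nhello\n");
$msg->extract_message_metadata;   # e.g. triggered by a scan
$msg->extract_message_metadata;   # e.g. triggered by a learn: ignored
print "extraction passes: $msg->{runs}\n";

$msg->finish_metadata;
print "pairs left: ", scalar(keys %{ $msg->{metadata} }), "\n";
```

Guarding on a flag rather than on "is the hash empty" keeps the check
cheap and still works for a message that legitimately yields no metadata.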

--j.
