On 2015-05-03 10:55, Reindl Harald wrote:
recently i observed, while playing around with bayes training, that some junk
(maybe unintentionally) uses the mimetype 'application/octet-stream'
instead of 'text/html' for a payload containing a form with javascript,
which prevents the attachment from being tokenized
________________________________________

the new feature in 3.4.1 will take care of that, though i am not sure how
much impact a trained attachment has on classification in the end

SHA1 digests of all MIME parts (including non-textual) can now be
contributed to Bayes tokens, which allows the bayes classifier to assess
also the non-textual content. The set of sources of bayes tokens is
configurable with a new configuration option 'bayes_token_sources'
as documented in the Mail::SpamAssassin::Conf man page. (Bug 7115)
It is disabled by default for backward compatibility.
________________________________________

i am not sure what "backward compatibility" means in this context

Just a precaution. There were some concerns about whether
it is beneficial to contribute digests of non-textual
parts, and not much experience has been gained yet,
so to avoid any potential surprise the default is the same
as with 3.4.0, i.e. digests are not included.

In my experience it can be valuable to include these, and
I haven't seen any ill effect while observing top-10
bayes tokens containing digests, as logged by a debug log,
for several weeks.
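
To make "SHA1 digests of all MIME parts" concrete, here is a minimal sketch in Python. It is not SpamAssassin's code: the function name, the choice to hash the decoded payload of each leaf part, and the sample message are all assumptions made for illustration. The exact bytes SpamAssassin hashes and the format of the resulting Bayes tokens may well differ.

```python
import base64
import hashlib
from email import message_from_bytes
from typing import List, Tuple


def mime_part_digests(raw_message: bytes) -> List[Tuple[str, str]]:
    """Return (content type, SHA1 hex digest) for every leaf MIME part.

    Assumption: the digest is taken over the transfer-decoded payload.
    """
    msg = message_from_bytes(raw_message)
    digests = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # multipart containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        digests.append((part.get_content_type(),
                        hashlib.sha1(payload).hexdigest()))
    return digests


# Hypothetical sample: an HTML form mislabelled as application/octet-stream.
ATTACHMENT = b"<form></form>"
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="b"\r\n'
    b"\r\n"
    b"--b\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"hello\r\n"
    b"--b\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + base64.b64encode(ATTACHMENT) + b"\r\n"
    b"--b--\r\n"
)

for content_type, digest in mime_part_digests(SAMPLE):
    print(content_type, digest)
```

The point of the sketch is that the octet-stream part yields a stable digest even though its content is never tokenized as text, so an identical attachment seen again would produce the same token.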

correct me if i am wrong, but IMHO "bayes_token_sources all" should not have
a side effect when you train a bayes database on SA 3.4.1 and share it with
a setup using 3.4.0 - the 3.4.0 setup just would not benefit from the new
mimepart tokens in the database, but still from all the others?

That is correct, learned digest tokens as inserted by 3.4.1 are
ignored by 3.4.0 code.

Btw, note that spamd does not process messages larger than some
pre-set size limit. Even if truncated messages are passed to
spamd, it would not see MIME parts beyond the truncation limit.
This is unlike what the current (to-be-released) version of
amavisd does: regardless of mail size amavisd would compute
digests of *all* pristine mail parts, and pass them to SpamAssassin
out-of-band, already ready-to-use, even if a message is truncated.
This also avoids some pre-processing 'corruption' of MIME digests
when computed by SpamAssassin, as a 'pristine' mail as understood
by SpamAssassin is sometimes a little less 'pristine' than ideal,
e.g. due to squashing long runs of empty lines in a message,
and splitting long paragraphs into chunks.
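
Why this matters for digests can be shown with a small Python sketch. The normalisation function below is a hypothetical stand-in, not SpamAssassin's actual pre-processing, and the sample bytes and truncation length are made up. It only illustrates that any change to the bytes of a part, however small, gives a different SHA1 digest.

```python
import hashlib
import re


def sha1_hex(data: bytes) -> str:
    """SHA1 hex digest of a byte string."""
    return hashlib.sha1(data).hexdigest()


def squash_blank_lines(data: bytes) -> bytes:
    """Collapse runs of three or more newlines into one blank line.

    Illustrative only; stands in for whatever pre-processing is applied.
    """
    return re.sub(rb"\n{3,}", b"\n\n", data)


part = b"line one\n\n\n\n\nline two\n"

pristine = sha1_hex(part)                       # digest of the untouched part
squashed = sha1_hex(squash_blank_lines(part))   # digest after normalisation
truncated = sha1_hex(part[:12])                 # digest of a cut-off part

print("pristine :", pristine)
print("squashed :", squashed)
print("truncated:", truncated)
```

All three digests differ, so a token learned from the pristine part would not match one computed from a squashed or truncated copy of the same part.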

With MIME digests it's the same approach as with DKIM signatures,
which are also pre-computed by amavisd on the complete (non-truncated)
pristine message, and passed to SpamAssassin for use in the DKIM
plugin.

  Mark


