Re: HTML Filter

Jonathan Angliss Thu, 12 Sep 2002 23:05:50 -0700

Hi Sudip,
On Fri, 13 Sep 2002 11:35:26 +0545, you wrote:

> > HTML is put into emails in two main ways (that I know of), inline,
> > and via attachment. You can filter for inline by searching the
> > kludges for "Content-Type: text/html"...you cannot search for it if
> > it is done via attachment as TB! doesn't support filtering on
> > attachment headers.
> 
> How do I know if the HTML portion is embedded inline or as attachment?


You have to look at the message source itself.  If the top headers contain
Content-Type: text/html then it is inline... if they contain
multipart/alternative or similar, then the HTML is carried as a separate part
(an attachment), although some mail clients hide that fact.
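
If you have Python handy, the same check can be sketched with its standard
email package.  This is only an illustration of the rule above, not anything
TB! does internally; html_placement and the two sample messages are made up
for the example.

```python
from email import message_from_string


def html_placement(raw_message: str) -> str:
    """Classify a message by looking only at its top-level Content-Type."""
    msg = message_from_string(raw_message)
    top_type = msg.get_content_type()  # lower-cased, e.g. "text/html"
    if top_type == "text/html":
        return "inline"      # the whole body is HTML
    if msg.is_multipart():
        return "multipart"   # any HTML lives in one of the parts
    return "no top-level html"


# Hypothetical sample messages, trimmed to the headers that matter here.
inline_sample = (
    "Subject: Hi\n"
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "\n"
    "<HTML>hello</HTML>\n"
)
multi_sample = (
    "Subject: Hi\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--b1\n"
    "Content-Type: text/html\n"
    "\n"
    "<HTML>hello</HTML>\n"
    "--b1--\n"
)

print(html_placement(inline_sample))
print(html_placement(multi_sample))
```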

> ,----- [ Begin Quote ]
> | Subject: Message Subject
> | Reply-To: [EMAIL PROTECTED]
> | Content-Type: multipart/alternative;

This line is the hint... it tells you there are multiple parts.


> | boundary="part1_153.13f2cdab.2ab29e4b_boundary"

This defines the marker line that shows where each part starts... [1]

> | --part1_153.13f2cdab.2ab29e4b_boundary

Here is the start of the first part of the message...

> | Content-Type: text/plain; charset="ISO-8859-1"

This line tells the email client that the following text should be read as plain
text with the character set of ISO-8859-1.

> | Content-Transfer-Encoding: quoted-printable

Just the encoding type... nothing important here.

> | [Message Text]

I think you know what this bit is ;)


> | --part1_153.13f2cdab.2ab29e4b_boundary

This is the start of the second part... the boundaries are the same so the mail
client can work out where each part starts and stops.

> | Content-Type: text/html; charset=ISO-8859-1

This is the line that tells the email client the following text should be read
via an html engine of some type.

> | Content-Transfer-Encoding: quoted-printable

Encoding type again... not of any use in this example.

> | <HTML>

Your HTML version of the above message.
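
To see the same walk-through done mechanically, here is a rough sketch using
Python's standard email package.  The sample message reuses the boundary from
the quote above; list_parts, the Subject, and the placeholder bodies are
assumptions added for the example.

```python
from email import message_from_string


def list_parts(raw_message: str) -> list:
    """Return (content type, charset, transfer encoding) for each leaf part."""
    msg = message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart container, keep only real parts
        parts.append((
            part.get_content_type(),
            part.get_content_charset(),  # lower-cased by the library
            part.get("Content-Transfer-Encoding"),
        ))
    return parts


# Hypothetical message shaped like the quoted one; bodies are placeholders.
sample = (
    "Subject: Message Subject\n"
    "Content-Type: multipart/alternative;\n"
    ' boundary="part1_153.13f2cdab.2ab29e4b_boundary"\n'
    "\n"
    "--part1_153.13f2cdab.2ab29e4b_boundary\n"
    'Content-Type: text/plain; charset="ISO-8859-1"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "[Message Text]\n"
    "--part1_153.13f2cdab.2ab29e4b_boundary\n"
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<HTML>[Message Text]</HTML>\n"
    "--part1_153.13f2cdab.2ab29e4b_boundary--\n"
)

for content_type, charset, encoding in list_parts(sample):
    print(content_type, charset, encoding)
```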


> If this isn't a part of kludges, what is it part of? Message body? But
> the filter setup to detect this in the text fails as well.

Notice the bit I marked [1] near the top.  That is the end of the kludges.  The
body starts right after the first blank line (two consecutive CRLFs) following
the last header; everything after that is body.

Now what I have noticed (Marck, Allie, or somebody who knows the full story may
be able to confirm) is that when a message has attachments, filters only search
text/plain parts, and I think only the first one they come to.  I guess this is
to avoid scanning through large attachments, which are also part of the body,
and it also reduces false positives.  Try putting (if you can) a mail filter on
the mail server (procmail is easiest) that filters the body for the word "sex".
You'll get a LOT of false positives caused by attachments, because that string
turns up now and then in the apparently random base64 encoding.

I hope that makes sense... if not, I can try explaining it more clearly, or one
of the others may be able to put it in clearer terms ;)
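
For anyone who wants to see the false-positive effect, here is a small sketch
in Python (a modern Python 3 is assumed).  It compares a naive search of the
raw body with a search of only the first text/plain part, which is my guess at
what the filter does, not confirmed TB! behaviour.  Both function names are
made up, and the attachment bytes are deliberately chosen so that their base64
form spells out the word.

```python
import base64
from email import message_from_string
from email.message import EmailMessage


def raw_body_contains(raw_message: str, word: str) -> bool:
    """Naive filter: search everything after the first blank line."""
    _, _, body = raw_message.partition("\n\n")
    return word in body


def first_plain_part_contains(raw_message: str, word: str) -> bool:
    """Search only the decoded text of the first text/plain part."""
    msg = message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # undo the transfer encoding
            charset = part.get_content_charset() or "us-ascii"
            return word in payload.decode(charset, errors="replace")
    return False


# Build a hypothetical message: harmless text plus a binary attachment whose
# base64 encoding happens to read "sexsexsexsex".
demo = EmailMessage()
demo.set_content("See you at lunch.\n")
demo.add_attachment(
    base64.b64decode("sexsexsexsex"),
    maintype="application",
    subtype="octet-stream",
    filename="blob.bin",
)
raw = demo.as_string()

print(raw_body_contains(raw, "sex"))
print(first_plain_part_contains(raw, "sex"))
```

The first check trips over the encoded attachment, while the second only ever
looks at the readable text.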

-- 
Jonathan Angliss
([EMAIL PROTECTED])

________________________________________________
Current version is 1.61 | "Using TBUDL" information:
http://www.silverstones.com/thebat/TBUDLInfo.html
