Re: Escaping "From" separator line in an mbox

2001-12-27 Thread Philip Mak

On Thu, 27 Dec 2001, Matthew D. Fuller wrote:

> Mutt, I guess, outsmarts the mbox by reading Content-Length:, which you'd
> pretty much have to do I guess.  To me, it just seems like putting too
> much trust in the LDA, whatever that may be, but...  Then again, why not
> trust?  mbox is fragile as hell anyway, what's one more shaky assumption?
> ;)

Looking in my sent-mail folder from pine that had a message with unescaped
"From 66.28.28.22: Destination Host Unreachable", it did not have a
Content-Length header. Here are the headers for that message:

From [EMAIL PROTECTED] Sat Nov 10 03:17:28 2001 -0500
Date: Sat, 10 Nov 2001 03:17:28 -0500 (EST)
From: Philip Mak <[EMAIL PROTECTED]>
X-Sender:  <[EMAIL PROTECTED]>
To:  <[EMAIL PROTECTED]>
cc: James Ventrillo <[EMAIL PROTECTED]>,
Mike Little <[EMAIL PROTECTED]>
Subject: IP address problems on buildreferrals.com
Message-ID: <[EMAIL PROTECTED]>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: O
X-Status:
X-Keywords:
X-UID: 47

I'm guessing that mutt/pine/etc. use some best effort heuristics to
determine when a "From " line is a message separator. For example:

- A message separator only occurs after a blank line.
- A message separator contains "From ", an envelope sender address (as
  defined in RFC822 appendix d "addrspec"), whitespace, and a timestamp
  (weekday month day time [timezone] year).

It seems that there is not a reliable mechanism for unescaping ">From"
lines; I've found out that if I send a message that says "From" to myself
(using pine with mbox), it will become ">From" in some cases. I'm guessing
this is one of those things that should have been standardized, but
everyone just did it ad hoc and now it's a mess.

"man mbox" on my system says:

   In order to avoid mis-interpretation of lines in message bodies
   which begin with the four characters "From", followed by a space
   character, the character ">" is commonly prepended in front of
   such lines.

It says "commonly prepended", which implies that it doesn't have to be. :(
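As a rough illustration of the convention the manpage describes, here is a
minimal Python sketch of ">" escaping on write and its reversal on read. The
function names are made up for this example, and it assumes the simplest
variant, in which only lines beginning with exactly "From " are escaped.

```python
def escape_from_lines(body: str) -> str:
    """Prefix '>' to body lines that start with 'From ' (simplest variant)."""
    return "".join(
        ">" + line if line.startswith("From ") else line
        for line in body.splitlines(keepends=True)
    )


def unescape_from_lines(body: str) -> str:
    """Strip one leading '>' from body lines that start with '>From '."""
    return "".join(
        line[1:] if line.startswith(">From ") else line
        for line in body.splitlines(keepends=True)
    )


body = (
    "Hi,\n"
    "From 66.28.28.22: Destination Host Unreachable\n"
    ">From is how a reader may see it\n"
)
escaped = escape_from_lines(body)
restored = unescape_from_lines(escaped)
```

Under this variant the round trip is lossy: the third line already began
with ">From " before escaping, so it loses its ">" on the way back.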

So it would seem that for the mbox to Maildir conversion program that I'm
writing, the best thing that I can manage is to make it recognize a "From"
line as a message separator based on those two heuristics (preceding blank
line, and correct syntax) above.
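A minimal Python sketch of a splitter built on those two heuristics might
look like the following. The pattern and the name split_mbox are assumptions
made for illustration (they are not taken from mutt or pine), and the
addresses are placeholders.

```python
import re

# Assumed shape: "From ", sender, weekday, month, day, hh:mm:ss,
# optional three-letter zone, year. Anything after the year is tolerated.
FROM_LINE = re.compile(
    r"^From \S+\s+"
    r"(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat) "
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"[ \d]\d \d\d:\d\d:\d\d "
    r"(?:[A-Z]{3} )?\d{4}\b"
)


def split_mbox(text: str) -> list[str]:
    """Split mbox text at well-formed From_ lines that follow a blank line."""
    messages, current = [], []
    prev_blank = True  # the start of the file counts as "after a blank line"
    for line in text.splitlines(keepends=True):
        if prev_blank and current and FROM_LINE.match(line):
            messages.append("".join(current))
            current = []
        current.append(line)
        prev_blank = line.strip() == ""
    if current:
        messages.append("".join(current))
    return messages


sample = (
    "From me@example.com Sat Nov 10 03:17:28 2001 -0500\n"
    "Subject: IP address problems\n"
    "\n"
    "From 66.28.28.22: Destination Host Unreachable\n"
    "\n"
    "From you@example.com Mon Nov 26 06:33:50 2001\n"
    "Subject: reply\n"
    "\n"
    "ok\n"
)
parts = split_mbox(sample)
```

The unescaped "From 66.28.28.22:" line stays inside the first message because
it fails the syntax check. A body line that is itself a well-formed From_
line after a blank line would still be taken as a separator.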




Re: Escaping "From" separator line in an mbox

2001-12-27 Thread Matthew D. Fuller

On Thu, Dec 27, 2001 at 06:54:57AM -0500 I heard the voice of
David T-G, and lo! it spake thus:
> 
> % And your regex will break on it too.  For instance:
> 
> [snipped]
> Because of the single space before the day in each header, right?  If
> that's the case note that I noted it and didn't guarantee it ;-)

No, because the first 'content' line of the body of the message is an
unescaped otherwise-valid From_ line.  Using your regex (or the more
simplistic /^From /), it would be identified as a separate message,
rather than part of the actual message that it is.  The ONLY way to 'get
it right' that I can see is to trust the Content-Length: header.
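For illustration, a Content-Length-driven splitter could be sketched in
Python roughly as below. This is not how mutt does it; the function name is
hypothetical, and the sketch assumes Content-Length counts the body bytes
that follow the blank line after the headers. Where the header is missing it
falls back to looking for the next "From " line after a blank line.

```python
import re

CONTENT_LENGTH = re.compile(rb"^Content-Length:[ \t]*(\d+)[ \t]*$", re.I | re.M)


def split_by_content_length(data: bytes) -> list[bytes]:
    """Split mbox bytes, trusting Content-Length wherever it is present."""
    messages = []
    pos = 0
    while pos < len(data):
        header_end = data.find(b"\n\n", pos)
        if header_end == -1:  # headers only, no body: take the rest
            messages.append(data[pos:])
            break
        body_start = header_end + 2
        match = CONTENT_LENGTH.search(data[pos:header_end + 1])
        if match:
            end = min(body_start + int(match.group(1)), len(data))
        else:
            # Fallback (assumption): next "From " line after a blank line.
            nxt = data.find(b"\n\nFrom ", header_end)
            end = len(data) if nxt == -1 else nxt + 1
        messages.append(data[pos:end])
        pos = end
        while data[pos:pos + 1] == b"\n":  # skip the separating blank line
            pos += 1
    return messages


body1 = (
    b"From thorfinn@example.org Mon Sep  7 19:35:34 1998\n"
    b"Newsgroups: alt.sysadmin.recovery\n"
)
msg1 = (
    b"From me@example.com Tue Jan 12 08:05:47 1999\n"
    b"Subject: Numero Uno\n"
    b"Content-Length: %d\n\n" % len(body1)
) + body1
msg2 = b"From you@example.com Wed Jan 13 09:00:00 1999\nSubject: two\n\nhello\n"
parts = split_by_content_length(msg1 + b"\n" + msg2)
```

The From_-looking first body line of the first message is skipped over
because it lies inside the declared length. A wrong Content-Length value
would silently misplace every later boundary, so a real parser would
probably want to check that a From_ line actually follows.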

(The problem that cropped up in my test parse. "Hey, this is my 'sent'
folder...  why are there messages from people OTHER than me?
Waitaminute")



-- 
Matthew Fuller (MF4839) |[EMAIL PROTECTED]
Unix Systems Administrator  |[EMAIL PROTECTED]
Specializing in FreeBSD |http://www.over-yonder.net/

"The only reason I'm burning my candle at both ends, is because I
  haven't figured out how to light the middle yet"



Re: Escaping "From" separator line in an mbox

2001-12-27 Thread David T-G

Matthew --

...and then Matthew D. Fuller said...
% 
% On Thu, Dec 27, 2001 at 06:39:56AM -0500 I heard the voice of
% David T-G, and lo! it spake thus:
% > % 
% > % But it's got bare "^From " lines  in mid-message where they 'naturally'
% > % appeared.  So, either you need a bit more smarts than just "^From ", or
% > % mutt doesn't write 'sent' as a true mbox.
% > 
% > And I trust that this all works when you open it with mutt, right?  [Hey,
% > it never hurts to check.]
% 
% It works just fine with mutt.

That's good :-)


% And your regex will break on it too.  For instance:
% (from forwarding on a newsgroup post, some names changed to protect the
% guilty)

[snipped]
Because of the single space before the day in each header, right?  If
that's the case note that I noted it and didn't guarantee it ;-)


% 
% Mutt, I guess, outsmarts the mbox by reading Content-Length:, which you'd

Ahhh...  That would do it.  You ought to try my C-L: strip suggestion to
see if that's the case and how it breaks otherwise.

I wonder if that's a compile-time option.  That is, I wonder if my
version supports it, too.  Since I haven't told my MDA/LDA to do so, I
don't think it's used in favor of ^>From_ when the messages arrive in
either case, but we can try some pathological examples to find out...


% pretty much have to do I guess.  To me, it just seems like putting too
% much trust in the LDA, whatever that may be, but...  Then again, why not
% trust?  mbox is fragile as hell anyway, what's one more shaky assumption?
% ;)

*grin*


% 
% -- 
% Matthew Fuller (MF4839) |[EMAIL PROTECTED]
% Unix Systems Administrator  |[EMAIL PROTECTED]
% Specializing in FreeBSD |http://www.over-yonder.net/
% 
% "The only reason I'm burning my candle at both ends, is because I
%   haven't figured out how to light the middle yet"

Thanks again!


:-D
-- 
David T-G  * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/    Shpx gur Pbzzhavpngvbaf Qrprapl Npg!






Re: Escaping "From" separator line in an mbox

2001-12-27 Thread Matthew D. Fuller

On Thu, Dec 27, 2001 at 06:39:56AM -0500 I heard the voice of
David T-G, and lo! it spake thus:
> % 
> % I was just testing some mbox-parsing code the other day, and I needed a
> % quick mbox of reasonable size to test it against.  Hey, how about
> % ~/mail/sent?
> 
> One would think so...
> 
> 
> % 
> % But it's got bare "^From " lines  in mid-message where they 'naturally'
> % appeared.  So, either you need a bit more smarts than just "^From ", or
> % mutt doesn't write 'sent' as a true mbox.
> 
> And I trust that this all works when you open it with mutt, right?  [Hey,
> it never hurts to check.]

It works just fine with mutt.
And your regex will break on it too.  For instance:
(from forwarding on a newsgroup post, some names changed to protect the
guilty)
---
From [EMAIL PROTECTED] Tue Jan 12 08:05:47 1999
Message-ID: <[EMAIL PROTECTED]>
Date: Tue, 12 Jan 1999 08:05:47 -0600
From: Me <[EMAIL PROTECTED]>
To: You <[EMAIL PROTECTED]>
Subject: Numero Uno from Matt's Arhives
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.91.1i
X-WorldsBestEditor: vi
Status: RO
Content-Length: 4877
Lines: 103


From [EMAIL PROTECTED] Mon Sep  7 19:35:34 1998
Path:
news.futuresouth.com!news.futuresouth.com!dca1-feed3.news.digex.net!digex!
newsfeed.axxsys.net!newspump.monmouth.com!newspeer.monmouth.com!intgwpad.nntp.te
lstra.net!nsw.nntp.telstra.net!news.syd.connect.com.au!news.mel.connect.com.au!u
nico.com.au!thorfinn
From: [EMAIL PROTECTED] (Thorfinn)
Newsgroups: alt.sysadmin.recovery

[...]
---

Mutt, I guess, outsmarts the mbox by reading Content-Length:, which you'd
pretty much have to do I guess.  To me, it just seems like putting too
much trust in the LDA, whatever that may be, but...  Then again, why not
trust?  mbox is fragile as hell anyway, what's one more shaky assumption?
;)



-- 
Matthew Fuller (MF4839) |[EMAIL PROTECTED]
Unix Systems Administrator  |[EMAIL PROTECTED]
Specializing in FreeBSD |http://www.over-yonder.net/

"The only reason I'm burning my candle at both ends, is because I
  haven't figured out how to light the middle yet"



Re: Escaping "From" separator line in an mbox

2001-12-27 Thread David T-G

Matthew, et al --

...and then Matthew D. Fuller said...
% 
% On Wed, Dec 26, 2001 at 09:22:33PM -0500 I heard the voice of
% David T-G, and lo! it spake thus:
% > 
% > Thus, it should be sufficient to match on any ^From_ line as long as
% > you're working with an mbox file (which you can confirm by checking the
...
% 
% Note that this can (also) break.

So I hear!


% 
% I was just testing some mbox-parsing code the other day, and I needed a
% quick mbox of reasonable size to test it against.  Hey, how about
% ~/mail/sent?

One would think so...


% 
% But it's got bare "^From " lines  in mid-message where they 'naturally'
% appeared.  So, either you need a bit more smarts than just "^From ", or
% mutt doesn't write 'sent' as a true mbox.

And I trust that this all works when you open it with mutt, right?  [Hey,
it never hurts to check.]


% 
% The 'mbox' manpage from qmail says:
% ---
% MESSAGE FORMAT
%  A message encoded in mbox format begins with a  From_  line,
%  continues  with a series of non-From_ lines, and ends with a
%  blank line.  A From_ line means any line  that  begins  with
%  the characters F, r, o, m, space:
% 
%  [...]
% ---
% 
% Which seems to imply the POV that "^From " should be a sufficient pattern
% (in which case, watch out for your sent box!)

Yes, indeed.


% 
% Mutt seems to use a bit more smarts.  See "is_from()" in from.c for
% details.

At the very least, Philip now has a more solid regexp definition:

  From [address] weekday month day time [timezone] year

would probably turn into something like

  ^From ([^\t\s@][^\t\s@]*@[^\t\s@][^\t\s@]*\.[^\t\s@][^\t\s@]*|)  \
(Sun|Mon|Tue|Wed|Thu|Fri|Sat) \
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \
[\s1-3][0-9] [01][0-9]:[0-5][0-9]:[0-5][0-9] \
([A-Z][A-Z][A-Z] |) [0-9][0-9][0-9][0-9]

(yes, I've faked it with line breaks just to keep things readable; note
the two spaces at the end of the first line although it may not really
matter and [\s]* should perhaps be used instead).  No, I'm not going into
MIME-encoding of the header as seen in some ^From: lines.  No, this
doesn't allow for leap seconds (but *probably* all one needs is to add a
6 to the seconds regexp).  No, this will break at year 1; apparently
y2k taught me nothing :-)
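For anyone who wants to try the idea, here is one possible adaptation of
that pattern as a Python verbose regex. It is a sketch rather than a literal
translation: it accepts a zero-padded day, hours up to 23, and a leap second,
and the test addresses are placeholders.

```python
import re

FROM_LINE = re.compile(
    r"""
    ^From[ ]
    (?:[^\s@]+@[^\s@]+\.[^\s@]+)?\s+        # optional user@host.domain
    (?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)[ ]
    (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]
    [ 0-3][0-9][ ]                          # day, space- or zero-padded
    (?:[01][0-9]|2[0-3]):[0-5][0-9]:        # hours and minutes
    (?:[0-5][0-9]|60)[ ]                    # seconds, allowing a leap second
    (?:[A-Z]{3}[ ])?                        # optional three-letter zone
    [0-9]{4}                                # four-digit year
    """,
    re.VERBOSE,
)

lines = [
    "From me@example.com Wed Jun 06 18:44:53 2001",
    "From me@example.com Mon Sep  7 19:35:34 1998",
    "From me@example.com Sat Nov 10 03:17:28 EST 2001",
    "From 66.28.28.22: Destination Host Unreachable",
]
matches = [FROM_LINE.match(line) is not None for line in lines]
```

Literal spaces are written as [ ] because re.VERBOSE ignores bare whitespace
outside character classes.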


% 
% -- 
% Matthew Fuller (MF4839) |[EMAIL PROTECTED]
% Unix Systems Administrator  |[EMAIL PROTECTED]
% Specializing in FreeBSD |http://www.over-yonder.net/
% 
% "The only reason I'm burning my candle at both ends, is because I
%   haven't figured out how to light the middle yet"

HTH & HAND & Happy Holidays to all


:-D
-- 
David T-G  * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/    Shpx gur Pbzzhavpngvbaf Qrprapl Npg!






Re: Escaping "From" separator line in an mbox

2001-12-27 Thread Matthew D. Fuller

On Wed, Dec 26, 2001 at 09:22:33PM -0500 I heard the voice of
David T-G, and lo! it spake thus:
> 
> Thus, it should be sufficient to match on any ^From_ line as long as
> you're working with an mbox file (which you can confirm by checking the
> very first line of the file, which should tell you one way or another
> regardless of whether or not the mbox file has one or more messages in
> it) and then also ignore any ^>From_ that you might find, and not worry
> about ^From_ if you're not in an mbox file.

Note that this can (also) break.

I was just testing some mbox-parsing code the other day, and I needed a
quick mbox of reasonable size to test it against.  Hey, how about
~/mail/sent?

But it's got bare "^From " lines  in mid-message where they 'naturally'
appeared.  So, either you need a bit more smarts than just "^From ", or
mutt doesn't write 'sent' as a true mbox.

The 'mbox' manpage from qmail says:
---
MESSAGE FORMAT
 A message encoded in mbox format begins with a  From_  line,
 continues  with a series of non-From_ lines, and ends with a
 blank line.  A From_ line means any line  that  begins  with
 the characters F, r, o, m, space:

 [...]
---

Which seems to imply the POV that "^From " should be a sufficient pattern
(in which case, watch out for your sent box!)
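To make the failure concrete, here is a small Python sketch of a splitter
that takes that definition at face value. The function name and the sample
addresses are invented for the example; the sample mimics a sent message
whose body begins with an unescaped From_ line.

```python
def naive_split(text: str) -> list[str]:
    """Start a new message at every line that begins with 'From '."""
    messages = []
    for line in text.splitlines(keepends=True):
        if line.startswith("From ") or not messages:
            messages.append("")
        messages[-1] += line
    return messages


sent = (
    "From me@example.com Tue Jan 12 08:05:47 1999\n"
    "Subject: Numero Uno\n"
    "\n"
    "From thorfinn@example.org Mon Sep  7 19:35:34 1998\n"
    "Newsgroups: alt.sysadmin.recovery\n"
    "\n"
)
parts = naive_split(sent)
```

Here parts holds two entries although sent contains a single message, which
is the "messages from people OTHER than me" effect.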

Mutt seems to use a bit more smarts.  See "is_from()" in from.c for
details.



-- 
Matthew Fuller (MF4839) |[EMAIL PROTECTED]
Unix Systems Administrator  |[EMAIL PROTECTED]
Specializing in FreeBSD |http://www.over-yonder.net/

"The only reason I'm burning my candle at both ends, is because I
  haven't figured out how to light the middle yet"



Re: Escaping "From" separator line in an mbox

2001-12-27 Thread David T-G

Philip --

...and then Philip Mak said...
% 
% On Wed, 26 Dec 2001, David T-G wrote:
% 
% > Your MDA will also escape any ^From_ in the body to avoid confusion with
% > a message separator line -- if it's delivering to an mbox file.
% 
% That doesn't seem to be true. For example, in one of my sent-mail files
% from pine, I saw this line (there was no ">" before it):
% 
% From 66.28.28.22: Destination Host Unreachable

Very interesting...


% 
% pine knows not to recognize it as a "From" line, so I'm thinking that pine
% makes sure that it also has a date like "Mon Nov 26 06:33:50 2001" on it.

Hmmm...  It certainly might really do that, but it might also honor
the Content-Length: header and only look for a new message at that many
bytes forward of the beginning of the last one.  I thought that only
Sun's dtmail did that (and I know that it does it buggily, which is why
everyone in the Sun circles recommends that you turn off that feature
and go back to seeing ^>From_ in the message body since dtmail doesn't
speak maildir).  You might see if there's a C-L: header and, if so,
copy a couple of test messages with this one in the middle off to a test
mailbox, get rid of the header, and see if it breaks...


% My current best guess for a regexp to match a message separator line is
% this:
% 
% /^From (\s*[^ ]+\s+... ... .. ..:..:.. )/
% 
% but I'm wondering if there might be obscure cases in which it breaks.

I dunno; I've only ever been simple enough to have been fooled by your
example above :-)


:-D
-- 
David T-G  * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/    Shpx gur Pbzzhavpngvbaf Qrprapl Npg!






Re: Escaping "From" separator line in an mbox

2001-12-26 Thread Philip Mak

On Wed, 26 Dec 2001, David T-G wrote:

> Your MDA will also escape any ^From_ in the body to avoid confusion with
> a message separator line -- if it's delivering to an mbox file.

That doesn't seem to be true. For example, in one of my sent-mail files
from pine, I saw this line (there was no ">" before it):

From 66.28.28.22: Destination Host Unreachable

pine knows not to recognize it as a "From" line, so I'm thinking that pine
makes sure that it also has a date like "Mon Nov 26 06:33:50 2001" on it.
My current best guess for a regexp to match a message separator line is
this:

/^From (\s*[^ ]+\s+... ... .. ..:..:.. )/

but I'm wondering if there might be obscure cases in which it breaks.
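One quick way to probe the pattern is to run it over a few candidate lines.
The Python sketch below uses the regexp exactly as posted (minus the
slashes); the addresses and the last, deliberately nonsensical line are made
up for the test.

```python
import re

# The pattern as posted, without the surrounding slashes.
SEPARATOR = re.compile(r"^From (\s*[^ ]+\s+... ... .. ..:..:.. )")

candidates = [
    "From user@example.com Mon Nov 26 06:33:50 2001",
    "From user@example.com Mon Sep  7 19:35:34 1998",
    "From 66.28.28.22: Destination Host Unreachable",
    "From nobody Xxx Yyy 99 99:99:99 oops",
]
results = [bool(SEPARATOR.match(line)) for line in candidates]
```

The first two lines match and the "Destination Host Unreachable" line does
not. The last line also matches, since the dots accept any characters, so
the pattern checks only the layout of the timestamp and not its content.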




Re: Escaping "From" separator line in an mbox

2001-12-26 Thread David T-G

Philip, et al --

...and then Philip Mak said...
% 
% Regarding the "From [EMAIL PROTECTED] Wed Jun 06 18:44:53
% 2001" lines in an mbox file...

Yep.  Note that they're only in an mbox file, too.


% 
% What is the regular expression for matching whether the line in an mbox
% file is the beginning of a new message?

Your MDA will put that ^From_ line when it delivers to an mbox file, but
it won't otherwise (check a Maildir message's file to see).

Your MDA will also escape any ^From_ in the body to avoid confusion with
a message separator line -- if it's delivering to an mbox file.

Thus, it should be sufficient to match on any ^From_ line as long as
you're working with an mbox file (which you can confirm by checking the
very first line of the file, which should tell you one way or another
regardless of whether or not the mbox file has one or more messages in
it) and then also ignore any ^>From_ that you might find, and not worry
about ^From_ if you're not in an mbox file.
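The first-line check could be sketched in Python as follows. The helper name
and the file names are hypothetical, and an empty file (a valid empty mbox)
is reported as "not mbox" by this simple test.

```python
import os
import tempfile


def looks_like_mbox(path: str) -> bool:
    """Guess whether a file is an mbox by looking for a leading From_ line."""
    with open(path, "rb") as handle:
        return handle.readline().startswith(b"From ")


with tempfile.TemporaryDirectory() as tmp:
    mbox_path = os.path.join(tmp, "sent")
    with open(mbox_path, "wb") as handle:
        handle.write(
            b"From me@example.com Wed Jun 06 18:44:53 2001\n"
            b"Subject: hi\n\nbody\n"
        )
    # A Maildir message file starts straight with headers, no From_ line.
    maildir_path = os.path.join(tmp, "1009400000.1234.host")
    with open(maildir_path, "wb") as handle:
        handle.write(b"Return-Path: <me@example.com>\nSubject: hi\n\nbody\n")
    mbox_result = looks_like_mbox(mbox_path)
    maildir_result = looks_like_mbox(maildir_path)
```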


HTH & HAND & Happy Holidays to all

:-D
-- 
David T-G  * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/    Shpx gur Pbzzhavpngvbaf Qrprapl Npg!






Escaping "From" separator line in an mbox

2001-12-26 Thread Philip Mak

Regarding the "From [EMAIL PROTECTED] Wed Jun 06 18:44:53
2001" lines in an mbox file...

What is the regular expression for matching whether the line in an mbox
file is the beginning of a new message?

What is the regular expression for matching lines like ">From" that should
have the ">" removed before being displayed?

I've been trying to figure it out, but I couldn't find an RFC on it. It
seems to be more complicated than simply /^From ./. This is what I've come
up with so far, but I may be wrong:

/^From (\s*[^ ]+\s+... ... .. ..:..:.. )/

I have the feeling that not all MUAs/MTAs are consistent in how they
handle this, because e.g. when I send an e-mail to a mailing list that has
a line beginning with "From", when I get my message back it turns into
">From" (when being displayed by the MUA to me)!

Detecting the former is more important than the latter, since if I get the
latter wrong, it just means an extra ">" or a missing ">" in the message,
which doesn't matter unless it was a binary encoded file that had "From"
at the beginning of a line (unlikely). But if I get the former wrong, then
a whole message can get messed up.