Re: Escaping From separator line in an mbox

2001-12-27 Thread David T-G

Philip --

...and then Philip Mak said...
% 
% On Wed, 26 Dec 2001, David T-G wrote:
% 
% > Your MDA will also escape any ^From_ in the body to avoid confusion with
% > a message separator line -- if it's delivering to an mbox file.
% 
% That doesn't seem to be true. For example, in one of my sent-mail files
% from pine, I saw this line (there was no > before it):
% 
% From 66.28.28.22: Destination Host Unreachable

Very interesting...


% 
% pine knows not to recognize it as a From line, so I'm thinking that pine
% makes sure that it also has a date like Mon Nov 26 06:33:50 2001 on it.

Hmmm...  It certainly might really do that, but it might also honor
the Content-Length: header and only look for a new message at n
bytes forward of the beginning of the last one.  I thought that only
Sun's dtmail did that (and I know that it does it buggily, which is why
everyone in the Sun circles recommends that you turn off that feature
and go back to seeing ^From_ in the message body since dtmail doesn't
speak maildir).  You might see if there's a C-L: header and, if so,
copy a couple of test messages with this one in the middle off to a test
mailbox, get rid of the header, and see if it breaks...


% My current best guess for a regexp to match a message separator line is
% this:
% 
% /^From (\s*[^ ]+\s+... ... .. ..:..:.. )/
% 
% but I'm wondering if there might be obscure cases in which it breaks.

I dunno; I've only ever been simple enough to have been fooled by your
example above :-)


:-D
-- 
David T-G  * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/    Shpx gur Pbzzhavpngvbaf Qrprapl Npg!






Re: Escaping From separator line in an mbox

2001-12-27 Thread Matthew D. Fuller

On Wed, Dec 26, 2001 at 09:22:33PM -0500 I heard the voice of
David T-G, and lo! it spake thus:
 
> Thus, it should be sufficient to match on any ^From_ line as long as
> you're working with an mbox file (which you can confirm by checking the
> very first line of the file, which should tell you one way or another
> regardless of whether or not the mbox file has one or more messages in
> it) and then also ignore any ^From_ that you might find, and not worry
> about ^From_ if you're not in an mbox file.

Note that this can (also) break.

I was just testing some mbox-parsing code the other day, and I needed a
quick mbox of reasonable size to test it against.  Hey, how about
~/mail/sent?

But it's got bare ^From  lines  in mid-message where they 'naturally'
appeared.  So, either you need a bit more smarts than just ^From , or
mutt doesn't write 'sent' as a true mbox.

The 'mbox' manpage from qmail says:
---
MESSAGE FORMAT
 A message encoded in mbox format begins with a  From_  line,
 continues  with a series of non-From_ lines, and ends with a
 blank line.  A From_ line means any line  that  begins  with
 the characters F, r, o, m, space:

 [...]
---

Which seems to imply the POV that ^From  should be a sufficient pattern
(in which case, watch out for your sent box!)
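
A minimal Python sketch of what that reading of the manpage amounts to: every line beginning with the five characters "From " is counted as the start of a message. The function name and the sample text are made up for illustration.

```python
def count_from_lines(text: str) -> int:
    """Count lines that begin with the five characters 'From '."""
    return sum(1 for line in text.splitlines() if line.startswith("From "))


# Hypothetical one-message mbox whose body has a bare "From " line.
sample = (
    "From me@example.com Tue Jan 12 08:05:47 1999\n"
    "Subject: a note\n"
    "\n"
    "From here on, things get strange.\n"
)

print(count_from_lines(sample))  # 2, although only one message was stored
```

With a sent box that contains unescaped body lines like this, a splitter built on this rule alone would report more messages than were ever written.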

Mutt seems to use a bit more smarts.  See is_from() in from.c for
details.



-- 
Matthew Fuller (MF4839) |[EMAIL PROTECTED]
Unix Systems Administrator  |[EMAIL PROTECTED]
Specializing in FreeBSD |http://www.over-yonder.net/

The only reason I'm burning my candle at both ends, is because I
  haven't figured out how to light the middle yet



Re: Escaping From separator line in an mbox

2001-12-27 Thread David T-G

Matthew, et al --

...and then Matthew D. Fuller said...
% 
% On Wed, Dec 26, 2001 at 09:22:33PM -0500 I heard the voice of
% David T-G, and lo! it spake thus:
%  
% > Thus, it should be sufficient to match on any ^From_ line as long as
% > you're working with an mbox file (which you can confirm by checking the
...
% 
% Note that this can (also) break.

So I hear!


% 
% I was just testing some mbox-parsing code the other day, and I needed a
% quick mbox of reasonable size to test it against.  Hey, how about
% ~/mail/sent?

One would think so...


% 
% But it's got bare ^From  lines  in mid-message where they 'naturally'
% appeared.  So, either you need a bit more smarts than just ^From , or
% mutt doesn't write 'sent' as a true mbox.

And I trust that this all works when you open it with mutt, right?  [Hey,
it never hurts to check.]


% 
% The 'mbox' manpage from qmail says:
% ---
% MESSAGE FORMAT
%  A message encoded in mbox format begins with a  From_  line,
%  continues  with a series of non-From_ lines, and ends with a
%  blank line.  A From_ line means any line  that  begins  with
%  the characters F, r, o, m, space:
% 
%  [...]
% ---
% 
% Which seems to imply the POV that ^From  should be a sufficient pattern
% (in which case, watch out for your sent box!)

Yes, indeed.


% 
% Mutt seems to use a bit more smarts.  See is_from() in from.c for
% details.

At the very least, Philip now has a more solid regexp definition:

  From [ return-path ] weekday month day time [ timezone ] year

would probably turn into something like

  ^From ([^\t\s@][^\t\s@]*@[^\t\s@][^\t\s@]*\.[^\t\s@][^\t\s@]*|)  \
(Sun|Mon|Tue|Wed|Thu|Fri|Sat) \
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \
[\s1-3][0-9] [01][0-9]:[0-5][0-9]:[0-5][0-9] \
([A-Z][A-Z][A-Z] |) [0-9][0-9][0-9][0-9]

(yes, I've faked it with line breaks just to keep things readable; note
the two spaces at the end of the first line although it may not really
matter and [\s]* should perhaps be used instead).  No, I'm not going into
MIME-encoding of the header as seen in some ^From: lines.  No, this
doesn't allow for leap seconds (but *probably* all one needs is to add a
6 to the seconds regexp).  No, this will break at year 1; apparently
y2k taught me nothing :-)
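
For anyone who wants to experiment, the layout above could be rendered in Python roughly as follows. This is a sketch rather than a transcription of the expression above: it accepts any run of whitespace between fields, treats the return-path as any non-blank token, allows hours 20-23 (which [01][0-9] would reject) and a seconds value of 60. The names FROM_LINE and is_from_line and the example addresses are made up.

```python
import re

FROM_LINE = re.compile(
    "".join([
        r"^From ",
        r"(?:\S+\s+)?",                          # optional return-path
        r"(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)\s+",   # weekday
        r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+",  # month
        r"[0-3]?[0-9]\s+",                       # day of month, one or two digits
        r"[0-2][0-9]:[0-5][0-9]:[0-6][0-9]\s+",  # time, 60 allowed for a leap second
        r"(?:[A-Z]{3}\s+)?",                     # optional three-letter timezone
        r"[0-9]{4}",                             # four-digit year
    ])
)


def is_from_line(line: str) -> bool:
    """Return True if the line has the shape of an mbox separator."""
    return FROM_LINE.match(line) is not None


print(is_from_line("From user@example.com Mon Sep  7 19:35:34 1998"))
print(is_from_line("From 66.28.28.22: Destination Host Unreachable"))
```

The first call prints True and the second False. Note that a syntax check like this says nothing about whether a well-formed line sits in a header position or inside a message body.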


% 
% -- 
% Matthew Fuller (MF4839) |[EMAIL PROTECTED]
% Unix Systems Administrator  |[EMAIL PROTECTED]
% Specializing in FreeBSD |http://www.over-yonder.net/
% 
% The only reason I'm burning my candle at both ends, is because I
%   haven't figured out how to light the middle yet

HTH  HAND  Happy Holidays to all


:-D
-- 
David T-G  * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/    Shpx gur Pbzzhavpngvbaf Qrprapl Npg!






Re: Escaping From separator line in an mbox

2001-12-27 Thread Matthew D. Fuller

On Thu, Dec 27, 2001 at 06:39:56AM -0500 I heard the voice of
David T-G, and lo! it spake thus:
> % 
> % I was just testing some mbox-parsing code the other day, and I needed a
> % quick mbox of reasonable size to test it against.  Hey, how about
> % ~/mail/sent?
> 
> One would think so...
> 
> 
> % 
> % But it's got bare ^From  lines  in mid-message where they 'naturally'
> % appeared.  So, either you need a bit more smarts than just ^From , or
> % mutt doesn't write 'sent' as a true mbox.
> 
> And I trust that this all works when you open it with mutt, right?  [Hey,
> it never hurts to check.]

It works just fine with mutt.
And your regex will break on it too.  For instance:
(from forwarding on a newsgroup post, some names changed to protect the
guilty)
---
From [EMAIL PROTECTED] Tue Jan 12 08:05:47 1999
Message-ID: [EMAIL PROTECTED]
Date: Tue, 12 Jan 1999 08:05:47 -0600
From: Me [EMAIL PROTECTED]
To: You [EMAIL PROTECTED]
Subject: Numero Uno from Matt's Arhives
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.91.1i
X-WorldsBestEditor: vi
Status: RO
Content-Length: 4877
Lines: 103


From [EMAIL PROTECTED] Mon Sep  7 19:35:34 1998
Path:
news.futuresouth.com!news.futuresouth.com!dca1-feed3.news.digex.net!digex!
newsfeed.axxsys.net!newspump.monmouth.com!newspeer.monmouth.com!intgwpad.nntp.te
lstra.net!nsw.nntp.telstra.net!news.syd.connect.com.au!news.mel.connect.com.au!u
nico.com.au!thorfinn
From: [EMAIL PROTECTED] (Thorfinn)
Newsgroups: alt.sysadmin.recovery

[...]
---

Mutt, I guess, outsmarts the mbox by reading Content-Length:, which you'd
pretty much have to do I guess.  To me, it just seems like putting too
much trust in the LDA, whatever that may be, but...  Then again, why not
trust?  mbox is fragile as hell anyway, what's one more shaky assumption?
;)



-- 
Matthew Fuller (MF4839) |[EMAIL PROTECTED]
Unix Systems Administrator  |[EMAIL PROTECTED]
Specializing in FreeBSD |http://www.over-yonder.net/

The only reason I'm burning my candle at both ends, is because I
  haven't figured out how to light the middle yet



Re: Escaping From separator line in an mbox

2001-12-27 Thread David T-G

Matthew --

...and then Matthew D. Fuller said...
% 
% On Thu, Dec 27, 2001 at 06:39:56AM -0500 I heard the voice of
% David T-G, and lo! it spake thus:
% > % 
% > % But it's got bare ^From  lines  in mid-message where they 'naturally'
% > % appeared.  So, either you need a bit more smarts than just ^From , or
% > % mutt doesn't write 'sent' as a true mbox.
% > 
% > And I trust that this all works when you open it with mutt, right?  [Hey,
% > it never hurts to check.]
% 
% It works just fine with mutt.

That's good :-)


% And your regex will break on it too.  For instance:
% (from forwarding on a newsgroup post, some names changed to protect the
% guilty)

[snipped]
Because of the single space before the day in each header, right?  If
that's the case note that I noted it and didn't guarantee it ;-)


% 
% Mutt, I guess, outsmarts the mbox by reading Content-Length:, which you'd

Ahhh...  That would do it.  You ought to try my C-L: strip suggestion to
see if that's the case and how it breaks otherwise.

I wonder if that's a compile-time option.  That is, I wonder if my
version supports it, too.  Since I haven't told my MDA/LDA to do so, I
don't think it's used in favor of ^From_ when the messages arrive in
either case, but we can try some pathological examples to find out...


% pretty much have to do I guess.  To me, it just seems like putting too
% much trust in the LDA, whatever that may be, but...  Then again, why not
% trust?  mbox is fragile as hell anyway, what's one more shaky assumption?
% ;)

*grin*


% 
% -- 
% Matthew Fuller (MF4839) |[EMAIL PROTECTED]
% Unix Systems Administrator  |[EMAIL PROTECTED]
% Specializing in FreeBSD |http://www.over-yonder.net/
% 
% The only reason I'm burning my candle at both ends, is because I
%   haven't figured out how to light the middle yet

Thanks again!


:-D
-- 
David T-G  * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/    Shpx gur Pbzzhavpngvbaf Qrprapl Npg!






Re: Escaping From separator line in an mbox

2001-12-27 Thread Matthew D. Fuller

On Thu, Dec 27, 2001 at 06:54:57AM -0500 I heard the voice of
David T-G, and lo! it spake thus:
 
> % And your regex will break on it too.  For instance:
> 
> [snipped]
> Because of the single space before the day in each header, right?  If
> that's the case note that I noted it and didn't guarantee it ;-)

No, because the first 'content' line of the body of the message is an
unescaped otherwise-valid From_ line.  Using your regex (or the more
simplistic /^From /), it would be identified as a separate message,
rather than part of the actual message that it is.  The ONLY way to 'get
it right' that I can see is to trust the Content-Length: header.
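
A minimal Python sketch of that idea, assuming Content-Length counts the body bytes that follow the blank line after the headers. It is not taken from mutt's source; the function names and sample messages are made up. The header is trusted only when skipping that many bytes lands on end-of-file or on another "From " line; otherwise the code falls back to the naive scan.

```python
def content_length(headers: bytes):
    """Return the Content-Length value from a header block, or None."""
    for line in headers.split(b"\n"):
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"content-length":
            try:
                return int(value.strip())
            except ValueError:
                return None
    return None


def split_mbox(data: bytes) -> list:
    """Split mbox bytes into messages, preferring Content-Length over From_."""
    messages = []
    pos = 0
    while pos < len(data):
        header_end = data.find(b"\n\n", pos)
        if header_end == -1:              # no body: the rest is one message
            messages.append(data[pos:])
            break
        body_start = header_end + 2
        end = None
        length = content_length(data[pos:header_end])
        if length is not None and 0 <= length <= len(data) - body_start:
            candidate = body_start + length
            rest = data[candidate:].lstrip(b"\n")
            # Trust the header only if it lands on EOF or on a From_ line.
            if not rest or rest.startswith(b"From "):
                end = candidate
        if end is None:                   # fall back to the naive From_ scan
            nxt = data.find(b"\n\nFrom ", header_end)
            end = len(data) if nxt == -1 else nxt + 1
        messages.append(data[pos:end])
        pos = end
        while data[pos:pos + 1] == b"\n":  # skip blank separator lines
            pos += 1
    return messages


# Hypothetical two-message mbox; the first body starts with a From_ line.
body = b"From someone@example.org Mon Sep  7 19:35:34 1998\nforwarded text\n"
first = (b"From me@example.com Tue Jan 12 08:05:47 1999\n"
         b"Content-Length: " + str(len(body)).encode() + b"\n\n" + body)
second = b"From me@example.com Wed Jan 13 09:00:00 1999\nSubject: two\n\nhello\n"
mbox = first + b"\n" + second
stripped = mbox.replace(b"Content-Length: " + str(len(body)).encode() + b"\n", b"")

print(len(split_mbox(mbox)))      # 2
print(len(split_mbox(stripped)))  # 3: the forwarded From_ line splits a message
```

The second call is the "strip the header and see if it breaks" experiment: without Content-Length the sketch has nothing but the From_ scan to go on.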

(The problem that cropped up in my test parse. Hey, this is my 'sent'
folder...  why are there messages from people OTHER than me?
Waitaminute)



-- 
Matthew Fuller (MF4839) |[EMAIL PROTECTED]
Unix Systems Administrator  |[EMAIL PROTECTED]
Specializing in FreeBSD |http://www.over-yonder.net/

The only reason I'm burning my candle at both ends, is because I
  haven't figured out how to light the middle yet



Re: Escaping From separator line in an mbox

2001-12-27 Thread Philip Mak

On Thu, 27 Dec 2001, Matthew D. Fuller wrote:

> Mutt, I guess, outsmarts the mbox by reading Content-Length:, which you'd
> pretty much have to do I guess.  To me, it just seems like putting too
> much trust in the LDA, whatever that may be, but...  Then again, why not
> trust?  mbox is fragile as hell anyway, what's one more shaky assumption?
> ;)

Looking in my sent-mail folder from pine that had a message with unescaped
"From 66.28.28.22: Destination Host Unreachable", it did not have a
Content-Length header. Here are the headers for that message:

From [EMAIL PROTECTED] Sat Nov 10 03:17:28 2001 -0500
Date: Sat, 10 Nov 2001 03:17:28 -0500 (EST)
From: Philip Mak [EMAIL PROTECTED]
X-Sender:  [EMAIL PROTECTED]
To:  [EMAIL PROTECTED]
cc: James Ventrillo [EMAIL PROTECTED],
Mike Little [EMAIL PROTECTED]
Subject: IP address problems on buildreferrals.com
Message-ID: [EMAIL PROTECTED]
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: O
X-Status:
X-Keywords:
X-UID: 47

I'm guessing that mutt/pine/etc. use some best effort heuristics to
determine when a From  line is a message separator. For example:

- A message separator only occurs after a blank line.
- A message separator contains From , an envelope sender address (as
  defined in RFC822 appendix d addrspec), whitespace, and a timestamp
  (weekday month day time [timezone] year).
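
The two heuristics above can be sketched in Python as follows. The pattern is deliberately loose: it stands in for the RFC 822 addr-spec with "any non-blank token" and only checks the shape of the timestamp. The names SEPARATOR and separator_indexes and the sample lines are made up.

```python
import re

# Loose syntax check: "From ", a sender token, then weekday month day time.
SEPARATOR = re.compile(
    r"^From \S+\s+[A-Z][a-z]{2}\s+[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s"
)


def separator_indexes(lines):
    """Return indexes of lines that pass both heuristics."""
    found = []
    for i, line in enumerate(lines):
        # Heuristic 1: a separator is the first line or follows a blank line.
        after_blank = i == 0 or lines[i - 1].strip() == ""
        # Heuristic 2: the line has separator syntax.
        if after_blank and SEPARATOR.match(line):
            found.append(i)
    return found


lines = [
    "From me@example.com Sat Nov 10 03:17:28 2001 -0500",
    "Subject: IP address problems",
    "",
    "From 66.28.28.22: Destination Host Unreachable",
    "",
    "From you@example.com Sun Nov 11 04:00:00 2001 -0500",
    "Subject: reply",
]

print(separator_indexes(lines))  # [0, 5]
```

A body line that follows a blank line and is itself a complete, well-formed separator (as in a forwarded message) would still be taken for a new message, so this remains best effort.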

It seems that there is not a reliable mechanism for unescaping From
lines; I've found out that if I send a message that says ">From" to myself
(using pine with mbox), it will become "From" in some cases. I'm guessing
this is one of those things that should have been standardized, but
everyone just did it ad hoc and now it's a mess.

man mbox on my system says:

   In order to avoid mis-interpretation of lines  in  message
   bodies  which  begin with the four characters "From", followed
   by a space character, the character ">" is commonly
   prepended in front of such lines.

It says "commonly prepended", which implies that it doesn't have to be. :(
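
To illustrate why unescaping is unreliable, here is a small Python sketch of two quoting schemes. The first prepends ">" only to lines that begin with "From "; the second (the scheme sometimes called mboxrd) also quotes lines that already begin with one or more ">" followed by "From ", which makes it reversible. The function names and sample text are made up.

```python
import re


def escape_simple(body: str) -> str:
    """Prepend '>' to lines starting with 'From '."""
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)


def unescape_simple(body: str) -> str:
    """Strip one '>' from lines starting with '>From '."""
    return re.sub(r"^>From ", "From ", body, flags=re.MULTILINE)


def escape_reversible(body: str) -> str:
    """Prepend '>' to lines matching '>*From ' (zero or more '>')."""
    return re.sub(r"^(>*From )", r">\1", body, flags=re.MULTILINE)


def unescape_reversible(body: str) -> str:
    """Strip one '>' from lines matching '>+From '."""
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.MULTILINE)


original = ">From the manual\nFrom here on\n"

print(unescape_simple(escape_simple(original)) == original)          # False
print(unescape_reversible(escape_reversible(original)) == original)  # True
```

With the simple scheme, a line the sender really wrote as ">From ..." comes back as "From ...", which matches the behaviour described above.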

So it would seem that for the mbox to Maildir conversion program that I'm
writing, the best thing that I can manage is to make it recognize a From
line as a message separator based on those two heuristics (preceding blank
line, and correct syntax)  above.




Re: Escaping From separator line in an mbox

2001-12-26 Thread David T-G

Philip, et al --

...and then Philip Mak said...
% 
% Regarding the From [EMAIL PROTECTED] Wed Jun 06 18:44:53
% 2001 lines in an mbox file...

Yep.  Note that they're only in an mbox file, too.


% 
% What is the regular expression for matching whether the line in an mbox
% file is the beginning of a new message?

Your MDA will put that ^From_ line when it delivers to an mbox file, but
it won't otherwise (check a Maildir message's file to see).

Your MDA will also escape any ^From_ in the body to avoid confusion with
a message separator line -- if it's delivering to an mbox file.

Thus, it should be sufficient to match on any ^From_ line as long as
you're working with an mbox file (which you can confirm by checking the
very first line of the file, which should tell you one way or another
regardless of whether or not the mbox file has one or more messages in
it) and then also ignore any ^From_ that you might find, and not worry
about ^From_ if you're not in an mbox file.
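
The first-line check could be sketched in Python like this. The function names are made up, and an empty file is reported as "not an mbox" here, which is an assumption of the sketch rather than anything the format requires.

```python
def starts_with_from(first_line: bytes) -> bool:
    """Return True if the line begins with the five bytes 'From '."""
    return first_line.startswith(b"From ")


def looks_like_mbox(path: str) -> bool:
    """Check only the very first line of the file."""
    with open(path, "rb") as handle:
        # readline() returns b"" for an empty file, so this yields False.
        return starts_with_from(handle.readline())
```

A Maildir message file normally begins with an ordinary header line instead, so the same check returns False for it.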


HTH  HAND  Happy Holidays to all

:-D
-- 
David T-G  * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/    Shpx gur Pbzzhavpngvbaf Qrprapl Npg!






Re: Escaping From separator line in an mbox

2001-12-26 Thread Philip Mak

On Wed, 26 Dec 2001, David T-G wrote:

> Your MDA will also escape any ^From_ in the body to avoid confusion with
> a message separator line -- if it's delivering to an mbox file.

That doesn't seem to be true. For example, in one of my sent-mail files
from pine, I saw this line (there was no > before it):

From 66.28.28.22: Destination Host Unreachable

pine knows not to recognize it as a From line, so I'm thinking that pine
makes sure that it also has a date like Mon Nov 26 06:33:50 2001 on it.
My current best guess for a regexp to match a message separator line is
this:

/^From (\s*[^ ]+\s+... ... .. ..:..:.. )/

but I'm wondering if there might be obscure cases in which it breaks.
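
A quick way to probe the guess above is to run it against a few lines in Python, where the expression works unchanged. The sample lines and the name GUESS are made up; the last sample is a contrived body line showing that the unconstrained dots accept text that is not a date.

```python
import re

GUESS = re.compile(r"^From (\s*[^ ]+\s+... ... .. ..:..:.. )")

samples = [
    "From user@example.com Mon Nov 26 06:33:50 2001",   # real separator
    "From 66.28.28.22: Destination Host Unreachable",   # body text
    "From Tom: see you at 10:30:00 sharp",              # body text
]

for line in samples:
    print(bool(GUESS.match(line)), line)
```

This prints True, False, True: the pine example is rejected as hoped, but the third line is accepted because "see you at" fits the three-three-two character slots meant for weekday, month and day.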