Re: Escaping "From" separator line in an mbox
On Thu, 27 Dec 2001, Matthew D. Fuller wrote:

> Mutt, I guess, outsmarts the mbox by reading Content-Length:, which you'd
> pretty much have to do I guess. To me, it just seems like putting too
> much trust in the LDA, whatever that may be, but... Then again, why not
> trust? mbox is fragile as hell anyway, what's one more shaky assumption?
> ;)

I looked in my pine sent-mail folder at the message containing the unescaped line "From 66.28.28.22: Destination Host Unreachable"; it did not have a Content-Length header. Here are the headers for that message:

From [EMAIL PROTECTED] Sat Nov 10 03:17:28 2001 -0500
Date: Sat, 10 Nov 2001 03:17:28 -0500 (EST)
From: Philip Mak <[EMAIL PROTECTED]>
X-Sender: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
cc: James Ventrillo <[EMAIL PROTECTED]>, Mike Little <[EMAIL PROTECTED]>
Subject: IP address problems on buildreferrals.com
Message-ID: <[EMAIL PROTECTED]>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: O
X-Status:
X-Keywords:
X-UID: 47

I'm guessing that mutt/pine/etc. use some best-effort heuristics to determine when a "From " line is a message separator. For example:

- A message separator only occurs after a blank line.
- A message separator contains "From ", an envelope sender address (as defined by "addr-spec" in RFC 822, Appendix D), whitespace, and a timestamp (weekday month day time [timezone] year).

It seems that there is no reliable mechanism for unescaping ">From" lines; I've found that if I send myself a message containing a line that begins with "From" (using pine with mbox), it becomes ">From" in some cases. I'm guessing this is one of those things that should have been standardized, but everyone just did it ad hoc and now it's a mess. "man mbox" on my system says:

    In order to avoid mis-interpretation of lines in message bodies
    which begin with the four characters "From", followed by a space
    character, the character ">" is commonly prepended in front of
    such lines.

It says "commonly prepended", which implies that it doesn't have to be. :( So for the mbox to Maildir conversion program that I'm writing, the best I can manage is to recognize a "From " line as a message separator based on the two heuristics above (a preceding blank line and correct syntax).
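The two heuristics above can be sketched in Python roughly as follows. This is only an illustration, not what mutt or pine actually do: the pattern FROM_LINE, the function split_mbox, and the sample addresses are made up here, and the sender is simplified to any run of non-blank characters rather than a full RFC 822 addr-spec.

```python
import re

# Assumed shape of a separator: "From ", a sender token, then a ctime-style
# timestamp with an optional timezone before the year. A trailing numeric
# offset such as "-0500" is also tolerated (an assumption of this sketch).
FROM_LINE = re.compile(
    r"^From \S+\s+"
    r"(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat) "
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"[ 0-3]?\d \d\d:\d\d:\d\d "
    r"(?:[A-Z]{3} )?\d{4}(?: [+-]\d{4})?\s*$"
)


def split_mbox(text):
    """Split mbox text into messages using both heuristics together."""
    messages, current = [], None
    prev_blank = True  # the start of the file counts as "after a blank line"
    for line in text.splitlines():
        if prev_blank and FROM_LINE.match(line):
            if current is not None:
                messages.append(current)
            current = [line]              # a new message starts here
        elif current is not None:
            current.append(line)          # header or body line
        prev_blank = (line.strip() == "")
    if current is not None:
        messages.append(current)
    return ["\n".join(m) for m in messages]
```

A body line such as "From 66.28.28.22: Destination Host Unreachable" fails the syntax check even when it follows a blank line, so it stays inside its message. A body line that does look like a complete separator and follows a blank line would still fool this.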
Re: Escaping "From" separator line in an mbox
On Thu, Dec 27, 2001 at 06:54:57AM -0500 I heard the voice of David T-G, and lo! it spake thus:
>
> % And your regex will break on it too. For instance:
>
> [snipped]
> Because of the single space before the day in each header, right? If
> that's the case note that I noted it and didn't guarantee it ;-)

No, because the first 'content' line of the body of the message is an unescaped, otherwise-valid From_ line. Using your regex (or the more simplistic /^From /), it would be identified as a separate message rather than as part of the message it actually belongs to.

The ONLY way to 'get it right' that I can see is to trust the Content-Length: header.

(The problem that cropped up in my test parse. "Hey, this is my 'sent' folder... why are there messages from people OTHER than me? Waitaminute")

--
Matthew Fuller (MF4839) |[EMAIL PROTECTED]
Unix Systems Administrator |[EMAIL PROTECTED]
Specializing in FreeBSD |http://www.over-yonder.net/

"The only reason I'm burning my candle at both ends, is because I haven't figured out how to light the middle yet"
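A rough Python sketch of what trusting Content-Length: could look like. It is not taken from mutt's source; the function name split_by_content_length is invented, it works on raw bytes with "\n" line endings, and it assumes the header counts exactly the bytes of the body. Where the header is missing it falls back to looking for a blank line followed by "From ", and it raises an error if a stated length lands anywhere other than on a From_ line.

```python
def split_by_content_length(data):
    """Split mbox bytes into messages, trusting Content-Length when present."""
    messages = []
    pos, end = 0, len(data)
    while pos < end:
        if data[pos:pos + 1] == b"\n":       # blank line between messages
            pos += 1
            continue
        if not data.startswith(b"From ", pos):
            raise ValueError("expected a From_ line at byte %d" % pos)
        start = pos
        header_end = data.find(b"\n\n", start)
        if header_end == -1:                 # headers only, no body
            messages.append(data[start:])
            break
        body_start = header_end + 2
        length = None
        for header in data[start:header_end].split(b"\n"):
            name, _, value = header.partition(b":")
            if name.strip().lower() == b"content-length" and value.strip().isdigit():
                length = int(value.strip())
        if length is not None and body_start + length <= end:
            pos = body_start + length        # jump over the body without scanning it
        else:
            # No usable header: fall back to the next blank line + "From ".
            nxt = data.find(b"\n\nFrom ", header_end + 1)
            pos = end if nxt == -1 else nxt + 2
        messages.append(data[start:pos])
    return messages
```

Because the body is skipped by byte count, a bare From_ line inside it is never examined. The price is the one already noted in this thread: everything depends on whoever wrote the header having counted correctly.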
Re: Escaping "From" separator line in an mbox
Matthew --

...and then Matthew D. Fuller said...
%
% On Thu, Dec 27, 2001 at 06:39:56AM -0500 I heard the voice of
% David T-G, and lo! it spake thus:
% > %
% > % But it's got bare "^From " lines in mid-message where they 'naturally'
% > % appeared. So, either you need a bit more smarts than just "^From ", or
% > % mutt doesn't write 'sent' as a true mbox.
% >
% > And I trust that this all works when you open it with mutt, right? [Hey,
% > it never hurts to check.]
%
% It works just fine with mutt.

That's good :-)

% And your regex will break on it too. For instance:
% (from forwarding on a newsgroup post, some names changed to protect the
% guilty)
[snipped]

Because of the single space before the day in each header, right? If that's the case note that I noted it and didn't guarantee it ;-)

%
% Mutt, I guess, outsmarts the mbox by reading Content-Length:, which you'd

Ahhh... That would do it. You ought to try my C-L: strip suggestion to see if that's the case and how it breaks otherwise.

I wonder if that's a compile-time option. That is, I wonder if my version supports it, too. Since I haven't told my MDA/LDA to do so, I don't think it's used in favor of ^>From_ when the messages arrive in either case, but we can try some pathological examples to find out...

% pretty much have to do I guess. To me, it just seems like putting too
% much trust in the LDA, whatever that may be, but... Then again, why not
% trust? mbox is fragile as hell anyway, what's one more shaky assumption?
% ;)

*grin*

%
% --
% Matthew Fuller (MF4839) |[EMAIL PROTECTED]
% Unix Systems Administrator |[EMAIL PROTECTED]
% Specializing in FreeBSD |http://www.over-yonder.net/
%
% "The only reason I'm burning my candle at both ends, is because I
% haven't figured out how to light the middle yet"

Thanks again!

:-D
--
David T-G * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/ Shpx gur Pbzzhavpngvbaf Qrprapl Npg!
Re: Escaping "From" separator line in an mbox
On Thu, Dec 27, 2001 at 06:39:56AM -0500 I heard the voice of David T-G, and lo! it spake thus:
> %
> % I was just testing some mbox-parsing code the other day, and I needed a
> % quick mbox of reasonable size to test it against. Hey, how about
> % ~/mail/sent?
>
> One would think so...
>
> %
> % But it's got bare "^From " lines in mid-message where they 'naturally'
> % appeared. So, either you need a bit more smarts than just "^From ", or
> % mutt doesn't write 'sent' as a true mbox.
>
> And I trust that this all works when you open it with mutt, right? [Hey,
> it never hurts to check.]

It works just fine with mutt.

And your regex will break on it too. For instance:
(from forwarding on a newsgroup post, some names changed to protect the guilty)

---
From [EMAIL PROTECTED] Tue Jan 12 08:05:47 1999
Message-ID: <[EMAIL PROTECTED]>
Date: Tue, 12 Jan 1999 08:05:47 -0600
From: Me <[EMAIL PROTECTED]>
To: You <[EMAIL PROTECTED]>
Subject: Numero Uno from Matt's Arhives
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.91.1i
X-WorldsBestEditor: vi
Status: RO
Content-Length: 4877
Lines: 103

From [EMAIL PROTECTED] Mon Sep 7 19:35:34 1998
Path: news.futuresouth.com!news.futuresouth.com!dca1-feed3.news.digex.net!digex!newsfeed.axxsys.net!newspump.monmouth.com!newspeer.monmouth.com!intgwpad.nntp.telstra.net!nsw.nntp.telstra.net!news.syd.connect.com.au!news.mel.connect.com.au!unico.com.au!thorfinn
From: [EMAIL PROTECTED] (Thorfinn)
Newsgroups: alt.sysadmin.recovery
[...]
---

Mutt, I guess, outsmarts the mbox by reading Content-Length:, which you'd pretty much have to do I guess. To me, it just seems like putting too much trust in the LDA, whatever that may be, but... Then again, why not trust? mbox is fragile as hell anyway, what's one more shaky assumption? ;)

--
Matthew Fuller (MF4839) |[EMAIL PROTECTED]
Unix Systems Administrator |[EMAIL PROTECTED]
Specializing in FreeBSD |http://www.over-yonder.net/

"The only reason I'm burning my candle at both ends, is because I haven't figured out how to light the middle yet"
Re: Escaping "From" separator line in an mbox
Matthew, et al --

...and then Matthew D. Fuller said...
%
% On Wed, Dec 26, 2001 at 09:22:33PM -0500 I heard the voice of
% David T-G, and lo! it spake thus:
% >
% > Thus, it should be sufficient to match on any ^From_ line as long as
% > you're working with an mbox file (which you can confirm by checking the
...
%
% Note that this can (also) break.

So I hear!

%
% I was just testing some mbox-parsing code the other day, and I needed a
% quick mbox of reasonable size to test it against. Hey, how about
% ~/mail/sent?

One would think so...

%
% But it's got bare "^From " lines in mid-message where they 'naturally'
% appeared. So, either you need a bit more smarts than just "^From ", or
% mutt doesn't write 'sent' as a true mbox.

And I trust that this all works when you open it with mutt, right? [Hey, it never hurts to check.]

%
% The 'mbox' manpage from qmail says:
% ---
% MESSAGE FORMAT
% A message encoded in mbox format begins with a From_ line,
% continues with a series of non-From_ lines, and ends with a
% blank line. A From_ line means any line that begins with
% the characters F, r, o, m, space:
%
% [...]
% ---
%
% Which seems to imply the POV that "^From " should be a sufficient pattern
% (in which case, watch out for your sent box!)

Yes, indeed.

%
% Mutt seems to use a bit more smarts. See "is_from()" in from.c for
% details.

At the very least, Philip now has a more solid regexp definition:

From [ ] [ ]

would probably turn into something like

^From ([^\t\s@][^\t\s@]*@[^\t\s@][^\t\s@]*\.[^\t\s@][^\t\s@]*|)  \
(Sun|Mon|Tue|Wed|Thu|Fri|Sat) \
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \
[\s1-3][0-9] [01][0-9]:[0-5][0-9]:[0-5][0-9] \
([A-Z][A-Z][A-Z] |) [0-9][0-9][0-9][0-9]

(yes, I've faked it with line breaks just to keep things readable; note the two spaces at the end of the first line although it may not really matter and [\s]* should perhaps be used instead).

No, I'm not going into MIME-encoding of the header as seen in some ^From: lines. No, this doesn't allow for leap seconds (but *probably* all one needs is to add a 6 to the seconds regexp). No, this will break at year 1; apparently y2k taught me nothing :-)

%
% --
% Matthew Fuller (MF4839) |[EMAIL PROTECTED]
% Unix Systems Administrator |[EMAIL PROTECTED]
% Specializing in FreeBSD |http://www.over-yonder.net/
%
% "The only reason I'm burning my candle at both ends, is because I
% haven't figured out how to light the middle yet"

HTH & HAND & Happy Holidays to all

:-D
--
David T-G * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/ Shpx gur Pbzzhavpngvbaf Qrprapl Npg!
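For anyone who wants to try the stricter pattern, here is an approximate Python rendering. It is a sketch, not a drop-in copy: the names SEPARATOR and is_separator are invented, the hour field accepts 00-23 and the day field accepts a leading space or a digit 0-3 (both are assumptions of this sketch), and any run of whitespace is allowed between the sender and the weekday.

```python
import re

SEPARATOR = re.compile(
    r"^From "
    r"(?:[^\s@]+@[^\s@]+\.[^\s@]+)?\s+"      # envelope sender; may be empty
    r"(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat) "
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"[ 0-3][0-9] "                          # day, space-padded or zero-padded
    r"(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9] "
    r"(?:[A-Z]{3} )?"                        # optional three-letter timezone
    r"[0-9]{4}\s*$"                          # four-digit year
)


def is_separator(line):
    """Return True if the line has the full From_ separator syntax."""
    return SEPARATOR.match(line) is not None
```

Syntax alone cannot settle the question, though: a forwarded message whose body contains a complete, unescaped From_ line passes this check just as a real separator does.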
Re: Escaping "From" separator line in an mbox
On Wed, Dec 26, 2001 at 09:22:33PM -0500 I heard the voice of David T-G, and lo! it spake thus:
>
> Thus, it should be sufficient to match on any ^From_ line as long as
> you're working with an mbox file (which you can confirm by checking the
> very first line of the file, which should tell you one way or another
> regardless of whether or not the mbox file has one or more messages in
> it) and then also ignore any ^>From_ that you might find, and not worry
> about ^From_ if you're not in an mbox file.

Note that this can (also) break.

I was just testing some mbox-parsing code the other day, and I needed a quick mbox of reasonable size to test it against. Hey, how about ~/mail/sent?

But it's got bare "^From " lines in mid-message where they 'naturally' appeared. So, either you need a bit more smarts than just "^From ", or mutt doesn't write 'sent' as a true mbox.

The 'mbox' manpage from qmail says:
---
MESSAGE FORMAT
A message encoded in mbox format begins with a From_ line,
continues with a series of non-From_ lines, and ends with a
blank line. A From_ line means any line that begins with
the characters F, r, o, m, space:

[...]
---

Which seems to imply the POV that "^From " should be a sufficient pattern (in which case, watch out for your sent box!)

Mutt seems to use a bit more smarts. See "is_from()" in from.c for details.

--
Matthew Fuller (MF4839) |[EMAIL PROTECTED]
Unix Systems Administrator |[EMAIL PROTECTED]
Specializing in FreeBSD |http://www.over-yonder.net/

"The only reason I'm burning my candle at both ends, is because I haven't figured out how to light the middle yet"
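A minimal Python sketch of the reading that the qmail manpage seems to imply, where every line starting with "From " opens a new message. The function name naive_split and the sample addresses are made up for illustration. Fed a 'sent'-style folder with a bare From_ line in a body, it reports two messages where only one was stored.

```python
def naive_split(text):
    """Treat every line that starts with 'From ' as a message separator."""
    messages = []
    for line in text.splitlines():
        if line.startswith("From "):
            messages.append([line])        # open a new message
        elif messages:
            messages[-1].append(line)      # continue the current one
    return messages


# One real message whose body happens to contain a valid-looking From_ line.
sent_folder = (
    "From me@example.com Tue Jan 12 08:05:47 1999\n"
    "Subject: forwarded post\n"
    "\n"
    "From poster@example.org Mon Sep  7 19:35:34 1998\n"
    "Newsgroups: alt.sysadmin.recovery\n"
)

print(len(naive_split(sent_folder)))  # 2, although only one message was stored
```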
Re: Escaping "From" separator line in an mbox
Philip --

...and then Philip Mak said...
%
% On Wed, 26 Dec 2001, David T-G wrote:
%
% > Your MDA will also escape any ^From_ in the body to avoid confusion with
% > a message separator line -- if it's delivering to an mbox file.
%
% That doesn't seem to be true. For example, in one of my sent-mail files
% from pine, I saw this line (there was no ">" before it):
%
% >From 66.28.28.22: Destination Host Unreachable

Very interesting...

%
% pine knows not to recognize it as a "From" line, so I'm thinking that pine
% makes sure that it also has a date like "Mon Nov 26 06:33:50 2001" on it.

Hmmm... It certainly might really do that, but it might also honor the Content-Length: header and only look for a new message that many bytes forward of the beginning of the last one. I thought that only Sun's dtmail did that (and I know that it does it buggily, which is why everyone in the Sun circles recommends that you turn off that feature and go back to seeing ^>From_ in the message body since dtmail doesn't speak maildir).

You might see if there's a C-L: header and, if so, copy a couple of test messages with this one in the middle off to a test mailbox, get rid of the header, and see if it breaks...

% My current best guess for a regexp to match a message separator line is
% this:
%
% /^From (\s*[^ ]+\s+... ... .. ..:..:.. )/
%
% but I'm wondering if there might be obscure cases in which it breaks.

I dunno; I've only ever been simple enough to have been fooled by your example above :-)

:-D
--
David T-G * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/ Shpx gur Pbzzhavpngvbaf Qrprapl Npg!
Re: Escaping "From" separator line in an mbox
On Wed, 26 Dec 2001, David T-G wrote:

> Your MDA will also escape any ^From_ in the body to avoid confusion with
> a message separator line -- if it's delivering to an mbox file.

That doesn't seem to be true. For example, in one of my sent-mail files from pine, I saw this line (there was no ">" before it):

From 66.28.28.22: Destination Host Unreachable

pine knows not to recognize it as a "From" line, so I'm thinking that pine makes sure that it also has a date like "Mon Nov 26 06:33:50 2001" on it.

My current best guess for a regexp to match a message separator line is this:

/^From (\s*[^ ]+\s+... ... .. ..:..:.. )/

but I'm wondering if there might be obscure cases in which it breaks.
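One way to hunt for such cases is to run the pattern over a few hand-made lines. The Python snippet below is only a test harness: the pattern is copied as posted, while the name looks_like_separator and the sample lines are invented. Since every "." matches any character, a body line that merely has the right word lengths and a time in it is accepted too.

```python
import re

# The pattern as posted; each "." matches any single character.
PHILIP_RE = re.compile(r"^From (\s*[^ ]+\s+... ... .. ..:..:.. )")


def looks_like_separator(line):
    """Return True if the posted pattern accepts the line."""
    return PHILIP_RE.match(line) is not None


lines = [
    "From philip@example.com Mon Nov 26 06:33:50 2001",   # real separator
    "From 66.28.28.22: Destination Host Unreachable",     # body text
    "From logs: the run 42 03:17:28 looks fine",          # body text
]
for line in lines:
    print(looks_like_separator(line), line)
```

The first line is accepted and the second is rejected, as intended. The third is ordinary prose and is accepted as well, which is the kind of obscure breakage to expect.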
Re: Escaping "From" separator line in an mbox
Philip, et al --

...and then Philip Mak said...
%
% Regarding the "From [EMAIL PROTECTED] Wed Jun 06 18:44:53
% 2001" lines in an mbox file...

Yep. Note that they're only in an mbox file, too.

%
% What is the regular expression for matching whether the line in an mbox
% file is the beginning of a new message?

Your MDA will put that ^From_ line when it delivers to an mbox file, but it won't otherwise (check a Maildir message's file to see). Your MDA will also escape any ^From_ in the body to avoid confusion with a message separator line -- if it's delivering to an mbox file.

Thus, it should be sufficient to match on any ^From_ line as long as you're working with an mbox file (which you can confirm by checking the very first line of the file, which should tell you one way or another regardless of whether or not the mbox file has one or more messages in it) and then also ignore any ^>From_ that you might find, and not worry about ^From_ if you're not in an mbox file.

HTH & HAND & Happy Holidays to all

:-D
--
David T-G * It's easier to fight for one's principles
(play) [EMAIL PROTECTED] * than to live up to them. -- fortune cookie
(work) [EMAIL PROTECTED]
http://www.justpickone.org/davidtg/ Shpx gur Pbzzhavpngvbaf Qrprapl Npg!
Escaping "From" separator line in an mbox
Regarding the "From [EMAIL PROTECTED] Wed Jun 06 18:44:53 2001" lines in an mbox file...

What is the regular expression for matching whether the line in an mbox file is the beginning of a new message?

What is the regular expression for matching lines like ">From" that should have the ">" removed before being displayed?

I've been trying to figure it out, but I couldn't find an RFC on it. It seems to be more complicated than simply /^From ./. This is what I've come up with so far, but I may be wrong:

/^From (\s*[^ ]+\s+... ... .. ..:..:.. )/

I have the feeling that not all MUAs/MTAs are consistent in how they handle this, because e.g. when I send an e-mail to a mailing list that has a line beginning with "From", when I get my message back it turns into ">From" (when being displayed by the MUA to me)!

Detecting the former is more important than the latter, since if I get the latter wrong, it just means an extra ">" or a missing ">" in the message, which doesn't matter unless it was a binary encoded file that had "From" at the beginning of a line (unlikely). But if I get the former wrong, then a whole message can get messed up.
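On the second question, one convention that can be reversed without guessing is to quote every body line matching ^>*From (followed by a space) when writing, and to strip exactly one ">" from every line matching ^>+From when reading; this is sometimes called the "mboxrd" style. A minimal Python sketch, assuming the file really was written that way; the function names are invented for illustration.

```python
import re

ANY_FROM = re.compile(r"^>*From ")      # "From ", ">From ", ">>From ", ...
QUOTED_FROM = re.compile(r"^>+From ")   # at least one ">" already present


def escape_body_line(line):
    """Writer side: add one '>' to any line that could be taken for From_."""
    return ">" + line if ANY_FROM.match(line) else line


def unescape_body_line(line):
    """Reader side: remove exactly one '>' from '>From ', '>>From ', ..."""
    return line[1:] if QUOTED_FROM.match(line) else line
```

With writers that only quote bare "From " lines, the ">" cannot be removed reliably, because a ">From " line in the file may have been typed that way by the sender.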