Re: Message body size limits? (Bigger Problem)
Earl,

Thanks for the advice. I ended up making do with what I had. I'm on a shared server with shell access, but the htmlcheck function kept crashing with an out-of-memory error. I even tried the mem switch, to no avail.

The issue was Google's poor wrapping of the emails that came from them. The emails that kept showing up as a problem were ones with an incomplete tag. They were trying to wrap the entire body content in a table, but failed to close the table properly.

I have a new issue on a mailman mail list archive. I'll start a new thread.

Tom
Re: Message body size limits? (Bigger Problem)
On Sat, Dec 3, 2011 at 11:51 PM, Alex Teslik wrote:
> OpenWebMail has HTML handling and HTML to text conversions specifically for
> email. They are tested and could probably be integrated into mhonarc with
> minimal effort.
>
> HTML handling/scrubbing:
> http://openwebmail.acatysmoof.com/dev/svnweb/index.pl/openwebmail/view/trunk/src/cgi-bin/openwebmail/modules/htmlrender.pl
>
> HTML->Text:
> http://openwebmail.acatysmoof.com/dev/svnweb/index.pl/openwebmail/view/trunk/src/cgi-bin/openwebmail/modules/htmltext.pl

It would be interesting to see what kind of test data has been used to verify how good it is at sanitizing data and how well it handles specially crafted large emails. A quick scan of some of the regexes indicates that some things may still get through.

If you are interested, you can examine the comments of mhonarc's mhtxthtml.pl filter to get an idea of the crap one has to deal with.

--ewh
Re: Message body size limits? (Bigger Problem)
On Sun, Dec 4, 2011 at 12:16 AM, Tom Hutchison wrote:
> Is it possible some malformed email could be causing a parsing error? What I
> am getting at. If I have 250 emails in a folder, how is it the run on the
> folder is writing 260. The extra ten being date and subject blank,
> sometimes, and sometimes, with or without content.

Your problem is likely related to the following FAQ entry: http://www.mhonarc.org/MHonArc/doc/faq/archives.html#split

> When the parser reads them, is it possible Mhonarc is picking up on
> malformed reply quotes and thinks they are new emails within the actual
> email? So instead of 4 emails in the above example, it thinks there are 6.
> Garbage in, garbage out comes to mind.
>
> I did solve the broken HTML, not very efficiently with Outlook 2010 as it
> does allow for a stripping of all HTML code by setting the open email to
> “edit” then choosing “plain text” after you edit anything in the body of

IIRC, Outlook allows a text/plain alternative to be generated along with the HTML part. You can use the MIMEALTPREFS resource, as noted in the FAQ, to give higher precedence to text/plain over text/html.

> the email. Even if it is just a carriage return or a space. Close the email
> and save on exit and the whole email is rewritten, stripping out all HTML
> and resetting the header information to show “text/plain” and whatever you
> have the encoding set to. Stripping out all HTML from the emails was the
> only way I could think of to solve the unclosed attribute in quite a
> few emails which was causing problems with the msgxxx.html pages.
>
> It’s long past time for standardized header and html format for email. If
> anything it might secure them more...

text/enriched was created a long time ago to provide enhanced formatting of email messages, but it faded away when the Web grew and HTML became the de facto markup format for "enriched" text.

IMO, it is inexcusable for major software/services organizations to generate such malformed HTML. Dealing with malicious HTML is one thing, but when HTML generated without malicious intent is so badly formatted (when it should not be), it makes the lives of consumers of such content much more difficult.

--ewh
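One way to check whether the extra, blank-subject messages come from splitting is to look for body lines that begin with "From " but do not look like real mbox separator lines. The following is only a rough sketch in Python, not part of mhonarc: the strict separator pattern, the function name, and the sample data are all assumptions made for illustration, and it presumes the folder is stored in mbox format.

```python
import re

# Assumed shape of a genuine mbox separator line:
#   "From sender Day Mon DD HH:MM:SS YYYY"
# Real mailboxes may vary; adjust the pattern to match your data.
STRICT_SEP = re.compile(
    r"^From \S+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"[ \d]\d \d\d:\d\d:\d\d \d{4}"
)


def scan_mbox_lines(lines):
    """Return (strict, loose) lists of 1-based line numbers.

    strict: lines that start with "From " and match STRICT_SEP.
    loose:  lines that start with "From " but do not match, i.e. body
            text that a separator test of plain "^From " would treat
            as the start of a new message.
    """
    strict, loose = [], []
    for num, line in enumerate(lines, start=1):
        if not line.startswith("From "):
            continue
        (strict if STRICT_SEP.match(line) else loose).append(num)
    return strict, loose


# Hypothetical two-message mailbox with one suspicious body line.
SAMPLE = [
    "From alice@example.com Sat Dec  3 08:57:00 2011",
    "Subject: first",
    "",
    "Hello",
    "From what I can tell, the archive is fine.",
    "",
    "From bob@example.com Sun Dec  4 00:16:00 2011",
    "Subject: second",
    "",
    "Bye",
]

strict_hits, loose_hits = scan_mbox_lines(SAMPLE)
print("separators:", strict_hits, "suspicious body lines:", loose_hits)
```

If the count of suspicious lines matches the number of extra messages in the index, the FAQ entry above is the place to look for the fix.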
Re: Message body size limits? (Bigger Problem)
Thanks for the reply.

Is it possible some malformed email could be causing a parsing error? What I am getting at is this: if I have 250 emails in a folder, how is it that the run on the folder is writing 260? The extra ten have a blank date and subject, sometimes with content and sometimes without.

Here is what I think might be happening.

Start
- Email 1
-- Email 2 with a malformed reply quote with some header info
-- Email 3 with a malformed reply quote with some header info
-- Email 4
--end of run

When the parser reads them, is it possible Mhonarc is picking up on malformed reply quotes and thinks they are new emails within the actual email? So instead of 4 emails in the above example, it thinks there are 6. Garbage in, garbage out comes to mind.

I did solve the broken HTML, though not very efficiently, with Outlook 2010. It allows stripping of all HTML code by setting the open email to “edit” and then choosing “plain text” after you edit anything in the body of the email, even if it is just a carriage return or a space. Close the email and save on exit, and the whole email is rewritten, stripping out all HTML and resetting the header information to show “text/plain” and whatever you have the encoding set to. Stripping out all HTML from the emails was the only way I could think of to solve the unclosed attribute in quite a few emails, which was causing problems with the msgxxx.html pages.

It’s long past time for a standardized header and HTML format for email. If anything it might secure them more...

Thanks
Tom
Re: Message body size limits? (Bigger Problem)
On Sat, Dec 3, 2011 at 4:00 PM, Tom wrote:
> Is Mhonarc not really supported any more or is this list dead?

This list is not dead, but does not get much traffic.

As for size limits, mhonarc is limited by available RAM, since each message processed is loaded into memory. This is not efficient, but it is a left-over from the initial code base. Making things more efficient would require a considerable rewrite of the core parsing code. There is the -savemem option that can be used to try to reduce the overall memory footprint (it stops mhonarc from holding all new message data in RAM), but it does slow things down because of the extra file I/O.

As for HTML processing, there is a history of performance issues with the regexes used to strip out markup that can cause security problems, with the possibility of much memory being consumed. This is somewhat an artifact of the regexes and perl's regex engine, but I lack the knowledge of perl's regex internals. Over time, attempts have been made to use regexes that will not trigger engine problems.

IMO, HTML mail is horrendous, and the horrible lack of conformance by major mail providers just contributes to the security problems HTML mail presents. The latest release clamped down on what HTML is acceptable, since XSS vulnerabilities were a risk (and were proven) in previous releases.

The FAQ provides information about the security problems of HTML mail and how to mitigate them. If the solutions provided are not sufficient for your needs, mhonarc is explicitly designed to allow you to register your own handler for HTML messages, giving you full control of how they are processed. You can see whether a 3rd party HTML parsing engine can be integrated.

I am personally sick of dealing with the markup atrocities of HTML mail, and have no incentive to do any more work on it. However, patches or alternate filters are always welcome and will be considered for inclusion in mhonarc itself if such contributions have no adverse effects.

--ewh
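To give an idea of what handing HTML to a real parser (rather than regexes) looks like, here is a minimal sketch using Python's standard html.parser module to reduce an HTML body to plain text, including a body with an unclosed table. It is not an mhonarc filter (those are written in Perl) and it is not a sanitizer; the class name, the function name, and the sample markup are hypothetical, and it would have to run as a separate pre-processing step on message bodies.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text; tolerate unclosed or badly nested tags."""

    SKIP = {"script", "style"}          # content of these is dropped
    BREAK = {"br", "p", "div", "tr", "table", "li"}  # start a new line

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BREAK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the text content of markup, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    raw = "".join(parser.parts)
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical body: the table and the <b> element are never closed.
BROKEN = (
    "<table><tr><td>Hello <b>world</td></tr>"
    "<script>alert(1)</script><p>Bye"
)

print(html_to_text(BROKEN))
```

Because the parser is event based and never tries to match every start tag with an end tag, the missing table close does not stop text extraction. Whether that is acceptable depends on how much of the original formatting needs to survive.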
Re: Message body size limits? (Bigger Problem)
Is Mhonarc not really supported any more or is this list dead?

Tom

On Dec 3, 2011, at 8:57 AM, Tom wrote:
> I should have stayed away from Google Groups. I liked the feature of having
> a forum they could post from or just email the group. What I don't like now
> is the horrendous HTML of their output emails from their first couple of
> years when groups was new. Broken tags, especially " Exclusion list, don't get me started- Authenticate-x, Mail-list name,
> Domain-key verified, etc... It was almost like they were making them up when
> they started. 2010 and 2011 emails are a lot better.
>
> Ok I know they messed up a lot of things too, but Microsoft has a pretty
> powerful email re-writing script in Outlook 2010. You can open a message,
> enable editing, strip out HTML, then resave it. Seems to clean up and rewrite
> the header a little better on the resave.
>
> What has me perplexed now is the 18 unknown messages. My error log has
> dropped significantly. These 18 seem to be truncated emails of longer
> emails. I have 287 emails in the folder being read, writing 305 emails on
> the run output, exactly the 18 unknown, subject, date, being written at the
> end of my index.html page.
>
> Tom
Re: Message body size limits? (Bigger Problem)
I should have stayed away from Google Groups. I liked the feature of having a forum they could post from or just email the group. What I don't like now is the horrendous HTML of their output emails from their first couple of years when groups was new. Broken tags, especially " [...] wrote:
>> Google should be shot, maybe Yahoo too since almost every
>> single message that is blank is from Yahoo.
>
> =v= GMail has led the way in making content more unreadable,
> (thereby breaking parsers), first by altering the message
> body to replace consecutive spaces with 8bit nonbreaking
> spaces, which of course triggers quoted-printable encoding;
> then by unnecessarily encoding a whole lot of characters,
> newlines in particular.
>
> =v= Yahoo! Mail has for years been pretty brilliant about
> compatibility, but the last upgrade copied GMail and done
> the same thing to newlines.
> <_Jym_>
Re: Message body size limits? (Bigger Problem)
> Google should be shot, maybe Yahoo too since almost every
> single message that is blank is from Yahoo.

=v= GMail has led the way in making content more unreadable (thereby breaking parsers), first by altering the message body to replace consecutive spaces with 8bit nonbreaking spaces, which of course triggers quoted-printable encoding; then by unnecessarily encoding a whole lot of characters, newlines in particular.

=v= Yahoo! Mail has for years been pretty brilliant about compatibility, but the last upgrade copied GMail and did the same thing to newlines.

<_Jym_>
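The effect described above can be reproduced with Python's standard quopri module. This is only an illustration under assumptions: the helper that swaps a space for a nonbreaking space is a hypothetical imitation of what the mail service is said to do, and UTF-8 is assumed as the character set.

```python
import quopri

NBSP = "\u00a0"  # nonbreaking space, a non-ASCII character


def nbsp_spacing(text):
    """Hypothetical imitation: turn each pair of spaces into space + NBSP."""
    return text.replace("  ", " " + NBSP)


plain = "two  spaces"
altered = nbsp_spacing(plain)

# Pure ASCII text passes through quoted-printable untouched ...
plain_qp = quopri.encodestring(plain.encode("utf-8"))

# ... but the NBSP bytes must be escaped as =XX sequences.
altered_qp = quopri.encodestring(altered.encode("utf-8"))

# A consumer has to decode and then map NBSP back to a normal space.
decoded = quopri.decodestring(altered_qp).decode("utf-8")
restored = decoded.replace(NBSP, " ")

print(plain_qp)
print(altered_qp)
print(restored == plain)
```

The point is that one substituted character is enough to force the body out of plain 7bit text, so every downstream parser has to decode before it can look at the content.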
RE: Message body size limits? (Bigger Problem)
I ran an older year. Google should be shot, maybe Yahoo too, since almost every single message that is blank is from Yahoo.

I invoked the error log, but the process was running too fast to see, and I also couldn't scroll back far enough to see them all. Here is a sample of what I am getting.

    Warning: Rejecting Invalid HTML: Nested start tags
    Message-Id: <328830.92224...@web80202.mail.mud.yahoo.com>
    Message Subject: Re: Crawford/Crafford/Craford/Crayford/Cranford
    Message Number: 00015

    Warning: Empty body data generated:
    Message-Id: 328830.92224...@web80202.mail.mud.yahoo.com
    Message Subject: Re: Crawford/Crafford/Craford/Crayford/Cranford
    Message Number: 00015

In a mailbox with about 300 messages, I got about 40 of these errors. As a result, the content of those messages does not appear on the archive's HTML pages.

Any thoughts? Help really appreciated.

Thanks
Tom
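When the warnings scroll past too quickly, they can be captured to a file (for example by redirecting standard error) and summarized afterwards. Below is a rough sketch in Python that tallies warnings by reason and by the domain of the Message-Id. The exact layout of the log is an assumption based on the sample above, and the function name and sample text are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: "Warning: <reason>" followed, possibly on the next
# line, by "Message-Id: <id>" with or without angle brackets.
ENTRY = re.compile(
    r"Warning:\s*(?P<reason>.+?)\s*Message-Id:\s*<?(?P<msgid>[^>\s]+)>?",
    re.DOTALL,
)


def summarize(log_text):
    """Return (by_reason, by_domain) Counters for the warnings found."""
    by_reason = Counter()
    by_domain = Counter()
    for match in ENTRY.finditer(log_text):
        # Collapse whitespace and drop a trailing colon from the reason.
        reason = " ".join(match.group("reason").split()).rstrip(":")
        by_reason[reason] += 1
        # Everything after the last "@" is treated as the sending host.
        by_domain[match.group("msgid").rpartition("@")[2]] += 1
    return by_reason, by_domain


SAMPLE_LOG = """\
Warning: Rejecting Invalid HTML: Nested start tags
         Message-Id: <328830.92224...@web80202.mail.mud.yahoo.com>
         Message Subject: Re: Crawford/Crafford/Craford/Crayford/Cranford
         Message Number: 00015
Warning: Empty body data generated:
         Message-Id: 328830.92224...@web80202.mail.mud.yahoo.com
         Message Subject: Re: Crawford/Crafford/Craford/Crayford/Cranford
         Message Number: 00015
"""

reasons, domains = summarize(SAMPLE_LOG)
for reason, count in reasons.most_common():
    print(count, reason)
for domain, count in domains.most_common():
    print(count, domain)
```

A tally like this shows at a glance whether the rejected messages really do cluster on one provider.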