Re: Message body size limits? (Bigger Problem)

2011-12-16 Thread Tom Hutchison

Earl

Thanks for the advice. I ended up making do with what I had. I'm on a 
shared server with shell access, but the htmlcheck function kept crashing 
with an out of memory error. Even tried the mem switch to no avail. The 
issue was Google's poor wrapping of the emails from them. The emails that kept 
showing up as a problem were ones with an incomplete table tag. They were 
trying to wrap the entire body content in a table, but failed to close the 
table properly.


New issue on a mailman mail list archive. I'll start a new thread.

Tom 



Re: Message body size limits? (Bigger Problem)

2011-12-03 Thread Earl Hood
On Sat, Dec 3, 2011 at 11:51 PM, Alex Teslik wrote:

> OpenWebMail has HTML handling and HTML to text conversions specifically for
> email. They are tested and could probably be integrated into mhonarc with
> minimal effort.
>
> HTML handling/scrubbing:
> http://openwebmail.acatysmoof.com/dev/svnweb/index.pl/openwebmail/view/trunk/src/cgi-bin/openwebmail/modules/htmlrender.pl
>
> HTML->Text:
> http://openwebmail.acatysmoof.com/dev/svnweb/index.pl/openwebmail/view/trunk/src/cgi-bin/openwebmail/modules/htmltext.pl

It would be interesting to see what kind of test data has
been used to verify how good it is at sanitizing data and
how well it handles specially crafted large emails.

A quick scan of some of the regexes indicates some things
may still get through.  If you are interested, you can
examine the comments of mhonarc's mhtxthtml.pl filter to
get an idea of the crap one has to deal with.

--ewh



Re: Message body size limits? (Bigger Problem)

2011-12-03 Thread Earl Hood
On Sun, Dec 4, 2011 at 12:16 AM, Tom Hutchison wrote:
> Is it possible some malformed email could be causing a parsing error? What I
> am getting at is this: if I have 250 emails in a folder, how is it that the
> run on the folder writes 260? The extra ten have a blank date and subject,
> sometimes with content and sometimes without.

Your problem is likely related to the following FAQ entry:

http://www.mhonarc.org/MHonArc/doc/faq/archives.html#split

> When the parser reads them, is it possible Mhonarc is picking up on
> malformed reply quotes and thinks they are new emails within the actual
> email? So instead of 4 emails in the above example, it thinks there are 6.
> Garbage in, garbage out comes to mind.
>
> I did solve the broken HTML, not very efficiently, with Outlook 2010 as it
> does allow for stripping of all HTML code by setting the open email to
> “edit” then choosing “plain text”  after you edit anything in the body of

IIRC, Outlook allows a text/plain alternative to be generated along
with the HTML part.  You can use the MIMEALTPREFS resource, as noted
in the FAQ, to give higher precedence to text/plain over text/html.
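
As a rough illustration of what such a preference order means, here is a
small Python sketch. It is not mhonarc code or mhonarc configuration; the
sample message and the pick_part helper are made up for the example, and
it only shows the idea of choosing text/plain ahead of text/html when a
message carries both alternatives.

```python
from email import message_from_string

# Hypothetical sample: one multipart/alternative message carrying both a
# plain-text part and a (deliberately unclosed) HTML part.
RAW = (
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/alternative; boundary=\"b\"\n"
    "\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--b\n"
    "Content-Type: text/html\n"
    "\n"
    "<table><tr><td>html body\n"
    "--b--\n"
)


def pick_part(msg, prefs=("text/plain", "text/html")):
    """Return the leaf part whose content type comes first in prefs."""
    leaves = [p for p in msg.walk() if not p.is_multipart()]
    for wanted in prefs:
        for part in leaves:
            if part.get_content_type() == wanted:
                return part
    return None  # no part matched any preferred type


msg = message_from_string(RAW)
chosen = pick_part(msg)
print(chosen.get_content_type())
```

With text/plain listed first, the broken HTML part is never looked at;
reversing the order of prefs would select the HTML part instead.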

> the email. Even if it is just a carriage return or a space. Close the email
> and save on exit and the whole email is rewritten, stripping out all HTML
> and resetting the header information to show “text/plain” and whatever you
> have the encoding set to. Stripping out all HTML from the emails was the
> only way I could think of to solve the unclosed  attribute in quite a
> few emails which was causing problems with the msgxxx.html pages.
>
> It’s long past time for standardized header and html format for email. If
> anything it might secure them more...

text/enriched was created a long time ago to provide enhanced formatting
of email messages, but it faded away when the Web grew and HTML became
a de facto markup format for "enriched" text.

IMO, it is inexcusable for major software/services organizations to generate
such malformed HTML.  Dealing with malicious HTML is one thing, but when
non-malicious-generated HTML is so badly formatted (when it should not
be) it makes the lives of consumers of such content much more difficult.

--ewh



Re: Message body size limits? (Bigger Problem)

2011-12-03 Thread Tom Hutchison
Thanks for the reply

Is it possible some malformed email could be causing a parsing error? What I am 
getting at is this: if I have 250 emails in a folder, how is it that the run on 
the folder writes 260? The extra ten have a blank date and subject, sometimes 
with content and sometimes without. 

Here is what I think might be happening.

Start-
Email 1
--
Email 2
with a malformed reply quote with some header info
--
Email 3
with a malformed reply quote with some header info
--
Email 4
--end of run

When the parser reads them, is it possible Mhonarc is picking up on malformed 
reply quotes and thinks they are new emails within the actual email? So instead 
of 4 emails in the above example, it thinks there are 6. Garbage in, garbage 
out comes to mind.
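
To make the idea concrete, here is a small Python sketch of how a loose
separator test could over-count messages. This is only an assumption about
the mechanism, not mhonarc's actual parser: the two patterns, the
count_messages helper and the sample mailbox are all hypothetical.

```python
import re

# Hypothetical loose test: any line that begins with "From " starts a message.
LOOSE_SEP = re.compile(r"^From ")
# Hypothetical stricter test: "From <address> <weekday> <month> <day> ...".
STRICT_SEP = re.compile(
    r"^From \S+\s+(Mon|Tue|Wed|Thu|Fri|Sat|Sun) \w{3} [ \d]\d "
)

# Two real messages; the second has a body line that begins with "From ".
SAMPLE = (
    "From alice@example.com Sat Dec  3 10:00:00 2011\n"
    "Subject: one\n"
    "\n"
    "Hello\n"
    "\n"
    "From bob@example.com Sat Dec  3 11:00:00 2011\n"
    "Subject: two\n"
    "\n"
    "On Saturday Alice wrote:\n"
    "From what I can tell the archive is fine.\n"
)


def count_messages(mbox_text, pattern):
    """Count the lines that the given pattern treats as message separators."""
    return sum(1 for line in mbox_text.splitlines() if pattern.match(line))


loose = count_messages(SAMPLE, LOOSE_SEP)
strict = count_messages(SAMPLE, STRICT_SEP)
print(loose, strict)
```

Under these assumptions the loose test sees one more message than the
strict one, and the extra "message" would have no date or subject because
it starts in the middle of a body.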

I did solve the broken HTML, not very efficiently, with Outlook 2010, as it does 
allow for stripping all HTML code by setting the open email to “edit” then 
choosing “plain text”  after you edit anything in the body of the email. Even 
if it is just a carriage return or a space. Close the email and save on exit 
and the whole email is rewritten, stripping out all HTML and resetting the 
header information to show “text/plain” and whatever you have the encoding set 
to. Stripping out all HTML from the emails was the only way I could think of to 
solve the unclosed  attribute in quite a few emails which was causing 
problems with the msgxxx.html pages.

It’s long past time for standardized header and html format for email. If 
anything it might secure them more...

Thanks
Tom

Re: Message body size limits? (Bigger Problem)

2011-12-03 Thread Earl Hood
On Sat, Dec 3, 2011 at 4:00 PM, Tom wrote:
> Is Mhonarc not really supported any more or is this list dead?

This list is not dead, but does not get much traffic.

As for size limits, mhonarc is limited to RAM since each message
processed is loaded into memory.  Not efficient, but that is a
left-over from the initial code base.  Making things more efficient
would require a considerable rewrite of the core parsing code.
There is the -savemem option that can be used to try to reduce
the overall memory footprint (it keeps mhonarc from keeping a
handle of all new message data in RAM), but it does slow things
down due to the additional file I/O.

As for HTML processing, there is a history of performance issues
with the regexes used to strip out markup that can cause security
problems, with the possibility of much memory being consumed.
This is somewhat an artifact of the regexes and perl's regex
engine, but I lack the knowledge of perl's regex internals.
Attempts have been made over time to use regexes that
will not trigger engine problems.

IMO, HTML mail is horrendous, and the horrible lack of
conformance by major mail providers just contributes to
the security problems HTML mail presents.  The latest release
clamped down on what HTML is acceptable since XSS vulnerability
risks increased (and were proven) with previous releases.

The FAQ provides information about the security problems of HTML
mail, and how to mitigate it.

If the solutions provided are not sufficient for your needs,
mhonarc is explicitly designed to allow you to register your
own handler for HTML messages, giving you full control of how
they are processed.  You can see whether a 3rd party
HTML parsing engine can be integrated.

I am personally sick of dealing with the markup atrocities
of HTML mail, and have no incentive to do any more work
on it.  However, patches or alternate filters are always
welcome and will be considered for inclusion in mhonarc
itself if such contributions have no adverse effects.

--ewh



Re: Message body size limits? (Bigger Problem)

2011-12-03 Thread Tom
Is Mhonarc not really supported any more or is this list dead?   

Tom


On Dec 3, 2011, at 8:57 AM, Tom wrote:

> I should have stayed away from Google Groups.  I liked the feature of having 
> a forum they could post from or just email the group.  What I don't like now 
> is the horrendous HTML of their output emails from their first couple of 
> years when groups was new. Broken tags, especially " Exclusion list, don't get me started- Authenticate-x, Mail-list name, 
> Domain-key verified, etc... It was almost like they were making them up when 
> they started. 2010 and 2011 emails are a lot better. 
> 
> Ok I know they messed up a lot of things too, but Microsoft has a pretty 
> powerful email re-writing script in Outlook 2010.  You can open a message, 
> enable editing, strip out HTML, then resave it.  Seems to clean up and rewrite 
> the header a little better on the resave. 
> 
> What has me perplexed now is the 18 unknown messages.  My error log has 
> dropped significantly.  These 18 seem to be truncated copies of longer 
> emails.  I have 287 emails in the folder being read, but the run writes 305, 
> exactly 18 extra, with unknown subject and date, listed at the 
> end of my index.html page.  
> 
> Tom
> 



Re: Message body size limits? (Bigger Problem)

2011-12-03 Thread Tom
I should have stayed away from Google Groups.  I liked the feature of having a 
forum they could post from or just email the group.  What I don't like now is 
the horrendous HTML of their output emails from their first couple of years 
when groups was new. Broken tags, especially " wrote:

>> Google should be shot, maybe Yahoo too since almost every
>> single message that is blank is from Yahoo.
> 
> =v= GMail has led the way in making content more unreadable,
> (thereby breaking parsers), first by altering the message
> body to replace consecutive spaces with 8bit nonbreaking
> spaces, which of course triggers quoted-printable encoding;
> then by unnecessarily encoding a whole lot of characters,
> newlines in particular.
> 
> =v= Yahoo! Mail has for years been pretty brilliant about
> compatibility, but the last upgrade copied GMail and did
> the same thing to newlines.
> <_Jym_>
> 



Re: Message body size limits? (Bigger Problem)

2011-12-02 Thread Jym Dyer
> Google should be shot, maybe Yahoo too since almost every
> single message that is blank is from Yahoo.

=v= GMail has led the way in making content more unreadable,
(thereby breaking parsers), first by altering the message
body to replace consecutive spaces with 8bit nonbreaking
spaces, which of course triggers quoted-printable encoding;
then by unnecessarily encoding a whole lot of characters,
newlines in particular.
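
A small Python sketch of that effect, using the standard quopri module.
The sample strings and the Latin-1 charset are assumptions chosen for the
illustration; with a different charset the escape sequence would differ.

```python
import quopri

# Two ordinary spaces: every byte is printable ASCII.
plain = "two  spaces".encode("latin-1")
# Same text with the second space replaced by a nonbreaking space (0xA0).
nbsp = "two \xa0spaces".encode("latin-1")

encoded_plain = quopri.encodestring(plain)
encoded_nbsp = quopri.encodestring(nbsp)

print(encoded_plain)
print(encoded_nbsp)

# Decoding restores the original 8bit bytes.
assert quopri.decodestring(encoded_nbsp) == nbsp
```

The 7bit line passes through unchanged, while the 8bit byte has to be
written as an =XX escape, so any parser that reads the body without
decoding it first sees the escape instead of a space.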

=v= Yahoo! Mail has for years been pretty brilliant about
compatibility, but the last upgrade copied GMail and did
the same thing to newlines.
<_Jym_>



RE: Message body size limits? (Bigger Problem)

2011-12-02 Thread Tom Hutchison
I ran an older year. Google should be shot, maybe Yahoo too since almost
every single message that is blank is from Yahoo.

I turned on the error log because the process was running too fast to follow
and I couldn't scroll back far enough to see all of the warnings.

Here is a sample of what I am getting.

Warning: Rejecting Invalid HTML: Nested start tags
 Message-Id: <328830.92224...@web80202.mail.mud.yahoo.com>
 Message Subject:  Re: Crawford/Crafford/Craford/Crayford/Cranford
 Message Number: 00015

Warning: Empty body data generated:
 Message-Id: 328830.92224...@web80202.mail.mud.yahoo.com
 Message Subject:  Re: Crawford/Crafford/Craford/Crayford/Cranford
 Message Number: 00015

In a mailbox with about 300 messages, I got about 40 of these errors. As a
result, none of the message content appears on the archive's HTML pages.

Any thoughts? Help really appreciated.

Thanks

Tom