Hello,

this week I fixed the last problems with parsing and processing 
mbox archives containing MIME emails from all Debian archives. As 
a result, the archiving program can now process all emails from 
all Debian mailing list archives.

There was a problem with processing mixed mboxcl/mboxrd archives 
read from a pipe, because the internal Email::Folder::Mbox module 
uses the seek function. I fixed this by introducing an in-memory 
cache for non-seekable filehandles. Seeks are used only for 
backward/fallback reading, so a cache of previous lines is enough.
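
To illustrate the idea of such a cache, here is a minimal sketch. It 
is not the actual change to Email::Folder::Mbox; the LineCache 
package, its method names and the default limit of 100 lines are 
made up for this example. It remembers recently read lines so that 
a caller can step back without ever calling seek on the handle:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Email::Folder::Mbox: wraps any
# filehandle (including a pipe) and remembers recently read lines so
# the caller can step back without calling seek().
package LineCache;

sub new {
    my ($class, $fh, $max) = @_;
    my $self = { fh => $fh, max => $max || 100, seen => [], pending => [] };
    return bless $self, $class;
}

# Return the next line: lines that were stepped back over are replayed
# first, then fresh lines are read from the underlying handle.
sub next_line {
    my ($self) = @_;
    my $line = @{ $self->{pending} }
        ? shift @{ $self->{pending} }
        : readline $self->{fh};
    return unless defined $line;
    push @{ $self->{seen} }, $line;
    # Keep only the last $max lines in memory.
    shift @{ $self->{seen} } while @{ $self->{seen} } > $self->{max};
    return $line;
}

# Step back over the last $count lines so they are returned again.
# Returns the number of lines actually rewound.
sub rewind_lines {
    my ($self, $count) = @_;
    $count = @{ $self->{seen} } if $count > @{ $self->{seen} };
    unshift @{ $self->{pending} }, splice(@{ $self->{seen} }, -$count)
        if $count;
    return $count;
}

package main;

# An in-memory handle stands in for a pipe here.
my $data = "From a\nbody 1\nFrom b\n";
open my $fh, '<', \$data or die "open failed: $!";

my $reader = LineCache->new($fh, 10);
my @first  = map { $reader->next_line } 1 .. 2;   # read two lines
$reader->rewind_lines(1);                         # step back one line
my $again  = $reader->next_line;                  # same line once more
my $third  = $reader->next_line;                  # continues normally
print $again, $third;
```

The cache only has to be as deep as the furthest step back the 
parser ever makes, so memory use stays small even for large 
archives.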

Another very big problem was processing emails that have too many 
recipient addresses in the From, To or Cc headers, where too many 
means >= 7. An email address with a name can contain comments in 
these headers, and for an unknown reason the regular expressions 
in the Email::Address module (used for parsing these headers) are 
too slow. Some To headers with a lot of spaces and brackets took 
more than 10 minutes to parse, which is not usable. It looks like 
the problem is caused by support for nested comments (which the 
RFCs allow). The module has a special variable, 
COMMENT_NEST_LEVEL, for setting the nesting level, but it is 
ignored once the module has been loaded.

But I finally found a way to use that variable (without needing 
to change the source code of the module). Disabling nested 
comments in mail addresses makes the module parse those headers 
immediately (without the 10 minute delay).

I did not find any documentation on how to use COMMENT_NEST_LEVEL, 
but in case somebody has the same problem: instead of the usual 
"use Email::Address;" you need to call require, and set the 
variable manually before that. Something like this works:

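BEGIN {
        # Set the nest level before the module is loaded; it is
        # ignored once the module has been loaded.
        local $Email::Address::COMMENT_NEST_LEVEL = 1;
        require Email::Address;
        Email::Address->import;
}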
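
Why the order matters can be shown with a small stand-in module. 
This is only a sketch under an assumption: I assume the module 
reads the variable once while it is being loaded, which is what the 
behaviour described above suggests. Demo::Parser, NEST_LEVEL and 
BUILT_WITH are hypothetical names and are not part of 
Email::Address:

```perl
use strict;
use warnings;
no warnings 'once';

# Hypothetical stand-in module (NOT Email::Address itself). It reads
# $NEST_LEVEL once, while loading, like a module that builds its
# regexes at load time would.
my $module_source = <<'END_MODULE';
package Demo::Parser;
our $NEST_LEVEL;
$NEST_LEVEL ||= 2;               # default when the caller set nothing
our $BUILT_WITH = $NEST_LEVEL;   # stands in for the compiled regexes
1;
END_MODULE

# Serve the stand-in from memory, so the example needs no extra files.
unshift @INC, sub {
    my (undef, $file) = @_;
    return unless $file eq 'Demo/Parser.pm';
    open my $fh, '<', \$module_source or die "open failed: $!";
    return $fh;
};

{
    local $Demo::Parser::NEST_LEVEL = 1;   # visible while the module loads
    require Demo::Parser;
}

# Changing the variable now is too late, the value was already consumed.
$Demo::Parser::NEST_LEVEL = 5;

print "built with nest level $Demo::Parser::BUILT_WITH\n";
```

The value set before require is the one that gets used, and the 
later assignment changes nothing. A plain "use" followed by an 
assignment has the same problem, because "use" loads the module at 
compile time, before any ordinary assignment runs.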

Current performance of the archiver:

Processing and archiving every mbox file into the correct archive 
(about 27GB) takes 302 minutes.

Calling the script again (when all mbox archives are already 
imported and there are no new emails) takes about 30 seconds, 
which is quite good. So incremental import should be fast enough, 
as it skips mbox files which were not modified.
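
A check for unmodified files could look roughly like this. It is 
only a sketch and not the archiver's real code: the needs_import 
function and the idea of comparing size and mtime are assumptions 
made for this example, and a real script would have to save %state 
between runs:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns true when $path changed since it was
# last seen, and records its current size and mtime in %$state.
sub needs_import {
    my ($path, $state) = @_;
    my ($size, $mtime) = (stat $path)[7, 9];
    die "cannot stat $path: $!" unless defined $mtime;
    my $stamp = "$size:$mtime";
    return 0 if defined $state->{$path} && $state->{$path} eq $stamp;
    $state->{$path} = $stamp;
    return 1;
}

my %state;    # a real archiver would persist this between runs

# Create a small temporary mbox file to try the helper on.
my ($tmp, $mbox) = tempfile(UNLINK => 1);
print {$tmp} "From alice\@example.org Thu Jan  1 00:00:00 1970\n\nhello\n";
close $tmp or die "close failed: $!";

my $first  = needs_import($mbox, \%state);   # never seen: import
my $second = needs_import($mbox, \%state);   # unchanged: skip

# Appending a new message changes the size, so the file is picked up.
open my $out, '>>', $mbox or die "open failed: $!";
print {$out} "From bob\@example.org Thu Jan  1 00:00:01 1970\n\nhi\n";
close $out or die "close failed: $!";
my $third = needs_import($mbox, \%state);

print "first=$first second=$second third=$third\n";
```

With a check like this, a run without new emails costs one stat 
call per mbox file instead of a full parse.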

-- 
Pali Rohár
[email protected]

_______________________________________________
Soc-coordination mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/soc-coordination
