Hello,

this week I worked on speeding up the parsing, processing and 
importing of emails from mbox files into the archive.

My choice for parsing mbox archives, the Email::Folder::Mbox 
module, is still not perfect, and its maintainers have not 
reviewed my first patch yet. I have included the module in my 
tree and added more patches that speed up email processing.

My program now gets the Message-ID (used as the unique 
identifier) of each email as soon as it reads the next message 
from the mbox archive, and does not have to wait until full MIME 
processing is complete. This speeds up skipping emails that have 
already been processed.
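
The idea can be sketched in Python as follows. This is only an 
illustration, not the Perl code from my tree: the function name 
iter_message_ids and the `seen` set are hypothetical, and the 
sketch assumes "From " separator lines that follow a blank line 
(or start the file). Only the header block of each message is 
parsed; bodies are skipped without any MIME processing.

```python
import io
from email.parser import BytesHeaderParser

BLANK = (b"\n", b"\r\n")


def iter_message_ids(stream):
    """Yield the Message-ID of each message in an mbox byte stream.

    Only header lines are collected and parsed; body lines are
    read past without being parsed.
    """
    parser = BytesHeaderParser()
    headers = None     # list of header lines while inside a header block
    prev_blank = True  # a "From " separator follows a blank line or file start
    for line in stream:
        if line.startswith(b"From ") and prev_blank:
            headers = []  # new message: start collecting its headers
        elif headers is not None:
            if line in BLANK:
                # End of the header block: parse just the headers.
                yield parser.parsebytes(b"".join(headers))["Message-ID"]
                headers = None  # the body follows; ignore it
            else:
                headers.append(line)
        prev_blank = line in BLANK
    if headers:
        # Last message had headers but no body.
        yield parser.parsebytes(b"".join(headers))["Message-ID"]


# Usage: decide what to skip before any full parsing happens.
sample = (
    b"From a@example.org Mon Jan  2 00:00:00 2012\n"
    b"Message-ID: <1@example.org>\n"
    b"Subject: one\n"
    b"\n"
    b"body one\n"
    b"\n"
    b"From b@example.org Mon Jan  2 00:00:01 2012\n"
    b"Subject: two\n"
    b"Message-ID: <2@example.org>\n"
    b"\n"
    b"body two\n"
)
seen = {"<1@example.org>"}  # hypothetical set of already imported IDs
new_ids = [m for m in iter_message_ids(io.BytesIO(sample)) if m not in seen]
```

Messages whose ID is already in `seen` never reach the expensive 
MIME step, which is where the time is saved.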

Next, I changed the module for parsing dates (now using 
Date::Format), which appears to be faster than the old one 
(DateTime). The new module can also parse more of the date 
formats found in debian-devel emails (various variants that 
violate RFC 2822).
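
To illustrate what tolerating non-conforming dates can look 
like, here is a small Python sketch. It does not reproduce the 
behaviour of the Perl module: the fallback formats below are 
hypothetical examples, and treating a fallback date without a 
zone as UTC is an assumption of the sketch.

```python
from datetime import datetime, timezone
from email.utils import mktime_tz, parsedate_tz

# Hypothetical fallback formats for headers the RFC-style parser rejects.
FALLBACK_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d.%m.%Y %H:%M")


def parse_date_header(value):
    """Return an aware UTC datetime for a Date header, or None."""
    parsed = parsedate_tz(value)  # lenient RFC 2822-style parser
    if parsed is not None:
        # mktime_tz applies the parsed offset; if the header has no
        # zone at all, it interprets the time as local time.
        return datetime.fromtimestamp(mktime_tz(parsed), tz=timezone.utc)
    for fmt in FALLBACK_FORMATS:
        try:
            naive = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # try the next format
        # Assumption: no zone information means UTC.
        return naive.replace(tzinfo=timezone.utc)
    return None


ok = parse_date_header("Fri, 15 Jun 2012 10:30:00 +0200")
odd = parse_date_header("2012-06-15 10:30:00")
bad = parse_date_header("not a date")
```

Returning None instead of raising lets the importer decide what 
to do with a message whose date cannot be recovered.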

With these changes, the time to import all emails from the 
debian-devel archive decreased from 23 min to 17 min. Running 
the program again (when it skips all emails) now takes only 
2.40 min (before it was 16 min).

A repeated run on the same data (which only skips all emails) is 
better, but still not ideal. Because Debian emails from one 
mailing list are stored in several mbox archives, I started 
using the last modification time of each mbox archive. Caching 
the timestamp of each processed mbox file allows me to skip 
opening the mbox file entirely if the cached timestamp is not 
older than the file's modification time. After implementing this 
feature, a repeated run over all already imported mbox archives 
takes less than one second.
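
A minimal Python sketch of such a timestamp cache is below. It 
is not the actual implementation: the function names, the JSON 
cache file and the `process` callback are all assumptions made 
for the example.

```python
import json
import os


def load_cache(cache_path):
    """Load the {mbox path: mtime in ns} mapping, or start empty."""
    try:
        with open(cache_path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_cache(cache_path, cache):
    """Persist the mapping so that the next run can use it."""
    with open(cache_path, "w") as fh:
        json.dump(cache, fh)


def process_changed(paths, cache, process):
    """Call process(path) for each mbox changed since it was cached.

    Returns the list of paths that were actually processed.
    """
    done = []
    for path in paths:
        mtime = os.stat(path).st_mtime_ns
        if cache.get(path, -1) >= mtime:
            continue  # cached timestamp is not older: skip without opening
        process(path)
        # Store the mtime read before processing, so that a write
        # during processing is picked up by the next run.
        cache[path] = mtime
        done.append(path)
    return done
```

With this approach an unchanged mbox costs one stat() call 
instead of a full read of the file.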

-- 
Pali Rohár
[email protected]


_______________________________________________
Soc-coordination mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/soc-coordination
