Hello, this week I worked on speeding up parsing, processing and importing emails from mboxes files into archive.
The module I chose for parsing mbox archives, Email::Folder::Mbox, is still not perfect, and its maintainers have not reviewed my first patch yet. I have therefore included the module in my tree and added more patches that speed up email processing.

My program now gets the Message-ID (used as the unique identifier) of an email at the moment it reads the next message from the mbox archive, so it no longer has to wait until full MIME processing is complete. This speeds up skipping emails that have already been processed.

Next, I changed the module used for parsing dates. The new one (Date::Format) appears to be faster than the old one (DateTime). It can also parse more of the date formats found in debian-devel emails, including several variants that violate RFC 2822.

With these changes, the time to import all emails from the debian-devel archive dropped from 23 min to 17 min. Running the program again, when it skips every email, now takes only 2.40 min (before it was 16 min).

A repeated run on the same data, which only skips emails, is better but still not ideal. Because the emails of one Debian mailing list are spread across several mbox archives, I started using the last modification time of each mbox archive. Caching the timestamp of every processed mbox file lets me skip opening the file at all if the cached timestamp is not older than the file's modification time. With this feature, a repeated run over all the already imported mbox archives takes less than one second.

--
Pali Rohár
[email protected]
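
P.S. To illustrate the timestamp cache idea, here is a minimal sketch. It is written in Python for brevity, not in the Perl of the actual importer, and every name in it (load_cache, save_cache, import_changed, the JSON cache file) is hypothetical and chosen only for this example. The real code may store and compare the timestamps differently.

```python
import json
import os


def load_cache(cache_path):
    """Return the {mbox path: mtime in ns} map from an earlier run, or {}."""
    try:
        with open(cache_path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_cache(cache_path, cache):
    """Persist the timestamp map so the next run can skip unchanged files."""
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(cache, fh)


def import_changed(mbox_paths, cache, import_mbox):
    """Call import_mbox(path) only for files modified after they were cached.

    import_mbox is a placeholder for the (slow) routine that opens and
    parses one mbox file. Returns the list of paths that were imported.
    """
    imported = []
    for path in mbox_paths:
        mtime = os.stat(path).st_mtime_ns  # stat only, the file is not opened
        cached = cache.get(path)
        if cached is not None and cached >= mtime:
            continue  # cached timestamp is not older: skip the whole file
        import_mbox(path)
        cache[path] = mtime
        imported.append(path)
    return imported
```

A run would call load_cache once, pass the resulting map to import_changed together with the list of mbox files, and call save_cache afterwards. On a repeated run over unchanged files only one stat call per mbox file remains, which is consistent with the sub-second time reported above.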
_______________________________________________
Soc-coordination mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/soc-coordination
