[JB]
1) Our first pass of sorting heuristics works roughly like this:
Examine incoming message. Take all known listnames from archives,
and grep each against the headers of the incoming mail. As soon as
we find a match, file the mail away. Since we go alphabetically, we
end up matching [EMAIL PROTECTED] for both letters that bear
To: [EMAIL PROTECTED], [EMAIL PROTECTED]
It has been suggested that one approach might be to cross-check
against envelope addresses or other headers.
Cost: This particular sort occurs several hundred times a day, and
on average requires 125 greps and takes several seconds. So it's
already kind of expensive; I'd rather not see it get a lot more
expensive. There is some leeway, as the time bottleneck is in
the MHonArc archiving runs.
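
A minimal sketch of the suggested cross-check, not the actual mailme
code. It assumes that the list which delivered a copy shows up in a
delivery-related header (Delivered-To, Sender or X-Mailing-List here;
which headers a given list really sets is an assumption), and it falls
back to To/Cc only when none of those match. The names pick_list,
candidate_names and second_list are hypothetical. Addresses are parsed
once and looked up in a set, so the cost does not grow with one grep
per known list.

```python
from email import message_from_string
from email.utils import getaddresses

# Assumption: these headers tend to name the list that delivered this copy.
DELIVERY_HEADERS = ["Delivered-To", "Sender", "X-Mailing-List"]
# Fallback: headers that name every addressee, including other lists.
RECIPIENT_HEADERS = ["To", "Cc"]


def local_parts(msg, header_names):
    """Return the lowercased local part of every address in the headers."""
    values = []
    for name in header_names:
        values.extend(msg.get_all(name, []))
    parts = []
    for _display, addr in getaddresses(values):
        if "@" in addr:
            parts.append(addr.split("@", 1)[0].lower())
    return parts


def candidate_names(local):
    """Map owner-style addresses (owner-foo, foo-owner) back to 'foo'."""
    names = {local}
    if local.startswith("owner-"):
        names.add(local[len("owner-"):])
    if local.endswith("-owner"):
        names.add(local[:-len("-owner")])
    return names


def pick_list(raw_message, known_lists):
    """Pick the list for one message, preferring delivery headers."""
    msg = message_from_string(raw_message)
    known = {name.lower() for name in known_lists}
    for headers in (DELIVERY_HEADERS, RECIPIENT_HEADERS):
        for local in local_parts(msg, headers):
            hit = candidate_names(local) & known
            if hit:
                return sorted(hit)[0]
    return None
```

With this ordering, a copy addressed to both lists but carrying
Sender: owner-second_list@... would be filed under second_list rather
than under whichever list name sorts first.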
[AL]
Sorry, but the following is a suggestion for a re-design rather than a
quick fix.
a) I don't believe anything based on the above approach is capable of
scaling to extremely large numbers of lists and messages.
b) Freely available MTA and news software is the product of major
developments to achieve scalable and reliable processing of messages
into mailboxes, outbound queues or news folders, which is the key
functionality you require.
c) Such software has no effective scalability limits because MTAs can
relay to other MTAs etc before final delivery into mailboxes or
newsgroups that are ultimately processed by MHonArc. (This still allows
MHonArc to work in batch mode - processing all recently arrived messages
for the mailbox or newsgroup corresponding to a particular archive at
once).
d) So assign each email list its own archive address and subscribe it
under that address. Keep the current method of adding a list by
subscribing a generic address but then automatically generate an
individual address, subscribe under that and cancel the original
subscription.
This is the hard bit and would require software similar to what you are
already doing with respect to messages for both the generic mailbox and
the undeliverable mailbox in order to process new lists. But all the
regular messages would be handled efficiently by the MTA or news
software.
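
As a rough illustration of the "automatically generate an individual
address" step in d), here is a sketch that derives one archive address
per list from the list's own posting address. The function name
archive_address, the domain archive.example.org and the slug format are
all hypothetical; a real setup would also have to respect the MTA's
limits on local-part length and handle the subscribe/unsubscribe
exchange, which is not shown.

```python
import re


def archive_address(list_address, archive_domain="archive.example.org"):
    """Derive a per-list archive address from a list's posting address.

    The list's local part and host are folded into one slug so that two
    lists with the same name on different hosts get different addresses.
    """
    local, _, host = list_address.lower().partition("@")
    slug = re.sub(r"[^a-z0-9]+", "-", f"{local}-{host}").strip("-")
    return f"{slug}@{archive_domain}"
```

Because every list then has its own recipient address, the MTA can
deliver each message straight to that list's mailbox and no header
matching is needed for regular traffic.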
e) Further scalability plus reliability can be easily added by using
subdomains for groups of email lists and MX records to route them to
appropriate archiving boxes.
f) Future efficiency can be improved by modifying the MTA/news software
to hash the recipient name into a sort of inode number on a special file
system (or even IP address plus inode plus file system number).
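
The hashing idea in f) can be approximated without a special file
system. The sketch below is only an analogy, assuming a two-level
directory fan-out keyed on a SHA-1 digest of the recipient name; the
function name bucket_path and the /archive root are hypothetical, and
it stands in for, rather than implements, the inode scheme described.

```python
import hashlib
from pathlib import PurePosixPath


def bucket_path(recipient, root="/archive"):
    """Map a recipient name to a stable two-level directory bucket."""
    name = recipient.lower()
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    # The first two digest bytes pick one of 256 x 256 buckets.
    return PurePosixPath(root) / f"{digest[0]:02x}" / f"{digest[1]:02x}" / name
```

The same recipient always lands in the same bucket, so a delivery agent
can compute the destination directly instead of searching for it.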
g) The above approach involves simply customizing existing mail software
through special config files etc in the same way that you have
customized MHonArc. But it renders unto MTAs that which belongs to MTAs
instead of building MTA facilities into the archive customization. It
also allows for a scalable namespace of mailing lists and directory
facilities, custom searches etc that are naturally integrated with MTAs
and newsservers.
[BC]
2) In order to improve efficiency, if any mail has queued up in the
inbox, we gracefully switch to a batch operation. Once the initial
sorting and list determination is made, we also grab any other mail
in the inbox for that particular list, and archive it all
together. This is done via MH refile commands.
So, for example, we get two identical letters addressed To:
[EMAIL PROTECTED], [EMAIL PROTECTED], one from each
list. Let's say they arrive nearly simultaneously. The first one
will get sorted to first_nations, possibly erroneously (see
above).
Then, we will do a sweep of the inbox looking for other
first_nations mail. The MH refile commands will grab the other
message. Both will get refiled to first_nations.
Cost: Cost is important in the MH refile section - this is the
batch mode for when things get really busy. Any expense here will
affect performance limits. On the other hand, MHonArc is
still the bottleneck, so don't feel too constrained.
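
To make the failure mode concrete, here is a guess at the shape of the
sweep, written as a sketch rather than taken from mailme. It assumes
the sweep pulls every queued message whose headers mention the chosen
list name anywhere; the in-memory inbox of (message_id, header_text)
pairs and the name sweep_for_list are hypothetical stand-ins for the MH
folder and the refile commands.

```python
def sweep_for_list(inbox, list_name):
    """Split the inbox into messages mentioning list_name and the rest.

    inbox: list of (message_id, header_text) pairs (hypothetical form).
    Returns (pulled, remaining) as two lists of message ids.
    """
    needle = list_name.lower()
    pulled, remaining = [], []
    for message_id, header_text in inbox:
        # A plain substring test, like a grep over the headers.
        if needle in header_text.lower():
            pulled.append(message_id)
        else:
            remaining.append(message_id)
    return pulled, remaining
```

Under that assumption, both copies of a letter addressed to two lists
mention first_nations in To:, so both are pulled into the first_nations
batch, including the copy that arrived through the other list.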
[AL]
My re-design proposal above eliminates MH completely as well as
eliminating this problem.
(If based on news software rather than mail software it also enables
future enhancements for filing messages for more than one list in one
file with a symbolic or hard link to each list. Also enables easier
mirroring etc.)
[BC]
To solve this problem, it makes sense to read the code and understand
the sorting algorithm. (Not hard to do, it's short; look at the file
called mailme.) One possibility is to do checks to make sure nothing
ever gets erroneously pulled into the filter. Another possibility is
to look over the material that has been pulled into the filter and put
things back in the inbox if they are not correct. I don't know the
right solution.
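
The second possibility (look over what was pulled and put back
anything that does not belong) might look like the sketch below. It
assumes some way of telling which list delivered a given copy, passed
in as the hypothetical callable delivered_via; the batch is a list of
(message_id, headers) pairs, which stands in for the MH folder
contents.

```python
def verify_batch(batch, list_name, delivered_via):
    """Split a pulled batch into messages to keep and messages to put back.

    batch: list of (message_id, headers) pairs (hypothetical form).
    delivered_via: callable mapping headers to a list name or None.
    """
    keep, put_back = [], []
    for message_id, headers in batch:
        if delivered_via(headers) == list_name:
            keep.append(message_id)
        else:
            # Not demonstrably from this list: return it to the inbox.
            put_back.append(message_id)
    return keep, put_back
```

Messages in put_back would be refiled to the inbox and sorted again on
the next pass, so the check adds one extra header lookup per pulled
message rather than another round of greps.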
[AL]
I realise the above proposal does not help with the immediate bug fix. But
when reading the code and understanding the algorithms it is important
to always keep in mind the option of re-design. An advantage of NOT
being involved in