RE: Bug #2 : When one letter goes to two archived lists

1998-12-24 Thread Albert . Langer

[JB]
1) Our first pass of sorting heuristics works roughly like this:
   Examine incoming message. Take all known listnames from archives,
   and grep each against the headers of the incoming mail. As soon as
   we find a match, file the mail away. Since we go alphabetically, we
   end up matching [EMAIL PROTECTED] for both letters that bear
   To: [EMAIL PROTECTED], [EMAIL PROTECTED]

   It has been suggested that one approach might be to cross-check
   against envelope addresses or other headers.

   Cost: This particular sort occurs several hundred times a day, and
   on average requires 125 greps and takes several seconds. So it's
   already kind of expensive; I'd rather not see it get a lot more
   expensive. There is some leeway, as the time bottleneck is in
   the MHonArc archiving runs.
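
   As a rough illustration of the heuristic and the suggested
   cross-check, here is a minimal Python sketch. It is not the mailme
   code. The list names, the example.org addresses and the choice of
   Return-Path and Sender as the "per-copy" headers are all assumptions
   made for the example, since the real addresses are redacted in this
   thread.

```python
from email import message_from_string

# Hypothetical list names; the real ones are redacted in this thread.
KNOWN_LISTS = ["first_nations", "native_law"]


def first_match(raw):
    """Model of the described heuristic: try each known list name in
    alphabetical order against all headers and stop at the first hit."""
    msg = message_from_string(raw)
    header_text = "\n".join(f"{k}: {v}" for k, v in msg.items())
    for name in sorted(KNOWN_LISTS):
        if name in header_text:
            return name
    return None


def cross_checked_match(raw, per_copy_headers=("Return-Path", "Sender")):
    """Assumed variant: first look only at headers that differ per copy
    (set by the list server), then fall back to the plain heuristic."""
    msg = message_from_string(raw)
    for header in per_copy_headers:
        value = msg.get(header, "")
        for name in sorted(KNOWN_LISTS):
            if name in value:
                return name
    return first_match(raw)


# Two copies of one letter, each delivered by a different list.
TO_LINE = "To: first_nations@example.org, native_law@example.org\n"
copy_a = "Return-Path: <owner-first_nations@example.org>\n" + TO_LINE + "\nbody\n"
copy_b = "Return-Path: <owner-native_law@example.org>\n" + TO_LINE + "\nbody\n"

print(first_match(copy_a), first_match(copy_b))
print(cross_checked_match(copy_a), cross_checked_match(copy_b))
```

   With the plain heuristic both copies land in the alphabetically
   first list. The cross-check adds one more pass over a couple of
   headers per message, so the extra cost should be small next to the
   existing greps, though that would need measuring on real traffic.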

[AL]
Sorry, but following is a suggestion for re-design rather than a quick
fix.

a) I don't believe anything based on above approach is capable of
scaling to extremely large numbers of lists and messages.

b) Freely available MTA and news software is the product of major
developments to achieve scalable and reliable processing of messages
into mailboxes or outward queues or news folders, which is the key
functionality you require.

c) Such software has no effective scalability limits because MTAs can
relay to other MTAs etc before final delivery into mailboxes or
newsgroups that are ultimately processed by MHonArc. (This still allows
MHonArc to work in batch mode - processing all recently arrived messages
for the mailbox or newsgroup corresponding to a particular archive at
once).

d) So assign each email list its own archive address and subscribe it
under that address. Keep the current method of adding a list by
subscribing a generic address but then automatically generate an
individual address, subscribe under that and cancel the original
subscription.
This is the hard bit and would require software similar to what you are
already doing with respect to messages for both the generic mailbox and
the undeliverable mailbox in order to process new lists. But all the
regular messages would be handled efficiently by the MTA or news
software.

e) Further scalability plus reliability can be easily added by using
subdomains for groups of email lists and MX records to route them to
appropriate archiving boxes.

f) Future efficiency can be improved by modifying the MTA/news software
to hash the recipient name into a sort of inode number on a special file
system (or even IP address plus inode plus file system number).

g) Above approach involves simply customizing existing mail software
through special config files etc in the same way that you have
customized MHonArc. But it renders unto MTAs that which belongs to MTAs
instead of building MTA facilities into the archive customization. It
also allows for a scalable namespace of mailing lists and directory
facilities, custom searches etc that are naturally integrated with MTAs
and newsservers.
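
To make points (d) and (f) concrete, here is a small Python sketch of
routing by a per-list archive address instead of grepping headers. It
only models the idea; in the proposal the MTA itself would do this
through its own configuration. The domain archive.example.org, the list
names and the use of SHA-1 for sharding are assumptions for
illustration.

```python
import hashlib

ARCHIVE_DOMAIN = "archive.example.org"  # hypothetical domain


def archive_address(list_name):
    """Individual address generated for one list, as in point (d)."""
    return f"{list_name}@{ARCHIVE_DOMAIN}"


def route(envelope_recipient, table):
    """One dictionary lookup on the envelope recipient replaces the
    scan over every known list name."""
    return table.get(envelope_recipient.lower())


def shard_for(envelope_recipient, shards):
    """Point (f) in miniature: hash the recipient name to pick one of
    `shards` archiving boxes or storage buckets."""
    digest = hashlib.sha1(envelope_recipient.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % shards


# Hypothetical lists, each subscribed under its own address.
table = {archive_address(n): n for n in ("first_nations", "native_law")}

print(route("native_law@archive.example.org", table))
print(shard_for("native_law@archive.example.org", 8))
```

Because each copy of a cross-posted letter arrives at a different
envelope recipient, the two-list ambiguity does not arise in this
scheme, and the per-message cost no longer grows with the number of
lists.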

[BC]
2) In order to improve efficiency, if any mail has queued up in the
   inbox, we gracefully switch to a batch operation. Once the initial
   sorting and list determination is made, we also grab any other mail
   in the inbox for that particular list, and archive it all
   together. This is done via MH refile commands.

   So, for example, we get two identical letters addressed to To:
   [EMAIL PROTECTED], [EMAIL PROTECTED], one from each
   list. Let's say they arrive nearly simultaneously. The first one
   will get sorted to first_nations, possibly erroneously. (see
   above).

   Then, we will do a sweep of the inbox looking for other
   first_nations mail. The MH refile commands will grab the other
   message. Both will get refiled to first_nations.

   Cost: Cost is important in the MH refile section - this is the
   batch mode for when things get really busy. Any expense here will
   affect performance limits. On the other hand, MHonArc is
   still the bottleneck, so don't feel too constrained.
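
   The sweep described above can be modelled in a few lines of Python.
   This is only a sketch of the behaviour, not the MH refile commands
   themselves. Messages are plain dicts of header values, and the
   addresses and the Return-Path based check are assumptions for the
   example.

```python
def sweep(inbox, list_name, belongs):
    """Split the inbox into (batch for list_name, everything else)."""
    batch = [m for m in inbox if belongs(m, list_name)]
    remaining = [m for m in inbox if not belongs(m, list_name)]
    return batch, remaining


def header_mentions(msg, list_name):
    """Loose test, like the behaviour described: any header that
    mentions the list name pulls the message into the batch."""
    return any(list_name in value for value in msg.values())


def delivered_via(msg, list_name):
    """Stricter test (an assumption): only a header set per copy by
    the list server counts."""
    return list_name in msg.get("Return-Path", "")


# Hypothetical inbox holding both copies of the cross-posted letter.
TO = "first_nations@example.org, native_law@example.org"
inbox = [
    {"Return-Path": "<owner-first_nations@example.org>", "To": TO},
    {"Return-Path": "<owner-native_law@example.org>", "To": TO},
]

loose, _ = sweep(inbox, "first_nations", header_mentions)
strict, left = sweep(inbox, "first_nations", delivered_via)
print(len(loose), len(strict), len(left))
```

   The loose test pulls in both copies, which is the behaviour
   reported here. The stricter test leaves the second copy in the
   inbox for its own list, at the cost of one extra header check per
   swept message.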

[AL]
My re-design proposal above eliminates MH completely as well as
eliminating this problem.
(If based on news software rather than mail software it also enables
future enhancements for filing messages for more than one list in one
file with a symbolic or hard link to each list. Also enables easier
mirroring etc.)

[BC]
To solve this problem, it makes sense to read the code and understand
the sorting algorithm. (Not hard to do, it's short; look at the file
called mailme.) One possibility is to do checks to make sure nothing
ever gets erroneously pulled into the filter. Another possibility is
to look over the material that has been pulled into the filter and put
things back in the inbox if they are not correct. I don't know the
right solution.

[AL]
I realise above proposal does not help with immediate bug fix. But
when reading the code and understanding the algorithms it is important
to always keep in mind the option of re-design. An advantage of NOT
being involved in 

Re: Bug #2 : When one letter goes to two archived lists

1998-12-24 Thread Jeff Breidenbach


Hi Albert,

Let me repeat and agree with your statement. The essential
contribution of mail-archive.com is the concept of being able to
easily archive email lists by just adding the generic archives address
as a subscriber.

That contribution is based on a core technology - a system to
automatically sort (demultiplex) list email sent to a single
address. I can foresee similar technology being used in applications
other than mail-archive.com. For example, it might be used by
email user agents, both local and web based, to do automatic sorting
on behalf of a single human user. There may be other, as yet unforeseen
uses.

Thus, I want to polish the automatic sorting algorithm if I can, since
it is the key technology. That's why I'm willing to spend time
tackling the automatic sorting bugs.

Jeff

PS Happy holidays.