Persistent message URIs and a mid redirector, was: Re: [Mailman-Developers] Requirements for a new archiver

2003-11-02 Thread Chris Croome
Hi

Apologies for butting in without having had the time to properly follow
the thread...

For me the thing that I hate most about the current mailman web
archives is the lack of persistent URIs: you open an mbox to edit out
someone's phone number that they sent to a public list by mistake, and
after you have rebuilt the archives most message URIs have changed and,
as a result, dozens of carefully constructed wiki pages referencing
these emails are broken :-(

If a new archive only resulted in persistent URIs for messages I'd
be happy. I guess people know this classic?

  Cool URIs don't change
  http://www.w3.org/Provider/Style/URI

Also, are people aware of the neat hack that the W3C uses, whereby
you can get to any message in their list archives with a URI like
this:

  http://www.w3.org/mid/$MID

And in addition each outgoing message from their list server has
this URI in the header, for example:

 X-Archived-At: http://www.w3.org/mid/[EMAIL PROTECTED]

This header is added with a procmail rule:

  http://groups.yahoo.com/group/rss-dev/message/3163
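
For illustration, here is a minimal sketch of how an archiver or list server might derive such a URI and header from a Message-ID. It is not the W3C's procmail rule; the helper names are hypothetical, and percent-encoding the ID is an assumption of this sketch.

```python
from email.message import Message
from urllib.parse import quote

# URI pattern quoted above; a site would substitute its own archive base.
MID_BASE = "http://www.w3.org/mid/"


def mid_uri(message_id: str, base: str = MID_BASE) -> str:
    """Build a persistent URI from a Message-ID (hypothetical helper)."""
    # Drop surrounding whitespace and the angle brackets of the header value.
    bare = message_id.strip().strip("<>")
    # Percent-encode anything that is not safe inside a single path segment.
    return base + quote(bare, safe="@")


def add_archived_at(msg: Message) -> None:
    """Add X-Archived-At, derived from the message's own Message-ID."""
    if msg["Message-ID"] and not msg["X-Archived-At"]:
        msg["X-Archived-At"] = mid_uri(msg["Message-ID"])
```

Because the URI depends only on the Message-ID, rebuilding the archive would not change it.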

Chris

-- 
Chris Croome   [EMAIL PROTECTED]
web design http://www.webarchitects.co.uk/ 
web content management   http://mkdoc.com/   

___
Mailman-Developers mailing list
[EMAIL PROTECTED]
http://mail.python.org/mailman/listinfo/mailman-developers


Re: Efficient final message disposition (was Re: [Mailman-Developers] Requirements for a new archiver)

2003-10-31 Thread Brad Knowles
At 8:20 PM -0500 2003/10/30, J C Lawrence wrote:

 While I don't disagree, this is really an MTA's job, not Mailman's.
 This is why I've been doing log analysis of MXes and routing mail to
 customised outbound MTAs on the basis of responsiveness, since early
 2000.  Adaptive MX routing is great stuff.

	There is a need for this function, and no MTA available today
does it.  MLMs throughout the history of the Internet have 
incorporated a variety of features for SMTP performance enhancement 
that are unique to mailing lists or are usually found primarily in 
mailing lists, and this is no different.

	If you want to externalize all these functions outside of 
mailman, that's fine.  But then someone has to pick up the ball and 
start hacking on bulk_mailer or some other program to provide these 
features.

 Yup.  I did it at the first level with an initial SMTP proxy which
 routed based on MX response records pulled from a DB.

	Again, this is a feature which is not found on any MTA available
today, and which is known to have a huge impact on mailing list 
performance.  This feature needs to be provided somewhere, by someone.

 I'm generally of the view that Mailman should do opportunistic domain
 sorting and per-MTA customised VERP handoffs (because nobody has
 standardised VERP across MTAs), and beyond that to back off.  Mailman's
 job is to get the outbound mail into the MTA's spool as quickly as
 possible, wrapped in transactions (ie RCPT TO bundles) that are friendly
 to efficient processing, and that's it.

	If you go back to Barry's message, he was talking about getting
even further involved, by doing a mail-merge process.  Since there is 
no MMTP (something that Bryan Costales, Eric Allman, and I had worked 
on for a while, before we realized that it would just make the spam 
problem worse and then dropped all further efforts), there is a need 
for an intermediate program that is called by mailman and then hands 
the messages off to the MTA.

	Either that intermediate program can be provided by mailman 
itself, or it can come from a third party.  But it needs to come from 
somewhere.

 We're not in the game of second guessing the MTAs.  That way lies wasted
 time and madness.

	If there were MLTAs which were optimized for this function, I
would agree with you.  Since we're trying to take standard MTAs which 
may have only some optimizations that might be generally applicable 
to most situations (including mailing lists), I must disagree.

	For the mailing list specific optimizations that we know are not 
provided by many common MTAs or MTA versions, we need to perform 
those optimizations before the message gets to the MTA.

	We also need to be able to selectively turn them off, in the case 
that there are MTAs that can do that specific job themselves and 
don't need our interference.

 Where Mailman's performance hurts is in the handling of the list
 configs, especially for lists with very large memberships rosters and in
 queue runner performance and overhead (try watching queue runner's
 system resource profile in v2.1 for lists with  50,000 members).  For
 me those are the obvious low hanging fruit,

	You should definitely go after the low-hanging fruit when you
can.  However, you also have to consider how much work would go into 
fixing those problems.

	A high priority item that would require re-engineering the entire 
system is something that should be planned for the long term, perhaps 
in conjunction with other things that would likewise require 
significant re-engineering efforts as well.

	Meanwhile, if there are other performance issues that can be 
addressed which do not require such significant re-engineering, those 
should be given serious consideration in the shorter term.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: Efficient final message disposition (was Re: [Mailman-Developers] Requirements for a new archiver)

2003-10-31 Thread J C Lawrence
On Fri, 31 Oct 2003 16:04:43 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:
 At 8:20 PM -0500 2003/10/30, J C Lawrence wrote:

 While I don't disagree, this is really an MTA's job, not Mailman's.
 This is why I've been doing log analysis of MXes and routing mail to
 customised outbound MTAs on the basis of responsiveness, since early
 2000.  Adaptive MX routing is great stuff.

 There is a need for this function, and no MTA available today does it.
 MLMs throughout the history of the Internet have incorporated a
 variety of features for SMTP performance enhancement that are unique
 to mailing lists or are usually found primarily in mailing lists, and
 this is no different.

True.  It's not a very difficult process, and is absurdly expensive the
way I handle it.  At some point in my copious spare time I should whack
another couple of config tokens into Exim, just to up the ante.

 If you want to externalize all these functions outside of mailman,
 that's fine.  But then someone has to pick up the ball and start
 hacking on bulk_mailer or some other program to provide these
 features.

Aye, but some care should be taken here defining who the people are,
between the Good-For-Mailman, and Good-For-Large-Mail-Systems camps.
They're related, but not synonymous.

 Yup.  I did it at the first level with an initial SMTP proxy which
 routed based on MX response records pulled from a DB.

 Again, this is a feature which is not found on any MTA available
 today, and which is known to have a huge impact on mailing list
 performance.  This feature needs to be provided somewhere, by someone.

True.

 If you go back to Barry's message, he was talking about getting
 even further involved, by doing a mail-merge process.  Since there is
 no MMTP (something that Bryan Costales, Eric Allman, and I had worked
 on for a while, before we realized that it would just make the spam
 problem worse and then dropped all further efforts), there is a need
 for an intermediate program that is called by mailman and then hands
 the messages off to the MTA.

<nod>

Mailmerge and VERP customisation, and the standards for the
communication of those things to the MTA are areas that need attention,
both for Mailman and the rest of the market (tho the IronPort and
related guys might argue).  This would be a good point to get some
cross-MTA discussion going on.
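
As background for the VERP part of this, here is a rough sketch of the encoding itself: the recipient is folded into the envelope sender so that a bounce identifies the subscriber. The `+user=domain` layout is one common convention and is assumed here; the function names are hypothetical and no particular MTA's handoff format is implied.

```python
def verp_encode(bounce_local: str, list_host: str, recipient: str) -> str:
    """Fold the recipient address into a per-recipient envelope sender."""
    mailbox, _, host = recipient.rpartition("@")
    return f"{bounce_local}+{mailbox}={host}@{list_host}"


def verp_decode(envelope_sender: str) -> str:
    """Recover the original recipient from a VERP-encoded envelope sender."""
    local, _, _ = envelope_sender.rpartition("@")
    # Everything after the first "+" is the encoded recipient
    # (assumes the bounce local part itself contains no "+").
    _, _, encoded = local.partition("+")
    # The last "=" separates mailbox from host, since hosts never contain "=".
    mailbox, _, host = encoded.rpartition("=")
    return f"{mailbox}@{host}"
```

Every recipient then needs its own envelope, which is why the cost lands on whichever side of the MLM/MTA boundary does the rewriting.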

 We're not in the game of second guessing the MTAs.  That way lies
 wasted time and madness.

 If there were MLTAs which were optimized for this function, 

IIRC QMail has a (typically DJB) VERP/rewrite handoff method.  I also
recall that it is very bound into QMail's process and IO model, but
perhaps this should be examined?

 I would agree with you.  Since we're trying to take standard MTAs
 which may have only some optimizations that might be generally
 applicable to most situations (including mailing lists), I must
 disagree.

There's that audience problem again.  I actually agree with you in the
general case, and am willing to spend time and effort in that direction.
However I see this as somewhat disjoint from Mailman in specific.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-30 Thread Barry Warsaw
On Thu, 2003-10-30 at 00:08, J C Lawrence wrote:

 Hang-on.  Apache isn't the target.  Mailman's UI is a CGI app.  As such
 it works with any web server that supports CGI-bin, which pretty much
 means any web server with no exceptions.  That's a pretty large gain,
 especially in the novice admin or simple deployment case territory.

Sure, but I suspect that plumbing Mailman out to http will be just a
proxy rule away from integrating with an existing web server.  That's
not without its headaches too, but should be as widely supported.

 Doing our own thing for HTTP handling can quickly be another Pandora's
 box, security concern, and integration problem for the (majority of)
 people who do want to run Apache/Boa/Thttpd/Zeus/etc.

We do need to worry about the security of the http framework (e.g.
Twisted), but past that, it's still our responsibility.  I mostly see
this as a thin veneer between the web and the core logic for Mailman. 
Wanna use CGI?  I suspect it's just a little extra glue.  Same goes for
mod_python or whatever.

  An approach like Exim + elspy affords some really cool possibilities.
 
 Absolutely, but that is outside of Mailman's territory.

Definitely for now, that's for sure.  I don't want to write it off
completely, but we need to be practical too.

 More interesting would be things like TMDA integration, or implementing
 support for Yakov Shafranovich extension of my consent token protocol:
 
   http://www.ietf.org/internet-drafts/draft-irtf-asrg-cri-00.txt
 
 Getting early buy-in as a sample implementation for an MLM wouldn't be a
 Bad Thing.  There's a lot of really neat and useful integration and
 feature set territory to explore before you start staring down the MTA's
 throat.

Sure.  I just skimmed the CRI draft, but here are some questions (hmm, if
you answer this please start a new thread).  If you send 10 messages to
a list within 10 minutes and I've never heard of you before, should I
send you 10 challenges or one?  If I send you 10, should I consider a
response to any one of them good enough to free all 10 posts?  Also,
isn't any CRI system going to have to have mail bomb defenses?

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-30 Thread J C Lawrence
On Thu, 30 Oct 2003 09:15:35 -0500 
Barry Warsaw [EMAIL PROTECTED] wrote:
 On Thu, 2003-10-30 at 00:08, J C Lawrence wrote:

 Hang-on.  Apache isn't the target.  Mailman's UI is a CGI app.  As
 such it works with any web server that supports CGI-bin, which pretty
 much means any web server with no exceptions.  That's a pretty large
 gain, especially in the novice admin or simple deployment case
 territory.

 Sure, but I suspect that plumbing Mailman out to http will be just a
 proxy rule away from integrating with an existing web server.  That's
 not without its headaches too, but should be as widely supported.

Considerably more web servers support CGI-bin than support proxy rules.

-- 
J C Lawrence


Efficient final message disposition (was Re: [Mailman-Developers] Requirements for a new archiver)

2003-10-30 Thread Barry Warsaw
Ok, I'm beat up enough, so let me open things up to a hopefully more
productive thread.  How can Mailman more efficiently hand off messages
to a local mail server for final delivery?

Some problems with the current approach include:

- The desire/requirement that Mailman chunk and sort recipients

- The ability for Mailman to swamp the mail server or cause the mail
server to consume all available cpu

- The fact that failures in upstream mail server are reported to Mailman
as bounces instead of as error codes

- Inefficiencies in VERP/personalization/mail-merge because of the lack
of cooperation

- The need for Mailman to queue outgoing messages that aren't completely
delivered
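
To make the first item concrete, here is a rough sketch of chunking and sorting recipients by domain. It is an illustration only, not the logic in SMTPDirect.py; the function name and the default chunk size are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def chunk_by_domain(recipients: Iterable[str], max_rcpts: int = 50) -> List[List[str]]:
    """Group recipients by domain, then split each group into RCPT TO bundles."""
    by_domain: Dict[str, List[str]] = defaultdict(list)
    for addr in recipients:
        # Domains are case-insensitive, so normalise before grouping.
        domain = addr.rpartition("@")[2].lower()
        by_domain[domain].append(addr)

    chunks: List[List[str]] = []
    for domain in sorted(by_domain):
        group = by_domain[domain]
        # Each chunk holds a single domain and at most max_rcpts addresses.
        for i in range(0, len(group), max_rcpts):
            chunks.append(group[i:i + max_rcpts])
    return chunks
```

Each chunk would then go to the mail server as one SMTP transaction.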

I'm sure you guys can identify more issues <wink>.  Look at the
complexity in SMTPDirect.py, and even there, we still have problems.

So how do we design a system where we can push the complexity and
efficiency concerns out past our boundary?  Here's a rough sketch of
what I'd like:

Mailman has a list of recipients, or at least knows how to calculate
that list.  It has a message template as encoded 7-bit ascii.  It has a
dictionary (association table, hash table) of substitution placeholders
to values for each recipient, or knows how to calculate that.

Mailman wants to simply hand that data off to some agent and forget
about it.  It wants to know that the agent will make best effort to mail
merge and deliver.  It wants to be informed of any final delivery
failures.  And that's it.  Mailman doesn't want to chunkify recipients,
and it doesn't want to sort them.  It doesn't want to worry about a mail
server effectively managing system resources.  I'd rather not have to
hand it a couple of meg of recipient or substitution data, but there
seems to be no other way.
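
A minimal sketch of the data described above and of the merge the agent would perform: a template plus a per-recipient substitution table. The function name and the `$placeholder` syntax (Python's `string.Template`) are assumptions of this sketch, not an agreed interface.

```python
from string import Template
from typing import Dict, Iterator, Tuple


def mail_merge(template: str,
               substitutions: Dict[str, Dict[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (recipient, personalised message text) pairs."""
    tmpl = Template(template)
    for recipient, values in substitutions.items():
        # safe_substitute leaves unknown placeholders intact instead of raising.
        yield recipient, tmpl.safe_substitute(values)
```

Since this is a generator, the agent could produce one message at a time and would not need every merged copy in memory at once.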

So what can we do here to improve matters?

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-30 Thread J C Lawrence
On Thu, 30 Oct 2003 07:04:19 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:
 At 12:40 AM -0500 2003/10/30, J C Lawrence wrote:

 I've already said my bits there and proposed what I see as the cheap,
 easy, incremental improvement course: Twisted's NNTP support for
 storage, Message IDs for keys, a variant best-effort detection and
 rewriting policy for collisions, and a MeoWWW derivative for HTML
 presentation/posting.

 I don't know anything about Twisted or MeoWWW, so I can't say how they
 address the subjects above.

Twisted is a pythonic library that implements most of the basic network
protocols.  Among other things it has an RFC conformant NNTP server and
client implementations.  Creating an NNTP server with a backing message
store is, literally, three lines in Python.  Of course it doesn't
support all the nifties that real netnews servers do, a la expires,
administrative controls, feeds, etc.  It's not intended for that market,
and Mailman doesn't need those supports.  If deployment sites need that,
they're going to be using inn2|[BCD]News|Diablo anyway.

MeoWWW is a (very inefficient but fixable) pythonic CGI which supports
reading and posting to netnews via NNTP.  It has various nice UI points,
a decent feature set (more than we have now), and does The Right Thing
in almost every aspect I've checked except for performance in the spool
reads.

 I can say that I'm not sure about an NNTP-based storage solution...

We should really start out by splitting that discussion.  NNTP is an
access protocol.  Netnews servers have various storage formats and
techniques.  Currently NNTP and IMAP are the only standardised
wide-deployment protocols for message spool access.  I'm not interested
in IMAP for the reasons previously discussed.  NNTP isn't great, but it
is already supported by Mailman for the new gating features and adds a
clean abstraction model which allows trivial replacement of Mailman's
implementation by inn2|[BCD]news|Diablo|whatever should the deployment
site wish.  Additionally, again as a standards-etc based protocol, it
allows clean abstraction for archive presentation: anything that talks
NNTP can now be an effective Mailman archive presenter.  Ditto for
archive indexing.

As a dev I'm interested in arguments about how to handle the store
behind the NNTP interface -- I find that stuff fun and intriguing -- but
also think they are fairly uninteresting right now for Mailman
specifically.  The 90% case for Mailman will have less than 200K
messages in their site-wide spool, and most of those an order of
magnitude less.  For me the interesting point is that once we abstract
the message storage behind a well-supported standards-based protocol we
can incrementally improve our implementation and those really concerned
with the larger cases can throw in inn2 or whatever else, like a filter
to SQL, instead.  ITMT we get the flexibility and time to grow and do it
Really Right.  Additionally, having adopted such a well-defined
abstraction model once, should something better appear down the road it
should be a comparatively small cost to support that in addition or
instead.

 ... although certain storage techniques we've recently discussed
 borrow a lot from extant NNTP implementations, and I'm not sure how
 much sense it would make to rip out just those parts we know we need,
 or if we could actually reasonably take the whole thing,
 kit-n-caboodle.

Which may indeed happen.

 I do believe that we need an alternative solution to the message-id
 header as it was presented to us in the message, as a stable
 guaranteed unique (well, as good as MD-5 or SHA-1 gets) message
 identifier that can always be used to refer to the exact same message
 no matter what.

I'm of two minds here.  I see the temptation.  I like using
Message-IDs, and they are a natural fit to the model semantically, but
messing with Message-IDs has unpleasant effects for some other systems.

<shrug>

 Whether we use this message identifier as a replacement for the
 message-id header value as it was presented to us -- I think that's a
 more philosophical discussion, and I think we should address it by
 allowing both options but deciding which would be a reasonable default
 to take.

<nod> I'm on the side of rewriting Message-IDs if we do generate our own
keys.  I don't like it, but it seems the cleanest approach.

 Given that the mailman UI is basically completely contained within the
 CGI, I'm inclined to leave it there and work on improving it
 internally, allowing us to continue to work with most any webserver
 the client may have.  

Agreed.

 I don't know how MeoWWW addresses this issue, either by replacing the
 webserver, or providing additional tools that may make it easier to
 present a good and consistent UI.

MeoWWW is a CGI as discussed above.  Twisted implements both sides of
HTTP in addition to the NNTP discussed above, but I haven't looked at
the details.

-- 
J C Lawrence
-(*)

Re: [Mailman-Developers] Requirements for a new archiver

2003-10-30 Thread Chuq Von Rospach
On Oct 30, 2003, at 6:38 AM, J C Lawrence wrote:

Sure, but I suspect that plumbing Mailman out to http will be just a
proxy rule away from integrating with an existing web server.  That's
not without its headaches too, but should be as widely supported.

Considerably more web servers support CGI-bin than support proxy rules.

And think about all of the colo environments where it's getting 
installed. Proxy stuff may not be welcome there. And you make it 
difficult for someone to integrate Mailman into a larger site 
environment where they want to use tools (like mod_layout) to skin 
things.

Do we know what a typical installation of Mailman is like? Do we know 
how it's used? Do we know what kind of hardware it's really running on, 
or what environments? Do we know what the user base is? What their top 
ten wish list is?

Excuse me for sounding like a product manager, but are these features
because they're needed, or because we think they'd be fun to implement?
And are we building an upgrade the user base can use, or only alpha
geek hardware owners?

(and in reality, I think Barry has a good intuitive sense of these
issues, but I wanted to have all of us remember it, and maybe it
wouldn't be a bad idea to get some objective data)





Re: Efficient final message disposition (was Re: [Mailman-Developers] Requirements for a new archiver)

2003-10-30 Thread Brad Knowles
At 9:53 AM -0500 2003/10/30, Barry Warsaw wrote:

 I'm sure you guys can identify more issues <wink>.  Look at the
 complexity in SMTPDirect.py, and even there, we still have problems.

	I'm not a programmer, so I can't really help you there.  ;-(

 So how do we design a system where we can push the complexity and
 efficiency concerns out past our boundary?

	I can say that I think we need to look at all of the
recommendations in the following papers:

    Tuning Sendmail for Large Mailing Lists
    Rob Kolstad
    Proceedings of LISA '97
    http://tinyurl.com/t09c

    Drinking from the Fire(walls) Hose:
    Another Approach to Very Large Mailing Lists
    Strata Rose Chalup, Christine Hogan, Greg Kulosa, Bryan McDonald,
    and Bryan Stansell
    Proceedings of LISA '98
    http://tinyurl.com/t09k

	There may be others that we need to look at, but of which I am
not (yet) aware.  If anyone knows of any, please let me know.

	We're already doing some of the things recommended in these 
papers, but not everything.  And I think there may be a couple more 
things we can do that are not mentioned, but which would be a further 
help.

	However, if you want to hand all this work to an external final 
mail-merge delivery agent, this is moot.  We just need to make sure 
that the selected FMMDA addresses all these issues.  We could use an 
existing tool (e.g., bulk_mailer from 
ftp://cs.utk.edu/pub/moore/bulk_mailer/), or we could create a 
separate package to address this issue (of course, that brings the 
ball back into our court).

	Or, you could just have Chuq solve this problem for you, as he 
mentioned in 
http://mail.python.org/pipermail/mailman-developers/2000-May/006820.html. 
;-)

 So what can we do here to improve matters?

	Sounds to me like you want to externalize this whole process. 
Problem is, bulk_mailer is the only tool I know of that currently 
exists as a partial attempt to address this problem, although perhaps 
some additional work on it could fill in the rest.  Alternatively, 
you develop, or work with someone else to develop, an alternative to 
bulk_mailer that does all the things you want and which can be used 
as an external tool.

--
Brad Knowles, [EMAIL PROTECTED]


Re: Efficient final message disposition (was Re: [Mailman-Developers] Requirements for a new archiver)

2003-10-30 Thread Chuq Von Rospach
On Oct 30, 2003, at 7:48 AM, Brad Knowles wrote:

Tuning Sendmail for Large Mailing Lists
http://tinyurl.com/t09c
400K/day aggregate max

Drinking from the Fire(walls) Hose:
http://tinyurl.com/t09k
380K/day aggregate max

(yawn. My server's bored. snicker)

But seriously, both of them are built around pre-sendmail 8.12
environments. There's some interesting stuff there, but it's now fairly
dated, since sendmail 8.12 really changes the landscape. And all of
those other environments

	Or, you could just have Chuq solve this problem for you, as he
mentioned in
http://mail.python.org/pipermail/mailman-developers/2000-May/006820.html. ;-)

gack.


 So what can we do here to improve matters?
	Sounds to me like you want to externalize this whole process. Problem  
is, bulk_mailer is the only tool

Because pretty much every MLM has internalized the process. By the end
of November, I'll have completely retired any use of bulk_mailer on my
systems for other solutions.

One big reason: increasing spam blocking (stupid or otherwise) of  
non-individually addressed email. The old list server setup of:

to: subscribers of list [EMAIL PROTECTED]
bcc: [EMAIL PROTECTED]

is increasingly risky as far as delivery is concerned. I also don't
think it allows for the kind of personalization that's needed for your
general audiences (help URLs, unsub URLs, etc).

And with sendmail 8.12, queue groups and envelope splitting, frankly,  
bulk_mailer does more harm to the delivery stream than good. Just stuff  
it into sendmail, tune sendmail to split intelligently. bulk_mailer is  
obsolete... and much to my amusement, a few sites block based on its  
use in headers (idiots), which is why my copy identifies itself as  
ulkbay_ailermay.





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-30 Thread John A. Martin
 claw == J C Lawrence
 Re: [Mailman-Developers] Requirements for a new archiver 
  Wed, 29 Oct 2003 21:22:32 -0500

claw I may be unusual in this regard, but I generally consider
claw list archives as one-way systems: messages go in and never
claw come out.

Out of idle curiosity, why doesn't 'write once read many' indicate a
directory more than a database?

jam





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-30 Thread J C Lawrence
On Thu, 30 Oct 2003 16:20:10 -0500 
John A Martin [EMAIL PROTECTED] wrote:

 Out of idle curiosity, why doesn't 'write once read many' indicate a
 directory more than a database?

1) The filesystem is a database.

2) Unix filesystems have extremely limited meta-data.

3) A discussed format is putting the messages on the filesystem (as a
DB), and the meta data in a different DB (primarily due to
open(2)/stat(2) expense).
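
A rough sketch of the split in point 3: message bodies live as files, while the metadata lives in a separate database, so lookups need no open(2)/stat(2) calls. SQLite, the table layout and the function names are all assumptions of this sketch, not a proposed design.

```python
import hashlib
import sqlite3
from pathlib import Path


def store_message(root: Path, db: sqlite3.Connection,
                  message_id: str, raw: bytes) -> Path:
    """Write the message to a file and record its metadata in the database."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages "
        "(message_id TEXT PRIMARY KEY, path TEXT NOT NULL, size INTEGER NOT NULL)"
    )
    # Hash the ID so the file name is filesystem-safe and directories stay small.
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    path = root / digest[:2] / digest[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(raw)
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO messages (message_id, path, size) VALUES (?, ?, ?)",
            (message_id, str(path), len(raw)),
        )
    return path


def lookup(db: sqlite3.Connection, message_id: str):
    """Answer a metadata query without touching the message files."""
    return db.execute(
        "SELECT path, size FROM messages WHERE message_id = ?", (message_id,)
    ).fetchone()
```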

-- 
J C Lawrence


Rewriting Message-ID (was Re: [Mailman-Developers] Requirements for a new archiver)

2003-10-30 Thread Barry Warsaw
On Tue, 2003-10-28 at 13:30, J C Lawrence wrote:

 Yup.  Of course this heads directly into that beautiful debate of
 whether MLMs should rewrite Message IDs.  Summarising briefly:
 
   If we rewrite all IDs we'll piss off the people who use ID to do dupe
   detection/deletion for courtesy copies.
 
   If we don't do some rewriting some messages won't make it through NNTP
   and some other people will be pissed off.
 
 Two contrasting approaches:
 
   1) We guarantee uniqueness of all Message IDs.  The only way to do
   this is to rewrite all IDs.  This will piss off some people.
 
   2) We best-effort guarantee uniqueness by only guaranteeing uniqueness
   within the last N messages to the list.  This could be done by
   rewriting all IDs, in which case we might as well guarantee total
   uniqueness, or it could be done by keeping a DB of the last N (cf
   CDBD) and either discarding or rewriting detected collisions.  This of
   course means that some messages will be discarded by NNTP and we won't
   know about it.  Some may be willing to accept those risks.

Nice summary, thanks.  Here's a strawman:

In the spirit of RFC 2369 we define a new header called List-Message-ID,
and as in that standard, this field MUST only be generated by a mailing
list, not by end users.  Nested lists SHOULD remove the parent's
List-Message-ID and supply its own.  List-Message-ID conforms to the
same syntax as for Message-ID in RFC 2822.  Of course, for now read the
header as if it had an X- prefix.

When an MLM receives a message, it generates a List-Message-ID header
which is guaranteed to be globally unique.  A cooperating archiver
should use this header as its primary key, and must provide a mechanism
whereby the List-Message-ID can be presented and the archived message
can be returned.  It may fall back to Message-ID when there is no
List-Message-ID header present.

Internally, we use List-Message-ID as the primary key into our message
store.

We further define a header (X-)List-Archived-Message which contains a
url pointing directly to this message in a cooperating archive.
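
A minimal sketch of the stamping step in this strawman, using the X- prefixed names for now. The function name and the archive URL layout are hypothetical; `make_msgid` is the standard library's Message-ID generator and stands in for whatever uniqueness scheme is finally chosen.

```python
from email.message import Message
from email.utils import make_msgid
from urllib.parse import quote


def stamp_list_headers(msg: Message, list_host: str, archive_base: str) -> str:
    """Replace any inherited list headers with this list's own and return the new ID."""
    # A nested list removes the parent's List-Message-ID and supplies its own.
    del msg["X-List-Message-ID"]
    del msg["X-List-Archived-Message"]

    list_mid = make_msgid(domain=list_host)
    msg["X-List-Message-ID"] = list_mid
    # Hypothetical URL scheme: archive base plus the percent-encoded bare ID.
    msg["X-List-Archived-Message"] = archive_base + quote(list_mid.strip("<>"), safe="@")
    return list_mid
```

The sender's original Message-ID is left untouched here; whether to overwrite it is exactly what the knobs below decide.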

Now we have some knobs we can tweak.

Q. When posting a message to News, when should Mailman copy the 
   List-Message-ID header to Message-ID?

A. Never, Only to resolve duplicate rejections, Always

Q. When reflecting a posted message back to the list, when should Mailman
   copy the List-Message-ID header to Message-ID?

A. Never, Always

I think it's time we started filling in the holes in the RFCs for
mailing list functions, such as the interactions we're describing
here.  I propose to start a section of the wiki (or perhaps
www.list.org) to collect these.  Eventually we should try to get
consensus with other archivers and MLMs, and then push a standard, but
that's a long way off.

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-30 Thread Barry Warsaw
On Wed, 2003-10-29 at 13:30, J C Lawrence wrote:

   2) Message IDs are not guaranteed globally unique, but the collision
   rate can be manageable/acceptable in a large number of deployment
   cases.

Ah, which reminds me, elaborating on my strawman, the answers to when
should Mailman rewrite Message-ID on posts should be: Never, Only to
resolve duplicates, Always.
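
Those three answers could be sketched as a policy knob like the one below. Duplicate detection here uses a bounded window of recently seen IDs, along the lines of the "last N" option, not a full history. The class, function and policy names are hypothetical.

```python
from collections import OrderedDict
from email.utils import make_msgid


class RecentIDs:
    """Remember the last `size` Message-IDs seen (a bounded window)."""

    def __init__(self, size: int = 10000) -> None:
        self._size = size
        self._ids: "OrderedDict[str, None]" = OrderedDict()

    def check_and_add(self, message_id: str) -> bool:
        """Return True if the ID is already in the window, else record it."""
        if message_id in self._ids:
            return True
        self._ids[message_id] = None
        if len(self._ids) > self._size:
            self._ids.popitem(last=False)  # evict the oldest entry
        return False


def outgoing_message_id(original: str, policy: str,
                        recent: RecentIDs, list_host: str) -> str:
    """Apply the policy: 'never', 'duplicates' or 'always'."""
    if policy == "always":
        return make_msgid(domain=list_host)
    duplicate = recent.check_and_add(original)
    if policy == "duplicates" and duplicate:
        return make_msgid(domain=list_host)
    return original
```

Because the window is bounded, a collision with an ID older than N messages goes undetected under the middle setting.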

-Barry





Re: Rewriting Message-ID (was Re: [Mailman-Developers] Requirements for a new archiver)

2003-10-30 Thread J C Lawrence
On Thu, 30 Oct 2003 17:47:18 -0500 
Barry Warsaw [EMAIL PROTECTED] wrote:

 In the spirit of RFC 2369 we define a new header called
 List-Message-ID, and as in that standard, this field MUST only be
 generated by a mailing list, not by end users.  Nested lists SHOULD
 remove the parent's List-Message-ID and supply its own.
 List-Message-ID conforms to the same syntax as for Message-ID in RFC
 2822.  Of course, for now read the header as if it had an X- prefix.

 When an MLM receives a message, it generates a List-Message-ID header
 which is guaranteed to be globally unique.  A cooperating archiver
 should use this header as its primary key, and must provide a
 mechanism whereby the List-Message-ID can be presented and the
 archived message can be returned.  It may fall back to Message-ID when
 there is no List-Message-ID header present.
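
A minimal sketch of the proposal quoted above, in Python, assuming the
header is spelled X-List-Message-ID for now.  The function names stamp and
archive_key and the example domain are hypothetical.

```python
from email.message import Message
from email.utils import make_msgid


def stamp(msg, list_domain):
    """MLM side: replace any parent list's header with our own unique id."""
    del msg["X-List-Message-ID"]          # no-op if the header is absent
    msg["X-List-Message-ID"] = make_msgid(domain=list_domain)
    return msg


def archive_key(msg):
    """Archiver side: prefer the list's id, fall back to Message-ID."""
    return msg.get("X-List-Message-ID") or msg.get("Message-ID")
```

An archiver built this way never has to trust the client-assigned
Message-ID unless the message did not pass through a cooperating MLM.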

I haven't finished musing on this (busy day, thus slow on other replies
as well), but my first thought:

  What happens when a given message is sent to several lists on the
  same host?

Does each list do its own munge?  Do we do USENET-style crossposting?  

I want to do crossposting.  I don't think we can due to per-list
customisations.  

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-30 Thread J C Lawrence
On Thu, 30 Oct 2003 17:51:27 -0500 
Barry Warsaw [EMAIL PROTECTED] wrote:
 On Wed, 2003-10-29 at 13:30, J C Lawrence wrote:

 2) Message IDs are not guaranteed globally unique, but the collision
 rate can be manageable/acceptable in a large number of deployment
 cases.

 Ah, which reminds me, elaborating on my strawman, the answers to when
 should Mailman rewrite Message-ID on posts should be: Never, Only to
 resolve duplicates, Always.

Does that mean that we keep a database of all Message-IDs that all lists
on that host have ever seen?  If so, what happens when a single message
is CC'ed to multiple lists?  NetNews servers require global uniqueness
across all newsgroups.

I'm rapidly coming to the conclusion that we have to rewrite all
Message-IDs whenever the internal archive is enabled.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: Efficient final message disposition (was Re: [Mailman-Developers] Requirements for a new archiver)

2003-10-30 Thread J C Lawrence
On Thu, 30 Oct 2003 18:20:56 +0100
Brad Knowles [EMAIL PROTECTED] wrote:
 At 8:41 AM -0800 2003/10/30, Chuq Von Rospach wrote:

 One of them is recipient sorting by average delivery time over the
 past week (probably want a decaying geometric mean), which would
 require tracking log data on a per-recipient basis.
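
One way to read "decaying geometric mean" is an exponentially weighted
average taken over the logarithm of each delivery time.  The Python sketch
below assumes that reading; the class name, the alpha smoothing factor and
the floor of one millisecond are illustrative choices, not anything from
Mailman.

```python
import math


class DecayingGeometricMean:
    """Exponentially decayed geometric mean of delivery times in seconds."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha        # weight given to the newest sample
        self._log_mean = None     # running mean of log(seconds)

    def update(self, seconds):
        log_value = math.log(max(seconds, 1e-3))  # guard against log(0)
        if self._log_mean is None:
            self._log_mean = log_value
        else:
            self._log_mean = (self.alpha * log_value
                              + (1 - self.alpha) * self._log_mean)
        return self.value

    @property
    def value(self):
        return None if self._log_mean is None else math.exp(self._log_mean)


def sort_by_speed(stats):
    """Order recipients fastest first; stats maps address -> tracker."""
    return sorted(stats, key=lambda addr: stats[addr].value)
```

Working in log space keeps one very slow delivery from dominating the
estimate the way it would with an arithmetic mean.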

While I don't disagree, this is really an MTA's job, not Mailman's.
This is why I've been doing log analysis of MXes and routing mail to
customised outbound MTAs on the basis of responsiveness, since early
2000.  Adaptive MX routing is great stuff.

 Another is two-level message handling, by configuring the MTA for the
 initial delivery attempt to use very low timeouts, but then to fall
 back to a secondary MTA (or MTA pool) that uses more standard timeouts
 for those sites that are slower.

Yup.  I did it at the first level with an initial SMTP proxy which
routed based on MX response records pulled from a DB.

 Perhaps in its current form, that is true.  However, not all sites are
 using sendmail 8.12, and of the ones that are, most are probably not
 using it in a manner that is more suitable for mailing lists.

I'm generally of the view that Mailman should do opportunistic domain
sorting and per-MTA customised VERP handoffs (because nobody has
standardised VERP across MTAs), and beyond that to back off.  Mailman's
job is to get the outbound mail into the MTA's spool as quickly as
possible, wrapped in transactions (ie RCPT TO bundles) that are friendly
to efficient processing, and that's it.
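
For what it's worth, opportunistic domain sorting need not be more than a
few lines.  A Python sketch, where bundle_by_domain and the max_rcpts limit
of 50 are made-up names and numbers rather than Mailman's own:

```python
from collections import defaultdict


def bundle_by_domain(recipients, max_rcpts=50):
    """Group addresses by domain, then cut each group into RCPT TO bundles."""
    by_domain = defaultdict(list)
    for addr in recipients:
        domain = addr.rpartition("@")[2].lower()
        by_domain[domain].append(addr)

    bundles = []
    for domain in sorted(by_domain):
        addrs = by_domain[domain]
        # One SMTP transaction per slice of at most max_rcpts addresses.
        for start in range(0, len(addrs), max_rcpts):
            bundles.append(addrs[start:start + max_rcpts])
    return bundles
```

Each bundle then becomes one SMTP transaction handed to the MTA, which is
free to do whatever further sorting it likes.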

We're not in the game of second guessing the MTAs.  That way lies wasted
time and madness.

 However, given the issues you've mentioned, it would probably be a
 good idea to be able to turn off selected bulk_mailer type features,
 so that you can let the MTA do more of its job better -- if it is
 configured to do so.

There are thresholds for covering up for broken software.  There are
also thresholds for covering up for SysAdm negligence or oversight.
You've got to pick where you stop accepting the problem. Ideally we
should be resilient and friendly to both.  Realistically we need to do
something reasonable and not worry too hard about the rest.

Priorities.

Mailman's primary performance problems are not at the MTA hand off.  MTA
configuration and tuning for mailing lists is only a minor art.  There
is not-inconsiderable documentation and understanding of the field.  A
US$2K commodity box subjected to moderate tuning efforts using readily
available documentation can sustain 2,400 outbound deliveries per
minute.  You do the arithmetic.  In a perfect world that maps out to 3.4
million per day.  Cut that under half for queue injection overhead and
other crap and you're still talking a million deliveries per day for a US$2K
host.[1] A million messages a day already puts us above the 99th
percentile for list server audiences.  I'm not really concerned about
that problem.
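
The arithmetic, spelled out in Python for anyone who wants to plug in
their own figures (the variable names are only for illustration):

```python
per_minute = 2400                    # sustained outbound deliveries per minute
per_day = per_minute * 60 * 24       # perfect-world daily total
halved = per_day // 2                # before trimming further for overhead

print(per_day)   # 3456000
print(halved)    # 1728000
```

So even well under half of the perfect-world figure leaves room above a
million deliveries a day.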

Where Mailman's performance hurts is in the handling of the list
configs, especially for lists with very large membership rosters, and in
queue runner performance and overhead (try watching queue runner's
system resource profile in v2.1 for lists with > 50,000 members).  For
me those are the obvious low hanging fruit, and those are the points
that will help not just the performance hounds, but also the lower 80%
who are running under-provisioned under-configured under-admined
multi-purpose boxes who want Mailman to be a bit more reasonable and
forgiving about their not-so-brilliant systems.

  [1] That's of course assuming reasonable sustained queue size and
  responsive MXes.  However, those are separate problems and ignoring
  MTA-specific behaviours (like Exim's active hatred of large queues),
  the methods and systems to segment and tame those problems are fairly
  well known.

--
J C Lawrence
-(*)Satan, oscillate my metallic sonatas.
[EMAIL PROTECTED]   He lived as a devil, eh?
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: Efficient final message disposition (was Re: [Mailman-Developers] Requirements for a new archiver)

2003-10-30 Thread J C Lawrence
On Thu, 30 Oct 2003 08:41:17 -0800 
Chuq Von Rospach [EMAIL PROTECTED] wrote:
 On Oct 30, 2003, at 7:48 AM, Brad Knowles wrote:

 One big reason: increasing spam blocking (stupid or otherwise) of
 non-individually addressed email. The old list server setup of:

 to: subscribers of list [EMAIL PROTECTED] 
 bcc: [EMAIL PROTECTED]

 is increasingly risky as far as delivery is concerned. 

I've seen a couple mail BCPs and internal spam-handling plans at large
ISPs and corporates which explicitly include the line item:

  Discard all mail with more than one address in the envelope.

Scary, stupid, true: They want the pain to stop.  I find it hard to
blame them.

 I also don't think it allows for the kind of personalization that's
 needed for your general audiences (help URLs, unsub URLs, etc).

Aye, such VERPish attributes are becoming a necessity.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: Efficient final message disposition (was Re: [Mailman-Developers] Requirements for a new archiver)

2003-10-30 Thread J C Lawrence
On Thu, 30 Oct 2003 09:53:19 -0500 
Barry Warsaw [EMAIL PROTECTED] wrote:

 - The desire/requirement that Mailman chunk and sort recipients

This shouldn't be any more complex than domain sorting, and need not be
perfect.

 - The ability for Mailman to swamp the mail server or cause the mail
 server to consume all available cpu

Rate limiting.

 - The fact that failures in upstream mail server are reported to
 Mailman as bounces instead of as error codes

I don't know that Mailman can do anything about this.  We can't reliably
distinguish between system errors and delivery failures for MTAs beyond
Mailman's borders.  There's a protocol hole here I don't know we can or
should attempt to fix.

 - Inefficiencies in VERP/personalization/mail-merge because of the
 lack of cooperation

Oh yeah.

 - The need for Mailman to queue outgoing messages that aren't
 completely delivered

Queue runner could do with some more intelligence in that dept.

 Mailman wants to simply hand that data off to some agent and forget
 about it.  It wants to know that the agent will make best effort to
 mail merge and deliver.  It wants to be informed of any final delivery
 failures.  And that's it.  Mailman doesn't want to chunkify
 recipients, and it doesn't want to sort them.  It doesn't want to
 worry about a mail server effectively managing system resources.  I'd
 rather not have to hand it a couple of meg of recipient or
 substitution data, but there seems to be no other way.

 So what can we do here to improve matters?

Start yelling at DJB, Wietse, Phillip, and Eric about a standardised
SMTP extension for VERP.  With a little luck and minor work we can
probably get some of the other commercial mail people involved as well.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 3:06 PM -0500 2003/10/27, Kevin McCann wrote:

 I was thinking about using MHonarc to enhance the archive experience but
 it doesn't work with MySQL directly so Mail::Box just might be what the
 doctor ordered.
	No database handles BLOB (Binary Large OBject) storage well. 
Even high-end databases have problems in this area.  IMO, this is a 
bad idea.

	Better would be to use a mailbox format that handles simultaneous 
multiple access reasonably well.  You can use c-client and mbx 
format, or MH format, or something else reasonably decent.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 12:41 AM -0500 2003/10/28, J C Lawrence wrote:

 Quite, this is how/why NNTP uses Message-IDs as unique indexing
 qualifiers.
	Problem is that client-assigned message-ids are not guaranteed 
unique.  Too many people are using RFC 1918 private addressing space, 
and if the machine doesn't know its own name, then it stuffs in just 
the IP address for that portion.  Everything else could quite 
feasibly collide, and you'd wind up with multiple non-unique 
message-ids.

	You need a guaranteed unique id to be used as a primary index field.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 3:12 PM -0500 2003/10/27, Barry Warsaw wrote:

 What would then be in the database would be records providing easy
 lookup by message-id (at least) into the on-disk message store.
	Putting meta-data into the database would work.  Then use that 
index information to actually access the files.  I recommended the 
same in my invited talk at 
http://www.shub-internet.org/brad/papers/dihses/.

	Of course, if you're going to use a USENET interface, you should 
use Diablo as the back-end.  ;-)

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 16:29:09 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:
 At 12:41 AM -0500 2003/10/28, J C Lawrence wrote:

 Quite, this is how/why NNTP uses Message-IDs as unique indexing
 qualifiers.

 Problem is that client-assigned message-ids are not guaranteed
 unique.  

Right, and that was the point.  If we do nothing to Message IDs we don't
change external behaviour.  If we use a netnews backing store for the
archives and we don't dick with the message IDs we run the risk of some
messages never reaching the archives.  If we use a netnews backing store
and dick with message IDs we can offer various levels of guarantee that
messages reach the archives, and of pissing off users because we messed
with the Message IDs.

As always, you get to pick.

 Everything else could quite feasibly collide, and you'd wind up with
 multiple non-unique message-ids.

In which case the many people currently using ID-based dupe collapsing
(eg default Exchange config) will lose messages, and the archives will
lose messages... OR... we offer some level of guarantee (see yesterday's
discussion) with the matching trade-offs.

 You need a guaranteed unique id to be used as a primary index field.

Need is a strong word.  It's very deployment and use-case sensitive.
There are a large number of cases where I'm content to rest on the
assurance that the Message IDs arriving at my lists will always be
unique.  There are also a large number of cases where I'm not willing to
make that assessment, as well as a large number of cases where I'm
willing to simply discard any duplicated Message ID messages at the
archiver level.  Similarly, there are cases where re-writing the Message
IDs in any form is significantly troubling, and cases where it's not.

Need?  No.  It is a deployment choice with easily understood
ramifications.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 11:48 AM -0500 2003/10/29, J C Lawrence wrote:

 You need a guaranteed unique id to be used as a primary index field.
 Need is a strong word.  It's very deployment and use-case sensitive.
	In the case of a database, it is a hard requirement.  A primary 
index field must be guaranteed unique.  There is absolutely no way 
around this issue.

 Need?  No.  It is a deployment choice with easily understood
 ramifications.
	Perhaps for the application, but this is a totally different 
ballgame when it comes to a database.  Google for primary index 
field, and hopefully you will understand.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Kevin McCann
On Wed, 2003-10-29 at 10:13, Brad Knowles wrote:
 At 3:06 PM -0500 2003/10/27, Kevin McCann wrote:
 
   I was thinking about using MHonarc to enhance the archive experience but
   it doesn't work with MySQL directly so Mail::Box just might be what the
   doctor ordered.
 
   No database handles BLOB (Binary Large OBject) storage well. 
 Even high-end databases have problems in this area.  IMO, this is a 
 bad idea.

Agreed. I was thinking more along the lines of storing the message body
as is, which, yes, might sometimes be base-64 encoded. Content headers,
boundary string, etc. could also be stored so as to make decoding (by a
web app) a cinch. You could go further and create attachment files and
point to it in an url or file field. But keep the message intact, as it
was received. That way if you want to get into after-the-fact message
delivery (manual resend, or maybe a member missed a message and wants it
in his/her inbox), it's not a chore.

The Messages_ table that Lyris uses in its database is a good starting
point if one wants to do the same kind of thing. I can dig up the specs
if there is interest.

- Kevin 






Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread John A. Martin
Brad Knowles [EMAIL PROTECTED] writes:

 At 3:06 PM -0500 2003/10/27, Kevin McCann wrote:

  I was thinking about using MHonarc to enhance the archive experience but
  it doesn't work with MySQL directly so Mail::Box just might be what the
  doctor ordered.

   No database handles BLOB (Binary Large OBject) storage
   well. Even high-end databases have problems in this area.
   IMO, this is a bad idea.

   Better would be to use a mailbox format that handles
   simultaneous multiple access reasonably well.  You can use
   c-client and mbx format, or MH format, or something else
   reasonably decent.

Hmm... Maildirs.  With just a bit of minor trickery the unique
filename created to receive a message as it arrives at Mailman might
be put into the saved rfc822 header (much like MTAs place a queue id),
or into the message trailer if you must, and perhaps could be
preserved in the filename as the message is moved/copied from one
directory to another and thereby providing a unique index that can be
included in the message Mailman puts on the wire.
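
A minimal sketch of that trickery using Python's standard mailbox module,
assuming a header named X-Archive-Key (a made-up name) and a function
receive that is likewise hypothetical.  Maildir.add() returns the unique
key it generated for the new file, and that key can then be written back
into the stored copy.

```python
import mailbox
from email.message import Message


def receive(maildir, msg):
    """Store msg and record the Maildir's unique key inside the message."""
    key = maildir.add(msg)            # the Maildir picks the unique name
    stored = maildir[key]             # a mailbox.MaildirMessage
    stored["X-Archive-Key"] = key     # hypothetical header name
    maildir[key] = stored             # rewrite in place; the key is kept
    return key
```

The same key survives moves between new/ and cur/ because only the info
suffix after the colon changes, so it could serve as the index put on the
wire.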

jam





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
On Oct 29, 2003, at 9:41 AM, Brad Knowles wrote:

	In the case of a database, it is a hard requirement.  A primary index 
field must be guaranteed unique.  There is absolutely no way around 
this issue.

Which is why it often makes sense to generate your own. Consider, 
say, identifying every message by an MD5 hash of the message, then 
using that for all of your link generation and access work.
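
A short Python sketch of the idea, assuming the hash is taken over the raw
bytes of the message exactly as received.  The function name message_key
and the URL layout are hypothetical.

```python
import hashlib


def message_key(raw_message):
    """Return a hex MD5 digest of the raw message bytes."""
    return hashlib.md5(raw_message).hexdigest()


def archive_url(raw_message, base="http://lists.example.org/archive/"):
    """Build a link from the key; the base URL is only a placeholder."""
    return base + message_key(raw_message)
```

Two byte-identical messages get the same key, so a message CC'ed to
several lists would need the list name mixed into the hash or the URL if
each copy is to be archived separately.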





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 18:41:20 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:
 At 11:48 AM -0500 2003/10/29, J C Lawrence wrote:

 You need a guaranteed unique id to be used as a primary index field.
 Need is a strong word.  It's very deployment and use-case sensitive.

 In the case of a database, it is a hard requirement.  A primary index
 field must be guaranteed unique.  There is absolutely no way around
 this issue.

Right, and I'm not arguing that.  My point is two fold:

  1) Using Message ID as a primary key is attractive.

  2) Message IDs are not guaranteed globally unique, but the collision
  rate can be manageable/acceptable in a large number of deployment
  cases.

We don't have to guarantee key uniqueness for all messages BEFORE they
are submitted to the message store.  The unique property can be assumed
from external sources (with all that implies) should the deployment case
want that.  There are tradeoffs here, and it is not clear to me that
there is an instant and obvious global solution.

 Need?  No.  It is a deployment choice with easily understood
 ramifications.

 Perhaps for the application, but this is a totally different ballgame
 when it comes to a database.  Google for primary index field, and
 hopefully you will understand.

I'm neither an idiot nor a neophyte in this game.  Yes, a database needs
a primary unique key.  That's not in debate.  The questions are:

  Do we know the key before submission to the store?  

(If we don't the store operation shouldn't be asynchronous)

  Is the risk of discarded messages due to key collisions acceptable?

(Some deployment cases consider such losses acceptable, others can
guarantee uniqueness without Mailman's involvement)

Rotely assuming that Mailman must guarantee key uniqueness before we hit
the message store is not a given, it's a choice.

Let's at least be on the same page.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 1:28 PM -0500 2003/10/29, John A. Martin wrote:

 Hmm... Maildirs.
	Not.

	From http://www.washington.edu/imap/documentation/formats.txt.html:

. mh   This is supported for compatibility with the past.  This is
the format used by the old mh program.
mh is very inefficient; the entire directory must be read
and each file stat()'d, and in order to determine the size
of a message, the entire file must be read and newline
conversion performed.
mh is deficient in that it does not support any permanent
flags or keywords; and has no means to store UIDs (because
the mh compress command renames all the files, that's
why).
	[ ... deletia ... ]

 The Maildir format used by qmail has all of the performance
 disadvantages of mh noted above, with the additional problem that the
 files are renamed in order to change their status so you end up having
 to rescan the directory frequently to locate the current names (particularly in
 a shared mailbox scenario).  It doesn't scale, and it represents a
 support nightmare;
	[ ... deletia ... ]

So what does this all mean?

  A database (such as used by Exchange) is really a much better
 approach if you want to move away from flat files.  mx and especially
 Cyrus take a tentative step in that direction; mx failed mostly because
 it didn't go anywhere near far enough.  Cyrus goes much further, and
 scores remarkable benefits from doing so.
  However, a well-designed pure database without the overhead of
 separate files would do even better.


	Of course, we all know about the database problems of Exchange, 
and how Exchange admins have to frequently shut everything down and 
clean their databases, how often they crash, how often they 
completely trash all e-mail for all their users, etc.

	I submit that the reason for this is the combination of crappy 
Microsoft-style programming and the fact that no database handles 
BLOBs well.  Even top-notch programmers have real problems with these 
kinds of implementations -- I am intimately familiar with the 
database implementation methods used in the AOL mail system, and 
suffice it to say that this is a really, really hairy nightmare that 
you do *NOT* want.

	That said, storing meta-data in a real database and then using 
external filesystem techniques for actually accessing the data, 
should give you the best of both worlds -- the speed of access of the 
database, and the reliability and well-understood access and backup 
mechanisms of filesystems.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 1:30 PM -0500 2003/10/29, J C Lawrence wrote:

 Right, and I'm not arguing that.  My point is two fold:

   1) Using Message ID as a primary key is attractive.
	Agreed.

   2) Message IDs are not guaranteed globally unique, but the collision
   rate can be manageable/acceptable in a large number of deployment
   cases.
	Outside of a database, this may be something you can decide 
whether or not to live with.  Within the confines of a database, this 
simply is not possible.

	The ANSI SQL specification has some hard requirements for a 
primary index key:

		1.  It cannot ever be null.

		2.  It must always be guaranteed unique.

	I'm sure there are other requirements.  But these two are a good start.

 We don't have to guarantee key uniqueness for all messages BEFORE they
 are submitted to the message store.
	All other keys could potentially be non-unique, or null, but not 
the primary index key.  This is why many applications have the 
database assign the primary index key itself on insertion into the 
table, so that all the necessary requirements can be met.
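
	A minimal sketch of that arrangement using Python's sqlite3 module. 
The table layout, the column names and the helpers open_index, store and 
lookup are all hypothetical; the point is only that the database assigns 
the primary key while message-id is an ordinary, non-unique index.

```python
import sqlite3


def open_index(path=":memory:"):
    """Create the metadata index; message bodies stay on the filesystem."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id         INTEGER PRIMARY KEY AUTOINCREMENT,
               message_id TEXT,
               path       TEXT NOT NULL
           )"""
    )
    # Secondary index: fast lookup, but duplicates and NULLs are allowed.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_message_id ON messages (message_id)"
    )
    return conn


def store(conn, message_id, path):
    """Insert one record and return the key the database assigned."""
    cur = conn.execute(
        "INSERT INTO messages (message_id, path) VALUES (?, ?)",
        (message_id, path),
    )
    conn.commit()
    return cur.lastrowid


def lookup(conn, message_id):
    """Return every (id, path) row recorded for a message-id."""
    return conn.execute(
        "SELECT id, path FROM messages WHERE message_id = ? ORDER BY id",
        (message_id,),
    ).fetchall()
```

	With this layout a colliding message-id simply yields two rows 
rather than a rejected insert.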

 I'm neither an idiot nor a neophyte in this game.  Yes, a database needs
 a primary unique key.
	Then you must realize that we could not possibly use message-id 
as the primary index key, unless this is a field that we generate 
ourselves in such a way that all the necessary requirements are met.

 Rotely assuming that Mailman must guarantee key uniqueness before we hit
 the message store is not a given, it's a choice.
	The message-id is not necessarily the primary index key.  See above.

	With regards to a primary index key, there simply is no choice. 
The message-id could continue to be one of the many secondary index 
keys, which is a totally different issue.

 Let's at least be on the same page.
	Agreed.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
On Oct 29, 2003, at 10:45 AM, Brad Knowles wrote:

	That said, storing meta-data in a real database and then using 
external filesystem techniques for actually accessing the data, should 
give you the best of both worlds -- the speed of access of the 
database, and the reliability and well-understood access and backup 
mechanisms of filesystems.

Hint: look at what INN did when they implemented cycbufs.

Effectively, you create 1-N files, or create files as needed. Each file 
is N bytes long, pre-allocated on file creation. When you store 
messages, they're written into the file sequentially (or any other way 
you want. If you want to get into best fit allocations and turn this 
into a malloc() style heap, be my guest).

Metadata to access the info is then a filename, and an lseek() pointer 
into the file, and # of bytes to read, plus your normal identifying 
info. It's fast, it's efficient use of file pointers, it avoids the 
worst aspects of the unix file system, and I'm amazed nobody ever 
thinks to use it for other purposes (or that it took that long for 
usenet people to discover it, I suggested a simpler variant of it back 
in the 80s and was told inodes are our friends...)

you can even do expiration/purge/etc if you want, by moving stuff 
around and changing the pointers.
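
A stripped-down Python sketch of the scheme as described here, not of 
INN's actual code: one pre-allocated file, sequential writes, and a 
(filename, offset, length) locator per message.  The class name 
BufferStore is made up, and wrap-around and expiry are left out.

```python
class BufferStore:
    """One pre-allocated file holding messages back to back."""

    def __init__(self, path, size):
        self.path = path
        self.size = size
        self.write_pos = 0
        with open(path, "wb") as f:
            f.truncate(size)          # pre-allocate the whole buffer

    def put(self, data):
        """Append data and return its (path, offset, length) locator."""
        if self.write_pos + len(data) > self.size:
            raise OSError("buffer full; a real store would open a new file")
        with open(self.path, "r+b") as f:
            f.seek(self.write_pos)
            f.write(data)
        locator = (self.path, self.write_pos, len(data))
        self.write_pos += len(data)
        return locator

    @staticmethod
    def get(locator):
        """Read a message back using nothing but its locator."""
        path, offset, length = locator
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

The locator is exactly the metadata that would sit in the database row 
next to the usual identifying fields.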

I've even thought of using it as the backing store for a picture 
library. With a nice relational database and a series of these data 
boxes, I think you can store data in the best and fastest possible 
way...





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Peter C. Norton
On Wed, Oct 29, 2003 at 07:45:53PM +0100, Brad Knowles wrote:
 At 1:28 PM -0500 2003/10/29, John A. Martin wrote:
 
  Hmm... Maildirs.
 
   Not.
 
   From 
   http://www.washington.edu/imap/documentation/formats.txt.html:

[deletia]

I don't know why a reasonable person would cite documentation
pertaining to UW-IMAP, a server that has been a standards, security
and performance bummer.

Why not cite http://www.courier-mta.org/mbox-vs-maildir/?

quote

Painting just about every filesystem in existence with the same
brush, and assuming that every filesystem works pretty much in the
same way, is very misleading. Many contemporary high performance
filesystems are designed explicitly for parallel access. For example,
consider the SGI XFS filesystem:

The free space and inodes within each AG are managed independently
and in parallel so multiple processes can allocate free space
throughout the file system simultaneously.[2]

It took me about 6 months to write the first revision of the
maildir-based Courier-IMAP server. The absence of maildir support in
the UW-IMAP server is the reason I wrote it. Many people have found
that it needed less memory, and was faster than UW-IMAP. Many people
observed that upgrading to Courier-IMAP lowered their overall system
load, and increased performance. Large mail clusters with a
network-based fault tolerant, scalable, architecture frequently have
problems deploying mbox-based mailboxes, due to many documented
problems with file locking (file locking is required for mbox-based
mailboxes) with network-based filesystems.[3] As referenced in [3],
maildirs have no issues with NFS (the most common type of a
network-based filesystem) since maildirs do not use locking.

After looking around for some time, I did not find any independent
benchmarks that directly measured the relative performance of mboxes
and maildirs. Therefore I decided to run some actual benchmarks
myself. I defined the test conditions according to UW-IMAP server's
documentation. I created a test environment that stacked the deck in
favor of mboxes. This was done in accordance with the claimed
shortcomings of maildirs as stated in UW-IMAP server's documentation,
in order to accurately measure the magnitude of the claimed problems.
/quote

and at the end:

quote

The final conclusion is that -- except in some specific instances --
using maildirs will be just as fast -- and sometimes much faster --
than mbox files, while placing less of a load on the rest of the mail
system. The claims in the UW-IMAP server's documentation regarding
maildir performance can be supported only in certain, specific, very
narrowly-defined conditions. There is no simple answer on which mail
storage format is better. A lot depends on many variables that vary
widely in different situations. Besides the raw benchmarks shown
above, other factors include the mail server software being used, what
kind of storage is being used, and the available network
bandwidth. The final answer depends on all of the above.

/quote

[flame-bait deleted]

   A database (such as used by Exchange) is really a much better
  approach if you want to move away from flat files.  mx and especially
  Cyrus take a tentative step in that direction; mx failed mostly because
  it didn't go anywhere near far enough.  Cyrus goes much further, and
  scores remarkable benefits from doing so.
 
   However, a well-designed pure database without the overhead of
  separate files would do even better.
 
It always confounds me that people will go for database voodoo and
deride filesystems when a filesystem is a highly specialised database
in and of itself.  Putting things that are in a filesystem into a
database offers the power and flexibility of querying, but certainly
should not be done for the sake of speed (assuming the
filesystem-based implementation meets whatever other requirements are
present).
 
   Of course, we all know about the database problems of Exchange, 
 and how Exchange admins have to frequently shut everything down and 
 clean their databases, how often they crash, how often they 
 completely trash all e-mail for all their users, etc

Which is a good lesson about databases: because of their flexibility,
they cannot be qa'd to cope with all of their uses without being put
into production and losing data and being subsequently fixed.
Filesystems, which have a more narrowly-defined scope, tend to suffer
this less.  That's why database logs that live on filesystems are used
for data recovery when a database eats itself.
 
   I submit that the reason for this is the combination of crappy 
 Microsoft-style programming and the fact that no database handles 
 BLOBs well.  Even top-notch programmers have real problems with these 
 kinds of implementations -- I am intimately familiar with the 
 database implementation methods used in the AOL mail system, and 
 suffice it to say that this is a really, really hairy nightmare that 
 you 

Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 11:38 AM -0800 2003/10/29, Chuq Von Rospach wrote:

 Hint: look at what INN did when they implemented cycbufs.
	I did.  See http://www.shub-internet.org/brad/papers/dihses/.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)
___
Mailman-Developers mailing list
[EMAIL PROTECTED]
http://mail.python.org/mailman/listinfo/mailman-developers


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 11:54 AM -0800 2003/10/29, Peter C. Norton wrote:

 It always confounds me that people will go for database voodoo and
 deride filesystems when a filesystem is a highly specialised database
 in and of itself.
	I am aware of that.  I was aware of that when I first gave my 
invited talk entitled "Design and Implementation of Highly Scalable 
E-mail Systems", which you can find at 
http://www.shub-internet.org/brad/papers/dihses/.

	Note that Eric Allman (author of the original Ingres database, 
among many other things) and Kirk McKusick (author of the Berkeley 
Fast File System) were in the audience.  I did not embarrass myself.

 Databases aren't meant to be storage for abstract binary data.
 They're meant to be a searchable index of data of types they
 understand.
	Correct.  And despite all claims to the contrary from the 
vendors, no database properly understands binary large objects, nor 
do they give you another datatype they do actually understand that 
would be suitable for the storage of e-mail message bodies.

 Assuming I had a clean slate to start a database project for a mail
 store, personally I'd much rather prototype it in something like
 postgresql where I could add data types to deal with email.  I could
 then make header types, text types, mime types classes, etc.  Then I
 could test to see if it was a good idea to implement it.
	IMO, that would be an exercise in futility.  We've been down this 
road a million times before.  We don't need to go down it again to 
know that the result is not likely to be successful, especially when 
we have alternatives that are proven to work well -- we store the 
message meta-data in the database, and then the message bodies in a 
separate message store akin to INN timecaf/timehash heaps (see 
http://www.shub-internet.org/brad/papers/dihses/lisa2000/sld090.htm).

 I think using a standard sql database for doing mail operations is
 asking for trouble.  Standard databases don't know how to parse
 rfc822/2822 headers and that means that you've got to either write a
 whole lot of stored procedures in a clunky query language (or
 java!?!?!) and then maintain it, or you've got to do it all in the
 imap/pop3/whatever server which means a whole lot of yammering traffic
 between the database and the I/P/W server all the time, which == slow.
	You don't ask the database to understand or parse RFC2822 headers 
or messages.  That's up to your application.  You just store data 
using the formats known to the database, and the message bodies 
according to the methods above.
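
A minimal sketch of the split described above, assuming Python's standard email and sqlite3 modules as stand-ins for the application and the database. The table name message_meta, its columns, and the body_ref locator string are hypothetical; nothing here is taken from Mailman or from the referenced slides. The application parses the RFC 2822 headers and hands the database only plain text values, plus an opaque pointer into whatever separate store holds the body.

```python
import sqlite3
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical schema: only column types the database already understands.
SCHEMA = """
CREATE TABLE IF NOT EXISTS message_meta (
    id         INTEGER PRIMARY KEY,   -- key assigned by the database
    message_id TEXT,
    sender     TEXT,
    subject    TEXT,
    sent_at    TEXT,                  -- ISO 8601 string produced by the application
    body_ref   TEXT                   -- opaque locator into the separate body store
)
"""

def store_metadata(db, raw_message, body_ref):
    """Parse headers in the application and insert only plain column values."""
    msg = message_from_string(raw_message)
    sent_at = None
    date_header = msg.get("Date")
    if date_header:
        try:
            sent_at = parsedate_to_datetime(date_header).isoformat()
        except (TypeError, ValueError):
            sent_at = None  # unparseable Date header: keep the row, drop the date
    cur = db.execute(
        "INSERT INTO message_meta (message_id, sender, subject, sent_at, body_ref)"
        " VALUES (?, ?, ?, ?, ?)",
        (msg.get("Message-ID"), msg.get("From"), msg.get("Subject"), sent_at, body_ref),
    )
    db.commit()
    return cur.lastrowid

db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
raw = (
    "From: alice@example.org\n"
    "Subject: cycbufs\n"
    "Date: Wed, 29 Oct 2003 21:25:53 +0100\n"
    "Message-ID: <1@example.org>\n"
    "\n"
    "body text\n"
)
row_id = store_metadata(db, raw, "heap0:0:10")
```

The database never sees the raw message, so no stored procedures or custom types are needed for this part; whether that is fast enough is exactly what the rest of the thread argues about.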

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Peter C. Norton
On Wed, Oct 29, 2003 at 09:25:53PM +0100, Brad Knowles wrote:
  Assuming I had a clean slate to start a database project for a mail
  store, personally I'd much rather prototype it in something like
  postgresql where I could add data types to deal with email.  I could
  then make header types, text types, mime types classes, etc.  Then I
  could test to see if it was a good idea to implement it.
 
   IMO, that would be an exercise in futility.  We've been down this 
 road a million times before.  We don't need to go down it again to 
 know that the result is not likely to be successful, especially when 
 we have alternatives that are proven to work well -- we store the 
 message meta-data in the database, and then the message bodies in a 
 separate message store akin to INN timecaf/timehash heaps (see 
 http://www.shub-internet.org/brad/papers/dihses/lisa2000/sld090.htm).

It seems like you're only partially agreeing/disagreeing with me
(optimist/pessimist).  Disagreeing: you're saying that using datatypes
in the database which are appropriate to the kind of data being stored
(mail messages) is an exercise in futility.  But, agreeing: that
storing these in a database in another way is OK.  I don't get why
you'd just want to store these as text when you have databases that
can be made more suitable to the problem.
 
  I think using a standard sql database for doing mail operations is
  asking for trouble.  Standard databases don't know how to parse
  rfc822/2822 headers and that means that you've got to either write a
  whole lot of stored procedures in a clunky query language (or
  java!?!?!) and then maintain it, or you've got to do it all in the
  imap/pop3/whatever server which means a whole lot of yammering traffic
  between the database and the I/P/W server all the time, which == slow.
 
   You don't ask the database to understand or parse RFC2822 headers 
 or messages.  That's up to your application.  You just store data 
 using the formats known to the database, and the message bodies 
 according to the methods above.

So all the parsing happens in the database client side.  Which is slow.

-Peter

-- 
The 5 year plan:
In five years we'll make up another plan.
Or just re-use this one.




Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence

On Wed, 29 Oct 2003 11:38:33 -0800 
Chuq Von Rospach [EMAIL PROTECTED] wrote:

 Hint: look at what INN did when they implemented cycbufs.

Aye, it's a cute system.

 Effectively, you create 1-N files, or create files as needed. Each
 file is N bytes long, pre-allocated on file creation. When you store
 messages, they're written into the file sequentially (or any other way
 you want. If you want to get into best fit allocations and turn this
 into a malloc() style heap, be my guest).

 Metadata to access the info is then a filename, and an lseek() pointer
 into the file, and # of bytes to read, plus your normal identifying
 info. It's fast, it's efficient use of file pointers, it avoids the
 worst aspects of the unix file system, and I'm amazed nobody ever
 thinks to use it for other purposes (or that it took that long for
 usenet people to discover it, I suggested a simpler variant of it back
 in the 80s and was told inodes are our friends...)

Small caveat: Some modern filesystems make operating on the
one-file-per-message stores extremely efficient.  Admittedly they aren't
in wide cross-platform deployment, but the filesystems and file op
behaviour of today and yesteryear are not quite the same.
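
The layout Chuq describes (a pre-allocated file, sequential writes, and a filename/offset/length triple as the only metadata) can be sketched roughly as follows. This is an illustration under stated assumptions, not INN's actual cycbuf code: the HeapFile class and read_message helper are hypothetical names, the wrap-around behaviour of a real cyclic buffer is left out, and truncate() may create a sparse file rather than truly reserving blocks on some filesystems.

```python
import os
import tempfile

class HeapFile:
    """One fixed-size file; messages are written back to back."""

    def __init__(self, path, size):
        self.path = path
        self.size = size
        self.write_pos = 0
        # Size the whole file up front (may be sparse, see note above).
        with open(path, "wb") as f:
            f.truncate(size)

    def append(self, data):
        """Store data; return (path, offset, length), or None if it will not fit."""
        if self.write_pos + len(data) > self.size:
            return None
        offset = self.write_pos
        with open(self.path, "r+b") as f:
            f.seek(offset)
            f.write(data)
        self.write_pos += len(data)
        return (self.path, offset, len(data))

def read_message(locator):
    """Fetch a message using only the metadata triple."""
    path, offset, length = locator
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

workdir = tempfile.mkdtemp()
heap = HeapFile(os.path.join(workdir, "heap.0"), size=1024)
first = heap.append(b"Subject: one\n\nfirst body\n")
second = heap.append(b"Subject: two\n\nsecond body\n")
```

A caller that gets None back would open the next file in the series; one inode then holds many messages, which is the point of the scheme.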

 I've even thought of using it as the backing store for a picture
 library. With a nice relational database and a series of these data
 boxes, I think you have stored data in the best and fastest possible
 way...

Some years back I talked to Mike Belshe (used to be at Remarq) about
their storage techniques (I caught him shortly after Critical Path
bought Remarq).  Keying off other LISA papers they segmented their
storage space by object size, customising and configuring each segment
to suit (things like RAID strip size, number of spindles, FS tuning
parameters, etc).  He asserted that the rewards were very significant.

However, these are very large archive problems and are a bit outside of
Mailman's home turf.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 13:11:01 -0800 
Chuq Von Rospach [EMAIL PROTECTED] wrote:
 On Oct 29, 2003, at 1:05 PM, David Birnbaum wrote:

 2.  third-party add-ons make it that much harder to install.  If I
 have to set up a Mysql or Postgres database to use Mailman, it's a
 step that will put off people who don't already have it going.
 
 actually, if you do it right, it's much easier -- because when you
 build in those tools, you build in standardized interfaces that third
 party add-ons can access, instead of the current case, which are code
 hacks that break every time Barry burps at the CVS server...

Aye, picking the right interface abstractions is key.

There's also a disjoint between the novice SysAdm case who loves the
fact of Mailman's all-in-one service, and the more meaty chap who
integrates what he needs to.  Much of Mailman's appeal at the low end is
its all-in-one simple-to-install nature.  (Well, ignoring the GID
FAQ...)

Mailman v2.1 has a plugin layer for the membership roster.  It's not a
fully mature interface, but there are LDAP and SQL adaptors in the wild.
At some point those adaptors will move into the Mailman core.  If we
move the archiving components (storage, presentation, index) behind
plugin interfaces as well there's a reasonable opportunity for similar
third parties to build adaptor layers which then also move into the
Mailman core.

Oh yeah, and just to keep Nigel Metheringham hopping:

  Mailman just doesn't have enough configuration options.  

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Peter C. Norton
On Wed, Oct 29, 2003 at 10:14:52PM +0100, Brad Knowles wrote:
 
   I don't believe that there are any databases in existence that 
 ... can be made more suitable to the problem.
 

In theory you can add data types to postgresql.  Not that I've done it
myself, but it's been done.

-Peter

-- 
The 5 year plan:
In five years we'll make up another plan.
Or just re-use this one.




Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 13:59:06 -0800 
Peter C Norton [EMAIL PROTECTED] wrote:
 On Wed, Oct 29, 2003 at 10:14:52PM +0100, Brad Knowles wrote:

 I don't believe that there are any databases in existence that
 ... can be made more suitable to the problem.

 In theory you can add data types to postgresql.  Not that I've done it
 myself, but it's been done.

True, but that doesn't answer the question of whether an RDBMS is a good
storage tool for messages.  I spent a couple months of spare time last
year building an archiving system I liked atop PostgreSQL using fully
decomposed SQL structures for all the message bits.  It was not a pretty
exercise, and the results were worse.  Brad makes excellent points in
his comments on poor BLOB support, the value of DBs for meta-data, and
disaster recovery ease.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 20:11:49 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:
 At 1:30 PM -0500 2003/10/29, J C Lawrence wrote:

 2) Message IDs are not guaranteed globally unique, but the collision
 rate can be manageable/acceptable in a large number of deployment
 cases.

 Outside of a database, this may be something you can decide whether or
 not to live with.  Within the confines of a database, this simply is
 not possible.

Of course, and that's the point.  We are in violent agreement.

 The ANSI SQL specification has some hard requirements for a primary
 index key:

I know, but that's not what I'm asserting.  I'll also ignore the DB
types which don't require primary keys of any form, as that's
essentially what we have now and we're assuming an indexed store
instead.

 We don't have to guarantee key uniqueness for all messages BEFORE
 they are submitted to the message store.

 All other keys could potentially be non-unique, or null, but not
 the primary index key.  

Ahh, I think I see the disjoint.  We're using "key" in two contexts
without distinguishing between them:

  1) The property of a message which identifies that message with a high
  probability of uniqueness.  This can be a Message ID, MD5SUM,
  whatever, but it is not guaranteed unique, it merely is unique most of
  the time for large definitions of "most".

  2) The primary key as used in an indexed DB or other store which is
  guaranteed unique for all cases.

Between the two there's a conflict.  One requires perfect uniqueness.
The other delivers merely a good Best Effort.  The assertion is that we
don't always have to solve that mismatch.  We can elect to live with the
collisions.

 This is why many applications have the database assign the primary
 index key itself on insertion into the table, so that all the
 necessary requirements can be met.

Sure, except that doing that in our case requires that storage be a
synchronous operation (otherwise we don't know the key at
rewrite/delivery time).  That would be a significant change from the
current model and rather unfriendly to a wide range of deployment cases.
Keeping the storage procedure asynchronous with an a-priori key (for
whatever guarantee of uniqueness) makes for a more interesting system.
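
As a rough sketch of what an a-priori key buys: the key below is computed from the message alone, so an archive URL can be written into the outgoing copy before the store has seen the message. The ARCHIVE_BASE URL, the function names, and the choice of a truncated SHA-256 of the Message-ID are all assumptions made for illustration; the thread itself only mentions Message IDs and MD5 sums as candidates, and the X-Archived-At header follows the W3C example quoted earlier in the thread.

```python
import hashlib
from email import message_from_string

ARCHIVE_BASE = "http://lists.example.org/mid/"   # hypothetical base URL

def a_priori_key(msg):
    """Derive a key from the Message-ID alone, without consulting the store."""
    message_id = (msg.get("Message-ID") or "").strip()
    # Best-effort uniqueness only: duplicate or missing Message-IDs collide.
    return hashlib.sha256(message_id.encode("utf-8")).hexdigest()[:16]

def stamp_for_delivery(raw_message):
    """Add the archive URL before delivery; storage can happen later."""
    msg = message_from_string(raw_message)
    msg["X-Archived-At"] = ARCHIVE_BASE + a_priori_key(msg)
    return msg

raw = "From: a@example.org\nMessage-ID: <42@example.org>\n\nhello\n"
stamped = stamp_for_delivery(raw)
```

Because the key does not depend on the store, delivery and archiving can run in either order or on different machines; the price is that two messages with the same Message-ID map to the same key.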

 I'm neither an idiot nor a neophyte in this game.  Yes, a database
 needs a primary unique key.

 Then you must realize that we could not possibly use message-id as the
 primary index key, unless this is a field that we generate ourselves
 in such a way that all the necessary requirements are met.

No, I don't realise that because it is false.  We can use Message IDs as
the primary key right now, today.  In fact, I am, right now, this
minute, today.  You are assuming that every message submitted to the
store must be accepted by the store.  That is an assumption that hasn't
been defined as a requirement and which some evidence suggests isn't a
hard requirement.  A very small percentage of the messages I submit to
my store don't make it.  They have duplicate Message IDs.  They run
through Mailman just fine.  They never reach my list archives.  I know,
expect, accept this.

The primary key has to be unique for every message IN THE STORE.
Accepted.  That does not dictate that the primary key for every message
SUBMITTED to the store has to be unique (note that key assignment is
occurring before collision check), or that the store has to ACCEPT every
message which is submitted to it. Guaranteeing perfect uniqueness of the
keys prior to submission to the store is fragile and expensive.  It is
tempting to do some form of very good approximation (cf. Chuq's MD5SUM).
Without perfect synchrony with the store's keys, if we calculate keys
prior to insertion, or merely accept the keys that are given us in the
form of Message IDs, we're going to get occasional collisions.

The question is how to handle messages whose a priori assigned keys
collide with keys already in the store.  We can handle the collision
case in several ways:

  1) Ignore it and discard messages bearing colliding keys.

  2) Best Effort attempt to guarantee uniqueness within a window, with
  collisions outside the window discarded.

  3) Fully guarantee uniqueness.

The first is easy.  The second is fairly easy.  The third isn't trivial.

In all three cases the population of key values in the store remains
unique.  It's just that the population of keys submitted to the store may
or may not be unique.  Lossage at the insertion layer can be acceptable.
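
Option 1 above (discard on collision) can be sketched in a few lines, assuming sqlite3 as a stand-in for the store; the archive table and the submit function are hypothetical names. The primary-key constraint keeps the keys IN THE STORE unique, and a message whose Message-ID is already present is simply not accepted.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE archive (message_id TEXT PRIMARY KEY, raw TEXT NOT NULL)")

def submit(db, message_id, raw):
    """Return True if stored, False if discarded because the key already exists."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO archive (message_id, raw) VALUES (?, ?)",
                (message_id, raw),
            )
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    submit(db, "<1@example.org>", "first message"),
    submit(db, "<2@example.org>", "second message"),
    submit(db, "<1@example.org>", "duplicate Message-ID, different text"),
]
```

The third submission is lost at the insertion layer, which is the trade-off being argued for: the store stays consistent, and the submitter learns about the loss only from the return value.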

 Let's at least be on the same page.

   Agreed.

Cool.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Peter C. Norton
On Wed, Oct 29, 2003 at 05:10:40PM -0500, J C Lawrence wrote:
 
 True, but that doesn't answer the question of whether an RDBMS is a good
 storage tool for messages.  I spent a couple months of spare time last
 year building an archiving system I liked atop PostgreSQL using fully
 decomposed SQL structures for all the message bits.  It was not a pretty
 exercise, and the results were worse.  Brad makes excellent points in
 his comments on poor BLOB support, the value of DBs for meta-data, and
 disaster recovery ease.

I may not have made it clear, but I'm focusing on the metadata.  Once
you've parsed rfc822/2822, then it may become easier to have things in
the database that can manipulate those types.  I.e. to be able to
do simple searches for a property of given arbitrary headers (w/o
having to have a database schema that consists of a few known headers
and others which you then have to treat as a blob or as text).

-Peter

-- 
The 5 year plan:
In five years we'll make up another plan.
Or just re-use this one.




Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
On Oct 29, 2003, at 2:28 PM, Peter C. Norton wrote:

 I may not have made it clear, but I'm focusing on the metadata.  Once
 you've parsed rfc822/2822, then it may become easier to have things in
 the database that can manipulate those types.  I.e. to be able to
 do simple searches for a property of given arbitrary headers (w/o
 having to have a database schema that consists of a few known headers
 and others which you then have to treat as a blob or as text).

my only real worry is that from what I've seen, 99.99% of the time, the 
user is going to want content searches. header stuff is fine, but of 
really low priority in the scheme of things (necessary to put useful 
things together, meaningless if you can't content/context search in 
fulltext).

that's why I'm leaning, blob issues or no, towards full-text storage in 
MySQL 4. Because if you can't easily chop up the message body content 
and find the messages you want to deal with, elegant storage of the 
headers is irrelevant...

I think you need that, too. But until you get a reasonable context 
search for the message body, designing the rest is silly. And it seems 
to me there are few better methods than dumping the text into MySQL and 
letting it do the work. Compromises, tradeoffs and etc 
notwithstanding...





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
by the way, this statement is in conflict with my previous statement 
of "use cycbufs". I'm fully aware of that conflict, too. resolving it 
will be one of the big challenges.

On Oct 29, 2003, at 4:12 PM, Chuq Von Rospach wrote:

 that's why I'm leaning, blob issues or no, towards full-text storage 
 in MySQL 4. Because if you can't easily chop up the message body 
 content and find the messages you want to deal with, elegant storage 
 of the headers is irrelevant...




Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 16:12:50 -0800 
Chuq Von Rospach [EMAIL PROTECTED] wrote:
 On Oct 29, 2003, at 2:28 PM, Peter C. Norton wrote:

 I may not have made it clear, but I'm focusing on the metadata.  Once
 you've parsed rfc822/2822, then it may become easier to have things
 in the database that can manipulate those types.  I.e. to be able
 to do simple searches for a property of given arbitrary headers (w/o
 having to have a database schema that consists of a few known headers
 and others which you then have to treat as a blob or as text).

 my only real worry is that from what I've seen, 99.99% of the time,
 the user is going to want content searches. header stuff is fine, but
 of really low priority in the scheme of things (necessary to put
 useful things together, meaningless if you can't content/context
 search in fulltext).

I see two needs, for significantly different populations.  The first
wants a browsing interface keyed and indexed by date, thread, and
author.  The second wants full-text search with rapid location and
retrieval of matching messages.  Often a single user will move between
the access methods, reading by thread, bouncing over to a search, then
reading all an author has written that match, then searching again, etc.
As such two distinct sets of indexes seem called for: full text and
message meta-data.

 that's why I'm leaning, blob issues or no, towards full-text storage
 in MySQL 4. Because if you can't easily chop up the message body
 content and find the messages you want to deal with, elegant storage
 of the headers is irrelevant...

True.  However, this seems to conflate two distinct problems.  If
you're going to do unindexed searches then this makes sense; however,
except for minimal cases that's an uninteresting space.  It scales like
crap and has an even worse feature set.  It is more interesting to split
storage and indexing into distinct solution designs, and to build or
pick something tailored for that smaller problem.  That way you don't do
full text searching, you do full text indexing and then search the
indexes.
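
To make the "index first, then search the index" distinction concrete, here is a small in-memory inverted index. It is only a sketch: the tokenizer is a naive lower-casing regex with no language sensitivity or stemming, the names build_index and search are hypothetical, and a real archiver would persist the index rather than hold it in a dict.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")

def build_index(messages):
    """Map each lower-cased word to the set of message keys containing it."""
    index = defaultdict(set)
    for key, body in messages.items():
        for word in TOKEN.findall(body.lower()):
            index[word].add(key)
    return index

def search(index, *words):
    """Return keys of messages containing every given word (boolean AND)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

messages = {
    "m1": "Maildir avoids file locking on NFS",
    "m2": "mbox needs file locking",
    "m3": "cycbufs avoid inode exhaustion",
}
index = build_index(messages)
```

A query such as search(index, "file", "locking") touches only the index entries for those two words; the message bodies are read once, at indexing time, and never scanned again at query time.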

 I think you need that, too. But until you get a reasonable context
 search for the message body, designing the rest is silly. 

Is searching message bodies really interesting, or is building indexes
of message bodies such that you can later search those indexes the
actually interesting point?  

 And it seems to me there are few better methods than dumping the text
 into MySQL and letting it do the work. Compromises, tradeoffs and etc
 notwithstanding...

How does MySQL help you in building language-sensitive rapid response
indexes of large text blobs?

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 4:12 PM -0800 2003/10/29, Chuq Von Rospach wrote:

 that's why I'm leaning, blob issues or no, towards full-text storage
 in MySQL 4. Because if you can't easily chop up the message body
 content and find the messages you want to deal with, elegant storage
 of the headers is irrelevant...
	I think you could do full word indexing per message, and then 
store that index information in the database.  Searching for phrases 
would require hitting the message bodies themselves, but searching 
for individual words could be done on indexed fields.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 16:40:53 -0800 
Chuq Von Rospach [EMAIL PROTECTED] wrote:

 by the way, this statement is in conflict with my previous statement
 of "use cycbufs". I'm fully aware of that conflict, too. resolving it
 will be one of the big challenges.

cycbufs implement a filesystem-based heap with pool semantics.  (There's
a fair bit of literature on that space in the OS and application realm)
As such they are specifically tuned for the case where the number of
calls to malloc() are of a similar magnitude to the calls to free().
This makes sense in a netnews world where news articles expire
regularly, and in general as much data is added to the spool as is
removed from it.

Does that model really apply to list archives?  It doesn't for me.  I
may be unusual in this regard, but I generally consider list archives as
one-way systems: messages go in and never come out.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Thu, 30 Oct 2003 02:52:52 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:

 I think you could do full word indexing per message, and then store
 that index information in the database.  Searching for phrases would
 require hitting the message bodies themselves, but searching for
 individual words could be done on indexed fields.

Consider an index which records not just the fact of a token's presence
in an entity, but also the offsets at which it occurs within the entity.
Searching for phrases then consists of searching for objects which
satisfy the boolean X AND Y, as well as the smaller clause offset(X)
+ length (X) + 1|2 == offset (Y).  Larger phrases extend the
equivalence language linearly, tho they create exponential search costs.
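
The offset clause above can be sketched as a positional index. The code below records the character offset of every token and accepts a phrase when each following word starts 1 or 2 characters after the previous word ends, mirroring offset(X) + length(X) + 1|2 == offset(Y). The function names and the naive regex tokenizer are assumptions for illustration, and no attempt is made to model the search cost of longer phrases.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")

def build_positional_index(messages):
    """Map word -> {message key -> [character offsets of each occurrence]}."""
    index = defaultdict(lambda: defaultdict(list))
    for key, body in messages.items():
        for match in TOKEN.finditer(body.lower()):
            index[match.group()][key].append(match.start())
    return index

def phrase_search(index, phrase, max_gap=2):
    """Return keys where the words occur consecutively, 1..max_gap chars apart."""
    words = TOKEN.findall(phrase.lower())
    if not words:
        return set()
    hits = set()
    for key, starts in index.get(words[0], {}).items():
        for start in starts:
            end = start + len(words[0])            # offset(X) + length(X)
            for word in words[1:]:
                nexts = index.get(word, {}).get(key, [])
                # offset(Y) must sit 1 or 2 characters past the end of X
                follow = [p for p in nexts if 1 <= p - end <= max_gap]
                if not follow:
                    break
                end = follow[0] + len(word)
            else:
                hits.add(key)                      # every word chained up
                break
    return hits

messages = {
    "m1": "file locking is required for mbox",
    "m2": "locking a file is slow",
    "m3": "file -- locking",
}
pindex = build_positional_index(messages)
```

With this index, phrase queries never need to open the message bodies; m2 and m3 both contain "file" and "locking" but fail the offset test for the phrase "file locking".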

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
On Oct 29, 2003, at 5:52 PM, Brad Knowles wrote:

 I think you could do full word indexing per message, and then store
 that index information in the database.  Searching for phrases would
 require hitting the message bodies themselves, but searching for
 individual words could be done on indexed fields.

you could, but is it worth doing it yourself when MySQL is building it 
for you?

http://www.mysql.com/doc/en/Fulltext_Search.html

http://jeremy.zawodny.com/blog/archives/000576.html

http://www.zend.com/zend/tut/tutorial-ferrara1.php

If you were just storing into a TEXT and then doing SELECT LIKE into 
it, I'd agree with you. But MySQL is doing interesting things here. Why 
not leverage it?





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
On Oct 29, 2003, at 6:16 PM, J C Lawrence wrote:
 I see two needs, for significantly different populations.  The first
 wants a browsing interface keyed and indexed by date, thread, and
 author.  The second wants full-text search with rapid location and
 retrieval of matching messages.  Often a single user will move between
 the access methods, reading by thread, bouncing over to a search, then
 reading all an author has written that match, then searching again,
 etc.  As such two distinct sets of indexes seem called for: full text
 and message meta-data.

  I think you need that, too. But until you get a reasonable context
  search for the message body, designing the rest is silly.

 Is searching message bodies really interesting, or is building indexes
 of message bodies such that you can later search those indexes the
 actually interesting point?

You're basically asking "why do you need google when you have yahoo?"

ask the folks who depend on google.

(and yes, I'm oversimplifying to make a point).

 How does MySQL help you in building language-sensitive rapid response
 indexes of large text blobs?

just posted a bunch of links.





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
On Oct 29, 2003, at 6:22 PM, J C Lawrence wrote:

cycbufs implement a filesystem-based heap with pool semantics.  (There's
a fair bit of literature on that space in the OS and application realm)
As such they are specifically tuned for the case where the number of
calls to malloc() are of a similar magnitude to the calls to free().
This makes sense in a netnews world where news articles expire
regularly, and in general as much data is added to the spool as is
removed from it.

Does that model really apply to list archives?  It doesn't for me.  I
may be unusual in this regard, but I generally consider list archives as
one-way systems: messages go in and never come out.

and in general, you're mostly right. Deletions out of archives are 
pretty minimal. But I think cycbufs still make a lot of sense as a way 
to reduce design complexity needed to avoid using up potentially 
infinite numbers of inodes, and the performance and design complexity 
inherent in building a storage structure around a typical unix 
filesystem.

It's just so much less hassle on any number of levels dealing with 50 
100 megabyte files than it is a directory structure with 500 megabytes 
of messages spread around 100,000 individual files. whether it's 
backups and restores, migrating data to a new server, etc, etc etc, you 
make life much simpler. And god help you if you're updating that 
structure when the system crashes and you have to fsck and put it back 
together again.





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 9:22 PM -0500 2003/10/29, J C Lawrence wrote:

 cycbufs implement a filesystem-based heap with pool semantics.  (There's
 a fair bit of literature on that space in the OS and application realm)
 As such they are specifically tuned for the case where the number of
 calls to malloc() are of a similar magnitude to the calls to free().
 This makes sense in a netnews world where news articles expire
 regularly, and in general as much data is added to the spool as is
 removed from it.
	So long as the calls to malloc() are kept reasonably small (which 
is typically true in this case), it shouldn't matter whether or not 
there are any free() calls.  Yes, you slowly build up more disk space 
in utilization, but all archive solutions will have the same problem, 
and this solution will scale as well as, or better than, any other 
that I know of.

	Consider the case where you are trying to store all news articles 
that have ever been posted -- not really much difference.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Thu, 30 Oct 2003 04:08:45 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:
 At 9:22 PM -0500 2003/10/29, J C Lawrence wrote:

 cycbufs implement a filesystem-based heap with pool semantics.
 (There's a fair bit of literature on that space in the OS and
 application realm) As such they are specifically tuned for the case
 where the number of calls to malloc() are of a similar magnitude to
 the calls to free().  This makes sense in a netnews world where news
 articles expire regularly, and in general as much data is added to
 the spool as is removed from it.

 So long as the calls to malloc() are kept reasonably small (which is
 typically true in this case), it shouldn't matter whether or not there
 are any free() calls.  

I've written several heap managers including several pool based systems
as well as other sorts of custom allocators.  There are a great many
simplifications that come along with the write-once approach, especially
in terms of the trade-offs between allocation expense and free space
management.

 Yes, you slowly build up more disk space in utilization, but all
 archive solutions will have the same problem, and this solution will
 scale as well as, or better than, any other that I know of.

Which is not exactly my point.  cycbufs are a useful technique to be
sure, much as Chuq has discussed from a management perspective.  My
point is more that I don't see that they add anything essentially
different to the storage space in terms of storage semantics.  You get a
higher rate of file handle re-use, a more friendly filesystem behaviour
for older filesystem designs (pleasant optimisations), but exactly the
same single key -> byte stream mapping, without adding any more
interesting verbs or transforms to the solution space.

This is not a Bad Thing, just not something that seems applicable at
this stage in the design discussion.  First come ontology and semantics,
then comes implementation.

 Consider the case where you are trying to store all news articles that
 have ever been posted -- not really much difference.

Actually the two cases are considerably different.  In the delete case I
have to do pool management, with some eye toward fragmentation control
and optimisations of average latency for free heap searches, as well as
heap integrity audits.  In the write-only case I just build on the end
and need pay no mind to prior data once it is allocated.  In both cases
I have to do predictive work on the distribution of allocation sizes,
but that's far cheaper in the write-only case as the multiple-pool
search overhead can be entirely skipped.  There's a considerable
difference in complexity between the two.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 7:00 PM -0800 2003/10/29, Chuq Von Rospach wrote:

 you could, but is it worth doing it yourself when MySQL is building
 it for you?
 http://www.mysql.com/doc/en/Fulltext_Search.html
	From the top of this page:

6.8  MySQL Full-text Search

As of Version 3.23.23, MySQL has support for full-text indexing and 
searching.  Full-text indexes in MySQL are an index of type FULLTEXT. 
FULLTEXT indexes are used with MyISAM tables only and can be created 
from CHAR, VARCHAR, or TEXT columns at CREATE TABLE time or added 
later with ALTER TABLE or CREATE INDEX.  For large datasets, it will 
be much faster to load your data into a table that has no FULLTEXT 
index, then create the index with ALTER TABLE (or CREATE INDEX). 
Loading data into a table that already has a FULLTEXT index could be 
significantly slower.

	Moreover, mail messages will be of undetermined, variable length. 
Can MySQL support a 32-bit VARCHAR?  What about type TEXT?  Or 8-bit 
or even 16-bit character sets?  Since you might be storing a lot of 
MIME bodypart types, can it handle BLOBs, and can it handle them 
well?  Or, do you do parsing within your archive application and 
store the entire message somewhere outside of the database, while 
storing a FULLTEXT index of only the bodypart types you declare to be 
human-readable?

	What if you want to do a case-sensitive search?  In that case, it 
doesn't look like FULLTEXT or MATCH will do you any good, since MATCH 
is declared to be case-insensitive.  Or what if you want to search 
for hyphenated literals?  It seems that MATCH considers them to be 
word breaks even within literal searches.

 If you were just storing into a TEXT and then doing SELECT LIKE into it,
 I'd agree with you. But MySQL is doing interesting things here. Why not
 leverage it?
	I'm not sure it really helps in this case.  I'm not sure it can 
handle the amounts of data that might need to be stored into a field, 
or the different character sets that might need to be used.  I'm also 
concerned about what using this function might do to the overall 
speed and size of the database.

	On the page quoted above, look for the benchmark data reported by 
Jim Nguyen and John Takacs.  Two million rows of text, with 
multiple-word searches (three or more words) taking 30 seconds to a 
minute to complete, is not good performance.  Three to five million 
rows, with searches taking 50 seconds or more for single words, is 
not good performance.

	Now, consider how many words might be in a single message 
(hundreds to thousands or even tens of thousands), and how many 
messages might be in a single archive (thousands to millions).  If 
each message was contained within a row, this would be dead-Universe 
slow.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 13:45, Brad Knowles wrote:

   That said, storing meta-data in a real database and then using 
 external filesystem techniques for actually accessing the data, 
 should give you the best of both worlds -- the speed of access of the 
 database, and the reliability and well-understood access and backup 
 mechanisms of filesystems.

I'm strongly in favor of this kind of approach.  I don't know what the
best on-disk storage format is (although cycbuf sounds interesting), but
I'm pretty sure we want the raw messages stored as plain files on the
file system.  

We may even want both the encoded and decoded messages stored on the
file system -- at the very least, we should have attachments decoded and
stored in separate files.  Then we want metadata about the messages
stored in a database.  We should be able to regenerate or update the
metadata by trolling over the raw message storage, and we should be able
to vend messages from the message store via any number of protocols.

The message store should be a central component of Mailman, but it
should be defined by an interface in case we decide to change the
implementation of the message store.
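
A minimal Python sketch of the idea described above: a message store hidden behind an interface, with metadata that can be regenerated by walking the raw messages. The class and method names (`MessageStore`, `MemoryStore`, `rebuild_metadata`) are hypothetical and are not taken from any Mailman code; the in-memory backend stands in for whatever on-disk format is eventually chosen.

```python
from abc import ABC, abstractmethod
from email import message_from_bytes

class MessageStore(ABC):
    """Hypothetical storage interface; not an actual Mailman API."""

    @abstractmethod
    def add(self, raw):
        """Store one raw message (bytes) and return its key."""

    @abstractmethod
    def get(self, key):
        """Return the raw bytes stored under key."""

    @abstractmethod
    def keys(self):
        """Return an iterable of every stored key."""

class MemoryStore(MessageStore):
    """Toy backend; a file-based one would implement the same methods."""

    def __init__(self):
        self._messages = []

    def add(self, raw):
        self._messages.append(raw)
        return len(self._messages) - 1

    def get(self, key):
        return self._messages[key]

    def keys(self):
        return range(len(self._messages))

def rebuild_metadata(store):
    """Regenerate the metadata table from the raw messages alone."""
    metadata = {}
    for key in store.keys():
        msg = message_from_bytes(store.get(key))
        metadata[key] = {
            "message-id": msg["Message-ID"],
            "from": msg["From"],
            "subject": msg["Subject"],
        }
    return metadata

store = MemoryStore()
key = store.add(b"Message-ID: <1@example.org>\nFrom: a@example.org\n"
                b"Subject: test\n\nbody\n")
metadata = rebuild_metadata(store)
print(metadata[key]["subject"])
```

Because `rebuild_metadata` only uses the interface, swapping the backend does not change how the database is regenerated after a crash.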

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 10:27 PM -0500 2003/10/29, J C Lawrence wrote:

 Actually the two cases are considerably different.  In the delete case I
 have to do pool management, with some eye toward fragmentation control
 and optimisations of average latency for free heap searches, as well as
 heap integrity audits.  In the write-only case I just build on the end
 and need pay no mind to prior data once it is allocated.
	Not really.  You still have to maintain all the indexes, make 
sure that if things get moved around that all the links get updated, 
etc.  True, you don't have to worry about fragmentation control 
or other more complex aspects of heap management, but that's a 
further cost savings over other techniques and not a drawback to 
using this technique for this purpose.

	Now, if you want to consider what would happen to you if the 
Scientologists ever came after you, or if you had court orders to 
remove postings that linked to bomb-making instructions, you'd 
probably want to keep all those other tools related to heap 
management around anyway.  They'd be less likely to be used, but at 
least you wouldn't have to take the entire site down while you went 
and wrote the tools from scratch to handle a situation that you had 
not foreseen.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 14:38, Chuq Von Rospach wrote:

 Hint: look at what INN did when they implemented cycbufs.
 
 Effectively, you create 1-N files, or create files as needed. Each file 
 is N bytes long, pre-allocated on file creation. When you store 
 messages, they're written into the file sequentially (or any other way 
 you want. If you want to get into best fit allocations and turn this 
 into a malloc() style heap, be my guest).
 
 Metadata to access the info is then a filename, and an lseek() pointer 
 into the file, and # of bytes to read, plus your normal identifying 
 info. It's fast, it's efficient use of file pointers, it avoids the 
 worst aspects of the unix file system, and I'm amazed nobody ever 
 thinks to use it for other purposes (or that it took that long for 
 usenet people to discover it, I suggested a simpler variant of it back 
 in the 80s and was told inodes are our friends...)
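
The scheme quoted above (a pre-allocated file, sequential writes, and a pointer made of filename, offset, and byte count) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not INN's actual implementation: the `CycBuf` class and `fetch` function are invented names, there is no wrap-around or locking, and the write position lives only in memory.

```python
import os
import tempfile

class CycBuf:
    """One pre-allocated file; messages are written into it sequentially."""

    def __init__(self, path, size):
        self.path = path
        self.size = size
        self.next_offset = 0
        # Pre-allocate the file to its full length at creation time
        # (on many filesystems this produces a sparse file).
        with open(path, "wb") as f:
            f.truncate(size)

    def store(self, data):
        """Write data at the next free offset; return (path, offset, length)."""
        if self.next_offset + len(data) > self.size:
            raise OSError("buffer full; the caller should open the next file")
        with open(self.path, "r+b") as f:
            f.seek(self.next_offset)
            f.write(data)
        pointer = (self.path, self.next_offset, len(data))
        self.next_offset += len(data)
        return pointer

def fetch(pointer):
    """Read a message back using only the stored pointer."""
    path, offset, length = pointer
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

buf = CycBuf(os.path.join(tempfile.mkdtemp(), "buf0"), size=64)
first = buf.store(b"first message\n")
second = buf.store(b"second\n")
print(fetch(second))
```

The pointer tuple is what would go into the metadata database next to the usual identifying information.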


I'm not sure if Andrew Koenig is on this list, but he described an
algorithm he developed to quickly find messages in an mbox file.  If
he's here, maybe he can talk about it.

I really don't like mbox files, primarily because they require munging
From lines in the body of the message.  MMDF would be better, but I
think ideal from a philosophical point of view would be
one-message-per-file if it can be done efficiently cross-platform. 
Maybe file system experts here can provide pointers or advice on exactly
which file and operating systems make this approach feasible, even for
huge message counts.
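
For readers who have not run into the From-munging problem mentioned above, here is a small Python sketch of the classic mbox quoting rule: any body line that begins with "From " gets a ">" prefix so that it cannot be mistaken for a message separator. The function name `mbox_escape` is made up for this example, and it shows only the simplest variant of the rule.

```python
def mbox_escape(body):
    """Prefix every body line that starts with "From " with ">"."""
    lines = body.split("\n")
    return "\n".join(
        ">" + line if line.startswith("From ") else line
        for line in lines
    )

body = "Hi,\nFrom what I can tell it works.\nFrom: not a separator\n"
escaped = mbox_escape(body)
print(escaped)
```

The transformation is lossy: once stored, a line that was originally ">From ..." cannot be told apart from a munged "From ..." line, which is one reason one-message-per-file storage avoids the problem altogether.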

 you can even do expiration/purge/etc if you want, by moving stuff 
 around and changing the pointers.
 
 I've even thought of using it as the backing store for a picture 
 library. With a nice relational database and a series of these data 
 boxes, I think you can store data in the best and fastest possible 
 way...

It's a very interesting idea.

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 15:41, J C Lawrence wrote:

 Some years back I talked to Mike Belshe (used to be at Remarq) about
 their storage techniques (I caught him shortly after Critical Path
 bought Remarq).  Keying off other LISA papers they segmented their
 storage space by object size, customising and configuring each segment
 to suit (things like RAID stripe size, number of spindles, FS tuning
 parameters, etc).  He asserted that the rewards were very significant.
 
 However, these are very large archive problems and are a bit outside of
 Mailman's home turf.

Mailman's philosophy is, keep it as simple as possible to handle 80% of
the installations out there, but provide enough framework for the other
20% to extend for extreme uses.  Strategies to accomplish this include
defining interfaces to key components, and shipping something that works
out of the box and is good enough for most people.

It's not always easy, of course, to architect something that scales this
way.  I think we have a pretty good idea of the scaling problems with
Mailman 2, and I hope we can push the envelope significantly for Mailman
3.

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 10:47 PM -0500 2003/10/29, Barry Warsaw wrote:

 I'm not sure if Andrew Koenig is on this list, but he described an
 algorithm he developed to quickly find messages in an mbox file.  If
 he's here, maybe he can talk about it.
	7th edition mbox files are a pain.  There are other mailbox file 
formats that are much better and easier to parse (UW-IMAP .mbx being 
one).

 I really don't like mbox files, primarily because they require munging
 From lines in the body of the message.  MMDF would be better, but I
 think ideal from a philosophical point of view would be
 one-message-per-file if it can be done efficiently cross-platform.
	Therein lies the problem.  Some filesystems make this more 
feasible than others, at least on larger scale systems.

 Maybe file system experts here can provide pointers or advice on exactly
 which file and operating systems make this approach feasible, even for
 huge message counts.
	SGI's XFS on Irix does a pretty good job, with hashed directory 
structures, and an extent-based journaling filesystem.  Regretfully, 
I don't think that all of these features are fully supported under 
the Linux version of XFS, and that work has basically ground to a 
halt with the lay-offs of all the key SGI people who had been working 
on XFS.  Veritas VxFS also does a good job in this area.

	Other than SGI XFS for Irix and Veritas VxFS, I don't know of any 
good solutions to this problem at the filesystem level.

	Kirk McKusick and Eric Allman agree with you that this is a 
proper filesystem problem that should be solved at the filesystem 
level (at least, that's what they've said to me when I brought this 
issue up to them), and they feel you should not attempt to solve 
filesystem problems with tricks like INN timecaf/timehash cycbufs.

	However, while that's nice in theory, that doesn't necessarily 
help us here in the real world.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Thu, 30 Oct 2003 04:45:37 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:
 At 10:27 PM -0500 2003/10/29, J C Lawrence wrote:

 Actually the two cases are considerably different.  In the delete
 case I have to do pool management, with some eye toward fragmentation
 control and optimisations of average latency for free heap searches,
 as well as heap integrity audits.  In the write-only case I just
 build on the end and need pay no mind to prior data once it is
 allocated.

 Not really.  You still have to maintain all the indexes, make sure
 that if things get moved around that all the links get updated,
 etc.

With a write-once system you don't actually need to ever move anything.
At its core it is: Open one file, repetitively append to end until file
size exceeds size N, create new file, repeat.  You can do object size
clustering across files or other optimisation techniques, but the basic
pattern remains the same.  For the few cases you have to support delete
you either just NULL the byte stream for the pointed-to object, or you
invalidate the key.  As the frequency and number of such deletes is
infinitesimal, they require no special management complexity.  You can
afford to just swallow the lost free space as the cost of attempting to
manage it is simply never rewarded.
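
The write-once pattern described above (append to one file until it exceeds size N, start a new file, and handle the rare delete by invalidating the key) might look roughly like this in Python. It is a sketch under stated assumptions: `AppendStore` and its methods are invented names, the key index is an in-memory dict rather than a database, and a delete leaves the bytes on disk.

```python
import os
import tempfile

class AppendStore:
    """Append-only segment files; deletion just invalidates the key."""

    def __init__(self, directory, max_segment_size):
        self.directory = directory
        self.max_segment_size = max_segment_size
        self.segment = 0
        self.index = {}  # key -> (segment number, offset, length)

    def _segment_path(self, number):
        return os.path.join(self.directory, "seg%06d" % number)

    def add(self, key, data):
        path = self._segment_path(self.segment)
        # Roll over to a new file once the current one exceeds size N.
        if os.path.exists(path) and os.path.getsize(path) > self.max_segment_size:
            self.segment += 1
            path = self._segment_path(self.segment)
        with open(path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(data)
        self.index[key] = (self.segment, offset, len(data))

    def get(self, key):
        segment, offset, length = self.index[key]
        with open(self._segment_path(segment), "rb") as f:
            f.seek(offset)
            return f.read(length)

    def delete(self, key):
        # The bytes stay where they are; the lost space is never reclaimed.
        del self.index[key]

store = AppendStore(tempfile.mkdtemp(), max_segment_size=16)
store.add("a", b"0123456789")
store.add("b", b"0123456789")
store.add("c", b"xyz")
store.delete("a")
print(store.index)
```

Nothing is ever moved, so no pointer ever needs updating. Overwriting the deleted byte range with NULs, the other option mentioned above, would be a small addition to `delete`.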

 True, you don't have to worry about fragmentation control or other
 more complex aspects of heap management, but that's a further cost
 savings over other techniques and not a drawback to using this
 technique for this purpose.

True.  I'm not labelling it a drawback, just a boon of dubious advantage.

 Now, if you want to consider what would happen to you if the
 Scientologists ever came after you, or if you had court orders to
 remove postings that linked to bomb-making instructions, you'd
 probably want to keep all those other tools related to heap management
 around anyway.  

Not really.  The percentage of such deleted posts over the lifetime of
the store can be generally assumed to be less than 1 in 10^5, and is
probably considerably lower, if not in the 1:10^8 range.  Add a simple
invalid key semantic and you're done.

  Caveat: Continual addition and deletion of SPAM from an archive would
  change this balance.

 They'd be less likely to be used, but at least you wouldn't have to
 take the entire site down while you went and wrote the tools from
 scratch to handle a situation that you had not foreseen.

You're going to need tools when the percentage of such deleted postings
is sufficiently high that the cost of the lost free space and its
overhead exceeds the cost of managing that free space.  That's not a
quick thing.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 16:54, J C Lawrence wrote:

 Aye, picking the right interface abstractions is key.

Right on.

 There's also a disjoint between the novice SysAdm case who loves the
 fact of Mailman's all-in-one service, and the more meaty chap who
 integrates what he needs to.  Much of Mailman's appeal at the low end is
 its all-in-one simple-to-install nature.  (Well, ignoring the GID
 FAQ...)

Yep, and I really really want Mailman 3 to take this concept farther. 
Some things that I think will help include, using Twisted to eliminate
the /requirement/ of Apache integration and possibly the incoming mail
server integration, as well as implement a bulk mailer to eliminate the
need for an outgoing mail server.  Ideally, it will still be possible to
integrate with a Postfix for incoming and outgoing, but it shouldn't be
necessary to get up and running.

 Mailman v2.1 has a plugin layer for the membership roster.  It's not a
 fully mature interface, but there are LDAP and SQL adaptors in the wild.

This interface was largely bolted on, so it's clumsy.  Mailman 3 will be
defined by interfaces from the start.

 At some point those adaptors will move into the Mailman core.  If we
 move the archiving components (storage, presentation, index) behind
 plugin interfaces as well there's a reasonable opportunity for similar
 third parties to build adaptor layers which then also move into the
 Mailman core.
 
 Oh yeah, and just to keep Nigel Metheringham hopping:
 
   Mailman just doesn't have enough configuration options.

Heh.  That's another issue.  I'm sure Mailman 3 will grow many more
configuration options.  The trick is making them manageable (and mostly
ignorable -- i.e. the defaults Usually Work out of the box).

I've been experimenting with ideas for list styles which will make list
admins' lives easier, I think, without reducing the flexibility for
experts.

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
And since Barry's underlying philosophy is to minimize the number of 
things Mailman depends on, that sort of lets out depending on them 
having an OS with a high-performance journaling filesystem, no? 
(giggle)

On Oct 29, 2003, at 8:00 PM, Brad Knowles wrote:

	However, while that's nice in theory, that doesn't necessarily help 
us here in the real world.




Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 23:01:14 -0500 
Barry Warsaw [EMAIL PROTECTED] wrote:
 On Wed, 2003-10-29 at 16:54, J C Lawrence wrote:

 Aye, picking the right interface abstractions is key.

 Right on.

I'm still debating if I can run down there on the 8th.  I'd love to go
to EuroQuest, but I also really need to be in Providence on the 7th, and
back at work on the 9th.  Aaaarrrgh.  _IF_ I can make it we must go hit
a pub with whiteboards in hand.

Sorry for no earlier reply on this BTW, I'm in drowning eyeballs mode.

 ...as well as implement a bulk mailer to eliminate the need for an
 outgoing mail server.  

Eeeek!  I trust this would be for immediate handoff to a real MTA
versus handling final delivery directly?  Quite the Pandora's box if
not.

 Mailman v2.1 has a plugin layer for the membership roster.  It's not a
 fully mature interface, but there are LDAP and SQL adaptors in the
 wild.

 This interface was largely bolted on, so it's clumsy.  Mailman 3 will
 be defined by interfaces from the start.

<nod>

BTW Whatever happened to Michel Pelletier's interfaces PEP?  I see the
draft, and I see signs that something got done, but not what...

 Oh yeah, and just to keep Nigel Metheringham hopping:
 
 Mailman just doesn't have enough configuration options.

 Heh.  That's another issue.  

Last I heard Nigel was still running screaming into the hills.

 I'm sure Mailman 3 will grow many more configuration options.  The
 trick is making them manageable (and mostly ignorable -- i.e. the
 defaults Usually Work out of the box).

<nod>

 I've been experimenting with ideas for list styles which will make
 list admins' lives easier I think, without reducing the flexibility for
 experts.

Aye, that's something the Plone folk have been digging at with some
success: a base library of waffle-stomp configuration patterns.  I'm not
sure for Mailman if we want just a picklist, or a very simple wizard.

I suspect something more akin to the very brief QA wizard at Creative
Commons for choosing a license type may be more effective and
interesting than a picklist:

  http://creativecommons.org/license/

Very simple, very general, covers the basic cases, hides all the ugly
stuff and picks sane defaults.  It becomes even more interesting if site
admins can tailor the configs for the basic cases.
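
One way to picture list styles with sane defaults and site-level tailoring is as layered dictionaries, sketched below in Python. Everything here is hypothetical: the option names, the style names, and the `list_config` function are invented for illustration and do not correspond to real Mailman settings.

```python
# Hypothetical option names and defaults, for illustration only.
DEFAULTS = {"archive": True, "moderated": False, "reply_to_list": False}

# Each style overrides only the options that make it different.
STYLES = {
    "announce": {"moderated": True},
    "discussion": {"reply_to_list": True},
}

def list_config(style, site_overrides=None, list_overrides=None):
    """Layer defaults, the chosen style, site tailoring, then per-list options."""
    config = dict(DEFAULTS)
    config.update(STYLES[style])
    config.update((site_overrides or {}).get(style, {}))
    config.update(list_overrides or {})
    return config

plain = list_config("announce")
tailored = list_config("announce", site_overrides={"announce": {"archive": False}})
print(plain)
print(tailored)
```

A wizard or a picklist would only need to choose the `style` argument; experts can still reach every option through the per-list overrides.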
  
-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 16:14, Brad Knowles wrote:

   One key factor here is that all of the information in the 
 database should be able to be re-created from the message bodies 
 alone, if there should happen to be a catastrophic system crash.

Just to be dense, let me ask for clarification: by message body you
mean the entire original message, as received on the wire, not just the
message payload (i.e. sans RFC 2822 headers).  If so, I agree
completely.

But I also think the decoded message should be stored on the file system
somehow as well.  I.e. decode attachments and store them as separate
files too.
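
Decoding attachments into separate files, as suggested above, can be sketched with Python's standard `email` package. The function name `save_attachments` and the fallback file naming are assumptions made for this example; a real archiver would also need to deal with name collisions and with nested message/rfc822 parts, which this sketch simply skips.

```python
import os
import tempfile
from email import message_from_bytes
from email.message import EmailMessage

def save_attachments(raw, directory):
    """Decode each attachment in a raw message and write it to its own file."""
    msg = message_from_bytes(raw)
    saved = []
    for number, part in enumerate(msg.walk()):
        if part.get_content_disposition() != "attachment":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:  # multipart container, nothing to decode
            continue
        # Never trust the sender's filename as a path; keep only its basename.
        name = os.path.basename(part.get_filename() or "") or "part%d.bin" % number
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(payload)
        saved.append(path)
    return saved

# Build a small test message with one binary attachment.
demo = EmailMessage()
demo["Subject"] = "patch attached"
demo.set_content("see attachment")
demo.add_attachment(b"\x00\x01binary", maintype="application",
                    subtype="octet-stream", filename="patch.bin")
saved = save_attachments(demo.as_bytes(), tempfile.mkdtemp())
print(saved)
```

The raw message stays untouched in the store; the decoded files are a derived copy that can be regenerated at any time.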

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 11:01 PM -0500 2003/10/29, J C Lawrence wrote:

 With a write-once system you don't actually need to ever move anything.
	Depends on how you manage the storage of those large files.  If 
you have an infinitely large filesystem that is guaranteed 100% 
reliable in all possible circumstances, you're right.  Otherwise, you 
might find that the filesystem is getting full and things need to be 
moved around, or you suffer a disk or storage system crash and you 
have to restore from backups, or you use an HSM solution to move 
older files to slower/higher capacity storage, or you have issues 
with too many large files in a single directory and need to implement 
your own directory hashing scheme, etc
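
A directory hashing scheme of the kind mentioned above can be as simple as deriving nested directory names from a hash of the Message-ID, so that no single directory grows too large. The Python sketch below is one possible layout, not something taken from the thread; the `hashed_path` name, the use of SHA-1, and the two-level fan-out are all assumptions.

```python
import hashlib
import os

def hashed_path(root, message_id, levels=2):
    """Spread files over nested directories named after hash prefixes."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    # Two hex characters per level gives 256 subdirectories at each level.
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return os.path.join(root, *parts, digest)

path = hashed_path("/archive", "<abc@example.org>")
print(path)
```

Because the path depends only on the Message-ID, it stays the same if the archive is rebuilt or restored from backup.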

 Not really.  The percentage of such deleted posts over the lifetime of
 the store can be generally assumed to be less than 1 in 10^5, and is
 probably considerably lower, if not in the 1:10^8 range.  Add a simple
 invalid key semantic and you're done.
	It depends on whether or not the court order allows you to just 
mark things as deleted and be done with it.  If they force you to 
actually expunge all copies of that data from your systems, you will 
have to do more work.

 You're going to need tools when the percentage of such deleted postings
 is sufficiently high that the cost of the lost free space and its
 overhead exceeds the cost of managing that free space.  That's not a
 quick thing.
	True enough, but as you've pointed out, there have been a number 
of implementations of this sort of solution, and you've worked on at 
least a couple yourself.  These sorts of tools should already be 
reasonably well understood and not too difficult to write or borrow 
from other sources.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 11:01 PM -0500 2003/10/29, Barry Warsaw wrote:

 Yep, and I really really want Mailman 3 to take this concept farther.
 Some things that I think will help include, using Twisted to eliminate
 the /requirement/ of Apache integration and possibly the incoming mail
 server integration, as well as implement a bulk mailer to eliminate the
 need for an outgoing mail server.
	There, I have to disagree.  Both the web server and the mail 
server issues are complex enough that I don't believe it would be a 
good idea to try and re-invent this wheel.  There are already enough 
bad web server and mail server implementations out there -- we don't 
need to make this situation worse.

	There may be some mailing-list specific issues that we can (and 
should) handle better inside mailman before we hand these things off 
to the other servers, but both Apache and postfix/sendmail/exim have 
enough experience and world-wide testing behind them to make it 
little else than folly resulting from hubris to try and replace them.

	There's just no substitute for having hundreds of millions of 
people world-wide pounding on these things day-in and day-out 365 
days a year.

	Components like this should be scheduled for replacement if, and 
only if, you can demonstrate beyond a reasonable doubt that there are 
inherent problems that are insurmountable otherwise, and there is no 
feasible alternative.

	You don't just take a Tom Mix pocket knife and cut open your own 
chest and remove your heart, to replace it with a mechanical pump 
that you designed yourself out of a tin can, a turkey baster, some 
baling wire, and some garden hose.

	If you absolutely require a heart transplant and there are no 
human alternatives, you get a world-respected heart surgeon to 
perform the operation using the latest techniques and the Jarvik 7 
(or whatever).  And then you get everyone in your family, all your 
friends, all your neighbors, all your church members, and hopefully 
all religious people world-wide to pray for you.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)
___
Mailman-Developers mailing list
[EMAIL PROTECTED]
http://mail.python.org/mailman/listinfo/mailman-developers


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 22:06, Chuq Von Rospach wrote:

 It's just so much less hassle on any number of levels dealing with 50 
 100 megabyte files than it is a directory structure with 500 megabytes 
 of messages spread around 100,000 individual files. whether it's 
 backups and restores, migrating data to a new server, etc, etc etc, you 
 make life much simpler. And god help you if you're updating that 
 structure when the system crashes and you have to fsck and put it back 
 together again.

We should just throw everything into a ZODB FileStorage Data.fs file,
and let it grow to gigs in size <1/2 wink>.

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 11:18 PM -0500 2003/10/29, Barry Warsaw wrote:

 Just to be dense, let me ask for clarification: by message body you
 mean the entire original message, as received on the wire, not just the
 message payload (i.e. sans RFC 2822 headers).  If so, I agree
 completely.
	Yes, you are correct.  At issue is that there might be some 
headers which some users might wish to search on (or maybe just see) 
which might not be put into one or more of the fields, and you don't 
want to take the risk of losing those by assuming that you can always 
re-generate all the headers from what you've stored inside the 
database.
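
A minimal Python sketch of that split, assuming a plain dict as the store and an arbitrary choice of indexed header names (both are illustrative, not anything Mailman does today). The wire-format bytes are kept untouched and the searchable fields are only derived from them, so a header that was never indexed can still be recovered later:

```python
from email import message_from_bytes

def index_fields(raw, wanted=("Message-ID", "From", "Subject", "Date")):
    """Pull a few searchable headers out of the raw message.

    The choice of header names is an assumption for this sketch.
    """
    msg = message_from_bytes(raw)
    return {name: msg.get(name) for name in wanted}

def archive(store, raw):
    """Keep the untouched wire-format bytes next to the derived fields."""
    fields = index_fields(raw)
    key = fields["Message-ID"]
    store[key] = {"raw": raw, "fields": fields}
    return key

raw = (b"Message-ID: <abc@example.org>\n"
       b"From: someone@example.org\n"
       b"Subject: test\n"
       b"X-Archived-At: http://example.org/mid/abc\n"
       b"\n"
       b"body text\n")
store = {}
key = archive(store, raw)

# A header that was never indexed is still recoverable from the raw copy.
recovered = message_from_bytes(store[key]["raw"])["X-Archived-At"]
```

Nothing here depends on the store being a dict; the same shape works with a database row holding one blob column and a handful of indexed columns.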

 But I also think the decoded message should be stored on the file system
 somehow as well.  I.e. decode attachments and store them as separate
 files too.
	My experience is that this is a bad idea.  However, if the 
implementation is fully modularized at the API level, then we can 
always rip out the mailman solution and instead put in something that 
actually works.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Peter C. Norton
On Thu, Oct 30, 2003 at 05:00:48AM +0100, Brad Knowles wrote:
   SGI's XFS on Irix does a pretty good job, with hashed directory
 structures, and an extent-based journaling filesystem.  Regretfully, 
 I don't think that all of these features are fully supported under 
 the Linux version of XFS, and that work has basically ground to a 
 halt with the lay-offs of all the key SGI people who had been working 
 on XFS.  Veritas VxFS also does a good job in this area.

[ A cursory google search indicates that hashed dirs, extents, and
journalling are all in linux xfs.  I can't imagine an unsupported
feature making its way into the filesystem that SGI is putting on its
latest and greatest systems, but if you know about this, please share ]

In the case of a one-file-per-message approach, my experience with
vxfs is that it creates a rather slow filesystem when you get your
filesystem to the point of having a few hundred thousand small
files (lots of wasted space in the extents and I believe, though I may
be wrong, that there were lots of metadata lookups through multiple
layers of indirections slowing things down).  

However reiserfs was built to handle a mix of lots of small files, ala
maildir or mh spools.  

I'm not too current on current BSD goings-on, but I'd bet that ffs2
has something to offer in this arena, too, since it looks like it
almost does extent-based allocation now.

   Kirk McKusick and Eric Allman agree with you that this is a 
 proper filesystem problem that should be solved at the filesystem 
 level (at least, that's what they've said to me when I brought this 
 issue up to them), and they feel you should not attempt to solve 
 filesystem problems with tricks like INN timecaf/timehash cycbufs.

Err... then to relate this to a prior post, why not just use maildirs
on filesystems that are engineered to handle that sort of thing?
 
   However, while that's nice in theory, that doesn't necessarily 
 help us here in the real world.

Unless you are using a filesystem that works for this, right?  Like
xfs, vxfs, reiserfs, and probably ffs2.  I believe that linux's ext3
has support for hashing directories (or soon will - I don't precisely
know as I've been focusing on other things)

-Peter

-- 
The 5 year plan:
In five years we'll make up another plan.
Or just re-use this one.




Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
On Oct 29, 2003, at 8:26 PM, Brad Knowles wrote:

	There may be some mailing-list specific issues that we can (and 
should) handle better inside mailman before we hand these things off 
to the other servers, but both Apache and postfix/sendmail/exim have 
enough experience and world-wide testing behind them to make it little 
else than folly resulting from hubris to try and replace them.

+1

I've experimented with direct-out-the-pipe delivery systems. Trust me, 
you don't want to go there. It's not trivial. Well, it's trivial for 
90% of the world that follows the RFCs and behaves as expected and has 
the right DNS setups and isn't trying to outsmart spammers by being 
stupid. and you'll spend the other 90% of your time trying to build 
compatibility in with the other 10%.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
On Oct 29, 2003, at 8:27 PM, Barry Warsaw wrote:

 We should just throw everything into a ZODB FileStorage Data.fs file,
 and let it grow to gigs in size <1/2 wink>.

<troll>
until you have to split it across two disks because one is full.
and don't forget, a single monolithic storage file gets backed up fully 
every time you change it. The guy in charge of buying tapes to back up 
your system just screamed in agony, since there's no possibility of an 
incremental backup for what is 99.999% static data.

</troll>



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
And windows? And older hardware? Solaris 8? Hell, solaris 6 and 7?

You going to depend on people only running year-old-or-less hardware 
and OS?

On Oct 29, 2003, at 8:35 PM, Peter C. Norton wrote:

I'm not too current on current BSD goings-on, but I'd bet that ffs2
has something to offer in this arena, too, since it looks like it
almost does extent-based allocation now.




Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 23:17, J C Lawrence wrote:

 I'm still debating if I can run down there on the 8th.  I'd love to go
 to EuroQuest, but I also really need to be in Providence on the 7th, and
 back at work on the 9th.  Aaaarrrgh.  _IF_ I can make it we must go hit
 a pub with whiteboards in hand.

Sounds great.  Bring a laptop and we'll bang out some code (anyone else
up for a mini-Mailman-3 sprint at my house? :).  I'll probably be
heading to Fedex Field on the 9th for a 'Skins game, so the 8th would be
perfect.

  ...as well as implement a bulk mailer to eliminate the need for an
  outgoing mail server.  
 
 Eeeek!  I trust this would be for immediate handoff to a real MTA
 versus handling final delivery directly?  Quite the Pandora's box if
 not.

Yep, which makes me nervous, but which does have a certain
standalone-ability appeal.  I don't want to write it off, and of course,
we'll have an interface for this so the first (only?) implementation
will be MTA hand-off.

 BTW Whatever happened to Michel Pelletier's interfaces PEP?  I see the
 draft, and I see signs that something got done, but not what...

Dead in the water AFAIK.  But there are lots of folks using a more
formal interface system for Python applications, such as for Zope3. 
Just writing the interface down, with good docstrings, goes a long way.
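
To make that concrete, here is a rough sketch of what writing the interface down could look like. The names Delivery and MTAHandoff and the smtp_factory parameter are made up for illustration and are not Mailman code; only smtplib.SMTP and its sendmail() method are real:

```python
from abc import ABC, abstractmethod
import smtplib

class Delivery(ABC):
    """Interface for outbound delivery (hypothetical name)."""

    @abstractmethod
    def deliver(self, sender, recipients, message_bytes):
        """Send message_bytes from sender to every address in recipients.

        Returns a dict mapping each refused recipient to (code, reason);
        an empty dict means every recipient was accepted.
        """

class MTAHandoff(Delivery):
    """Hand everything to a real MTA over SMTP and let it do final delivery."""

    def __init__(self, host="localhost", port=25, smtp_factory=smtplib.SMTP):
        # smtp_factory is injectable so the class can be exercised without
        # a running mail server.
        self.host = host
        self.port = port
        self.smtp_factory = smtp_factory

    def deliver(self, sender, recipients, message_bytes):
        with self.smtp_factory(self.host, self.port) as conn:
            return conn.sendmail(sender, recipients, message_bytes)
```

A direct-delivery implementation, if anyone ever wanted one, would be a second subclass; callers would only ever see deliver().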

 Last I heard Nigel was still running screaming into the hills.

Hey, I love Exim -- Greg's done some very cool stuff with it on
mail.{python,zope}.org.  But man, I find it hard to track down just the
right knob I need to tweak. :)

 Aye, that's something the Plone folk have been digging at with some
 success: a base library of waffle-stomp configuration patterns.  I'm not
 sure for Mailman if we want just a picklist, or a very simple wizard.

I haven't even thought about how to surface it in the u/i -- it's mostly
machinery right now.  But yeah, a wizard is just the ticket, at least
for canned styles (which again, will solve 80% of the problem).

Which reminds me -- I'm really hoping we can get some web u/i jockeys
and CSS geeks in to eventually make things real purty.  Dammit Jim, I'm
a musician, not a graphic artist. :)

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Thu, 30 Oct 2003 05:15:58 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:
 At 11:01 PM -0500 2003/10/29, J C Lawrence wrote:

 With a write-once system you don't actually need to ever move
 anything.

 Depends on how you manage the storage of those large files.  If you
 have an infinitely large filesystem that is guaranteed 100% reliable
 in all possible circumstances, you're right.  Otherwise, you might
 find that the filesystem is getting full and things need to be moved
 around, or you suffer a disk or storage system crash and you have to
 restore from backups, or you use an HSM solution to move older files
 to slower/higher capacity storage, or you have issues with too many
 large files in a single directory and need to implement your own
 directory hashing scheme, etc

True, but most of those really end up being a meta-indexing problem.
You have many big files.  You have indexes which point into those many
big files.  Occasionally you move those big files about, so your
meta-indexes need to be changed to point to the new locations of the big
files, but the same offsets within the big files...
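
A small Python sketch of the two-level arrangement, with hypothetical names (Pointer, Store, relocate) and a scratch file standing in for a big file. Message pointers name a logical blob plus an offset and length; only the blob-to-path table is touched when a big file moves:

```python
import os
import shutil
import tempfile
from dataclasses import dataclass

@dataclass(frozen=True)
class Pointer:
    blob_id: str   # logical name of the big file, not its path
    offset: int
    length: int

class Store:
    def __init__(self):
        self.blob_paths = {}   # meta-index: blob_id -> current path
        self.messages = {}     # index: message key -> Pointer

    def read(self, key):
        ptr = self.messages[key]
        with open(self.blob_paths[ptr.blob_id], "rb") as handle:
            handle.seek(ptr.offset)
            return handle.read(ptr.length)

    def relocate(self, blob_id, new_path):
        """Record that a big file has moved; no message pointer changes."""
        self.blob_paths[blob_id] = new_path

tmp = tempfile.mkdtemp()
old_path = os.path.join(tmp, "blob-0001")
with open(old_path, "wb") as f:
    f.write(b"first message\nsecond message\n")

store = Store()
store.blob_paths["blob-0001"] = old_path
store.messages["msg-2"] = Pointer("blob-0001", 14, 15)
before = store.read("msg-2")

# Move the big file, then fix up a single meta-index entry.
new_path = os.path.join(tmp, "moved-blob-0001")
shutil.move(old_path, new_path)
store.relocate("blob-0001", new_path)
after = store.read("msg-2")
```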

It's really not an expensive or difficult space.

If you really need to move individual messages about between file blobs
at a respectable rate, then you're in another world of pain, but we
don't have any evidence of that requirement, or that such a requirement
can't be handled by simply unrolling the big file and respooling the
individual messages onto the ends of other big files in different
locations.
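
Roughly what unrolling and respooling could look like, as a sketch only. The index layout (key -> (path, offset, length)) and the function name respool are assumptions for the example:

```python
import os
import tempfile

def respool(index, src_path, dst_path):
    """Append every message stored in src_path to the end of dst_path.

    index maps key -> (path, offset, length) and is updated in place.
    The source file is removed once it has been fully unrolled.
    """
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        for key, (path, offset, length) in list(index.items()):
            if path != src_path:
                continue
            src.seek(offset)
            data = src.read(length)
            new_offset = dst.seek(0, os.SEEK_END)  # current end of target
            dst.write(data)
            index[key] = (dst_path, new_offset, length)
    os.remove(src_path)

tmp = tempfile.mkdtemp()
old_blob = os.path.join(tmp, "old.blob")
new_blob = os.path.join(tmp, "new.blob")
with open(old_blob, "wb") as f:
    f.write(b"AAAABBB")
with open(new_blob, "wb") as f:
    f.write(b"zz")

index = {"m1": (old_blob, 0, 4), "m2": (old_blob, 4, 3), "m3": (new_blob, 0, 2)}
respool(index, old_blob, new_blob)
```

Existing entries in the target file keep their offsets, since data is only ever added at the end.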

 Not really.  The percentage of such deleted posts over the lifetime
 of the store can be generally assumed to be less than 1 in 10^5, and
 is probably considerably lower, if not in the 1:10^8 range.  Add a
 simple invalid key semantic and you're done.

 It depends on whether or not the court order allows you to just mark
 things as deleted and be done with it.  If they force you to
 actually expunge all copies of that data from your systems, you will
 have to do more work.

Ahem.

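  # get_message_big_file() and key.invalidate() stand for the store's
  # own index lookups.
  for key in list_of_bad_message_keys:
      big_file, offset, length = get_message_big_file(key)
      # Open read/write in binary mode so the bytes are patched in place.
      with open(big_file, 'r+b') as handle:
          handle.seek(offset)
          # Overwrite the message with the same number of blanks.
          handle.write(b' ' * length)
      key.invalidate()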

Not a whole lot more complexity.  You're just invalidating the
pointed-to data as well as the key.  You're still not doing free space
management.
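
For anyone who wants to try this against a scratch file, here is a self-contained variant with a stand-in index. The Index class and its invalid set are hypothetical and only illustrate the invalid-key semantic, not any real store:

```python
import os
import tempfile

class Index:
    def __init__(self):
        self.entries = {}      # key -> (path, offset, length)
        self.invalid = set()   # keys that must never be served again

    def expunge(self, key):
        path, offset, length = self.entries[key]
        with open(path, "r+b") as handle:
            handle.seek(offset)
            # Same length as the original, so every other offset stays valid.
            handle.write(b" " * length)
        self.invalid.add(key)

    def read(self, key):
        if key in self.invalid:
            return None
        path, offset, length = self.entries[key]
        with open(path, "rb") as handle:
            handle.seek(offset)
            return handle.read(length)

fd, blob = tempfile.mkstemp()
os.write(fd, b"keep-me|delete-me|keep-too")
os.close(fd)

index = Index()
index.entries = {"a": (blob, 0, 7), "b": (blob, 8, 9), "c": (blob, 18, 8)}
index.expunge("b")
```

The file size never changes and no neighbouring entry moves, which is the whole point of not doing free space management.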

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 23:27:46 -0500 
Barry Warsaw [EMAIL PROTECTED] wrote:
 On Wed, 2003-10-29 at 22:06, Chuq Von Rospach wrote:

 We should just throw everything into a ZODB FileStorage Data.fs file,
 and let it grow to gigs in size <1/2 wink>.

There are good reasons I use DirectoryStorage:

  $ find /var/lib/zope/instance/default/var/Data_fs_dir -type f | wc -l
499266

Lotsa little teensy files!

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 23:26, Brad Knowles wrote:

   There, I have to disagree.  Both the web server and the mail 
 server issues are complex enough that I don't believe it would be a 
 good idea to try and re-invent this wheel.  There are already enough 
 bad web server and mail server implementations out there -- we don't 
 need to make this situation worse.

Let's not discount the integration problems, which are a huge headache
for newbies.  I'm fairly certain that Twisted is the right approach for
surfacing the web u/i to Mailman.  The requirements are not overwhelming
and fronting Mailman's u/i with Apache really doesn't buy us that much. 
We all agree that CGI sucks, and we could make that better with
mod_python or some other such glue, but why go to the trouble?

Relying on Twisted for the incoming mail protocols is something I'm less
certain about, although there is a lot of appeal to this approach.  We
could throw lots of smarts into a Python port-25 listener, including global
spam fighting and bounce processing.  An approach like Exim + elspy
affords some really cool possibilities.  A bigger negative is that
there's less precedent for proxying smtpd than there is for httpd, so
it's harder to fit Mailman into the mix with an existing mail server.

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 23:36, Chuq Von Rospach wrote:

 I've experimented with direct-out-the-pipe delivery systems. Trust me, 
 you don't want to go there. It's not trivial. Well, it's trivial for 
 90% of the world that follows the RFCs and behaves as expected and has 
 the right DNS setups and isn't trying to outsmart spammers by being 
 stupid. and you'll spend the other 90% of your time trying to build 
 compatibility in with the other 10%.

Chuq, do you think it would be feasible for Mailman to try to handle
that 90% itself, and then only hand-off to a Real MTA when it runs into
trouble with the other 10% -- assuming it could know when it runs into
trouble.

Also, there's incoming SMTP and outgoing SMTP.  It may be possible to
build in support for one direction without providing the other.  (It
also may not be worth it.)

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 20:37:49 -0800 
Chuq Von Rospach [EMAIL PROTECTED] wrote:
 On Oct 29, 2003, at 8:27 PM, Barry Warsaw wrote:

 and don't forget, a single monolithic storage file gets backed up
 fully every time you change it. The guy in charge of buying tapes to
 back up your system just screamed in agony, since there's no
 possibility of an incremental backup for what is 99.999% static
 data.

Ha!  So just why do you think I moved off FileStorage for Data.fs?

That said there's some value in getting a versioning data store with
rollback support for list configs.  The data volume isn't huge, but it
is highly sensitive.  I'd also like to see flat text logging of all
configuration changes in addition to moderation activity.  It would save
the help and support desks a lot of hurt.
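
As a sketch of how small that could be, assuming list configs are plain dicts and the log is any writable text file. VersionedConfig and its method names are invented for the example:

```python
import copy
import io
import time

class VersionedConfig:
    """Keep every version of a list config and log each change as flat text."""

    def __init__(self, initial, log):
        self.versions = [copy.deepcopy(initial)]
        self.log = log   # any writable text file object

    @property
    def current(self):
        return self.versions[-1]

    def change(self, who, key, value):
        new = copy.deepcopy(self.current)
        old = new.get(key)
        new[key] = value
        self.versions.append(new)
        self._record(who, f"set {key}: {old!r} -> {value!r}")

    def rollback(self, who, version):
        """Re-apply an earlier version as a new one, keeping the history."""
        self.versions.append(copy.deepcopy(self.versions[version]))
        self._record(who, f"rollback to version {version}")

    def _record(self, who, what):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        self.log.write(f"{stamp} {who} {what}\n")

log = io.StringIO()
cfg = VersionedConfig({"max_message_size": 40}, log)
cfg.change("admin@example.org", "max_message_size", 0)
cfg.rollback("admin@example.org", 0)
```

Rolling back by appending a new version, instead of truncating history, means the log and the version list always agree about what happened.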

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 23:40, Chuq Von Rospach wrote:
 And windows? 

Hey, ignoring Windows has been a successful strategy so far, why stop
now?  Plus, Longhorn will save us all, right?  Oh, and Everything Will
Be Faster Next Year Anyway.  <wink>

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 8:35 PM -0800 2003/10/29, Peter C. Norton wrote:

 [ A cursory google search indicates that hashed dirs, extents, and
 journalling are all in linux xfs.  I can't imagine an unsupported
 feature making its way into the filesystem that SGI is putting on its
 latest and greatest systems, but if you know about this, please share ]
	My understanding is that the port of XFS to Linux was only about 
70% done at the time the critical software engineers were laid off by 
SGI, and that no further work in this area has been done.  Maybe the 
features are supposedly there but incomplete.

 However reiserfs was built to handle a mix of lots of small files, ala
 maildir or mh spools.
	I'm sorry, I don't trust ReiserFS at all.  I'd trust XFS if it 
was on Irix, or IBM's JFS, but not ReiserFS.  Hell, on a Linux system,
I'd use ext2fs before I'd use Reiser.

 I'm not too current on current BSD goings-on, but I'd bet that ffs2
 has something to offer in this arena, too, since it looks like it
 almost does extent-based allocation now.
	No, not yet.  There are improvements in the areas of handling 
synchronous meta-data updates, background fsck, etc... but nothing 
like extent-based filesystems or integrated hashed directory schemes, 
etc

 Err... then to relate this to a prior post, why not just use maildirs
 on filesystems that are engineered to handle that sort of thing?
	Because we can't guarantee that everyone (or anyone) would be 
willing/able to use the selected filesystems that we have blessed? 
You think requiring everyone to install PostgreSQL would be bad, do 
you really want to try to force them all to use ReiserFS on Linux as 
their only supported option?
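
One filesystem-independent option is the kind of application-level directory hashing mentioned elsewhere in this thread. A rough sketch, assuming one file per message keyed by Message-ID; the function names, the SHA-1 choice and the two-level layout are all illustrative:

```python
import hashlib
import os
import tempfile

def message_path(root, message_id, levels=2):
    """Map a Message-ID to root/xx/yy/<sha1 hex>.

    Two hex characters per level spreads files over 256**levels directories.
    """
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return os.path.join(root, *parts, digest)

def store_message(root, message_id, raw):
    path = message_path(root, message_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(raw)
    return path

root = tempfile.mkdtemp()
path = store_message(root, "<abc@example.org>", b"Subject: test\n\nbody\n")
```

With two levels, even a few hundred thousand messages leave only a handful of files per leaf directory, so nothing depends on the filesystem hashing its directories.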

 Unless you are using a filesystem that works for this, right?  Like
 xfs, vxfs, reiserfs, and probably ffs2.  I believe that linux's ext3
 has support for hashing directories (or soon will - I don't precisely
 know as I've been focusing on other things)
	My understanding is that ext3fs is dead.  The work that Stephen 
Tweedie had been doing stopped long ago, and even then it was only a 
minor tweak over ext2fs.  I don't believe that this work has been 
picked up again or extended to include other features.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
On Oct 29, 2003, at 8:53 PM, Barry Warsaw wrote:

 Chuq, do you think it would be feasible for Mailman to try to handle
 that 90% itself, and then only hand-off to a Real MTA when it runs into
 trouble with the other 10% -- assuming it could know when it runs into
 trouble.

I think you have enough on your plate to not re-invent what others have
already done pretty well. When you run out of features to implement,
then think about this. Not until.





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 23:37, Chuq Von Rospach wrote:

 <troll>
 until you have to split it across two disks because one is full.
 
 and don't forget, a single monolithic storage file gets backed up fully 
 every time you change it. The guy in charge of buying tapes to back up 
 your system just screamed in agony, since there's no possibility of an 
 incremental backup for what is 99.999% static data.

 </troll>

Actually, newer versions of ZODB have a script called repozo.py which
makes incremental backups feasible.  It knows a lot about FileStorage's
formats.  Also note that there are alternative storage implementations
such as BerkeleyDB-based storage (slow, but presumably more reliable)
and the 3rd party DirectoryStorage.
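
The reason incremental backups are feasible at all is that a file which only grows at the end can be backed up by copying just the new tail. The sketch below shows that idea only; it is not repozo's algorithm or on-disk format, and the function and file names are made up:

```python
import os
import tempfile

def incremental_backup(source, backup_dir, state):
    """Copy only the bytes appended to source since the previous call.

    state is a dict remembering the offset already backed up.
    Returns the path of the new increment, or None if nothing was added.
    """
    start = state.get("offset", 0)
    size = os.path.getsize(source)
    if size <= start:
        return None
    name = f"increment-{start:012d}-{size:012d}"
    increment = os.path.join(backup_dir, name)
    with open(source, "rb") as src, open(increment, "wb") as dst:
        src.seek(start)
        dst.write(src.read(size - start))  # fine for a sketch; stream in real use
    state["offset"] = size
    return increment

tmp = tempfile.mkdtemp()
data_file = os.path.join(tmp, "Data.fs")   # stand-in, not a real FileStorage
backups = os.path.join(tmp, "backups")
os.mkdir(backups)
state = {}

with open(data_file, "wb") as f:
    f.write(b"txn1;")
first = incremental_backup(data_file, backups, state)
with open(data_file, "ab") as f:
    f.write(b"txn2;")
second = incremental_backup(data_file, backups, state)
third = incremental_backup(data_file, backups, state)

# Restoring is concatenating the increments in order.
restored = b""
for name in sorted(os.listdir(backups)):
    with open(os.path.join(backups, name), "rb") as f:
        restored += f.read()
```

Anything that rewrites the file in place or shrinks it breaks the assumption, and the next backup would then have to be a full one.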

We'll talk about databases in another thread.  I have my own biases, but
I'm too tired now to get into it.

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 23:55, J C Lawrence wrote:

 That said there's some value in getting a versioning data store with
 rollback support for list configs.  

+1

 The data volume isn't huge, but it
 is highly sensitive.  I'd also like to see flat text logging of all
 configuration changes in addition to moderation activity.

+1

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Barry Warsaw
On Wed, 2003-10-29 at 23:46, J C Lawrence wrote:

 There are good reasons I use DirectoryStorage:
 
   $ find /var/lib/zope/instance/default/var/Data_fs_dir -type f | wc -l
 499266
 
 Lotsa little teensy files!

:)

-Barry





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 23:50:22 -0500 
Barry Warsaw [EMAIL PROTECTED] wrote:
 On Wed, 2003-10-29 at 23:26, Brad Knowles wrote:

 There, I have to disagree.  Both the web server and the mail server
 issues are complex enough that I don't believe it would be a good
 idea to try and re-invent this wheel.  There are already enough bad
 web server and mail server implementations out there -- we don't need
 to make this situation worse.

 Let's not discount the integration problems, which are a huge headache
 for newbies.  

I thought the prevalence of canned Mailman packages was doing a lot
there?  I haven't watched the -users list in a while.

 I'm fairly certain that Twisted is the right approach for surfacing
 the web u/i to Mailman.  The requirements are not overwhelming and
 fronting Mailman's u/i with Apache really doesn't buy us that much. 

Hang-on.  Apache isn't the target.  Mailman's UI is a CGI app.  As such
it works with any web server that supports CGI-bin, which pretty much
means any web server with no exceptions.  That's a pretty large gain,
especially in the novice admin or simple deployment case territory.
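
For reference, the whole contract a CGI program has to meet is: read the request from environment variables, write headers, a blank line and a body to stdout. A toy sketch (the "list" query parameter and the page content are invented; this is not Mailman's CGI code):

```python
import os
import sys
from html import escape
from urllib.parse import parse_qs

def render(environ):
    """Build a complete CGI response from the request environment."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    listname = query.get("list", ["(none)"])[0]
    body = f"<html><body><h1>List: {escape(listname)}</h1></body></html>\n"
    # Headers, then a blank line, then the document.
    return "Content-Type: text/html; charset=utf-8\n\n" + body

if __name__ == "__main__":
    # The web server sets QUERY_STRING and friends before running the script.
    sys.stdout.write(render(os.environ))
```

Any server that can run a program and pass it an environment can host this, which is what makes CGI the common denominator.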

Doing our own thing for HTTP handling can quickly be another Pandora's
box, security concern, and integration problem for the (majority of)
people who do want to run Apache/Boa/Thttpd/Zeus/etc.

 We all agree that CGI sucks, and we could make that better with
 mod_python or some other such glue, but why go to the trouble?

CGI sucks yes, but it is the guaranteed common denominator, and CD
counts for more than feature whiz-bang at this level.

 Relying on Twisted for the incoming mail protocols is something I'm
 less certain about, although there is a lot of appeal to this
 approach.  

-1

Tarbaby, pandora's box, security nightmare, unbounded security envelope.

 We could throw lots of smarts into a Python port-25 listener, including
 global spam fighting and bounce processing.  

You ___really___ don't want to get into your own SMTP-level bounce
processing.  Really.  That's one huge endlessly sucking time sinker.
Let Philip Hazel, Wietse and the rest spend their time there.

 An approach like Exim + elspy affords some really cool possibilities.

Absolutely, but that is outside of Mailman's territory.

More interesting would be things like TMDA integration, or implementing
support for Yakov Shafranovich's extension of my consent token protocol:

  http://www.ietf.org/internet-drafts/draft-irtf-asrg-cri-00.txt

Getting early buy-in as a sample implementation for an MLM wouldn't be a
Bad Thing.  There's a lot of really neat and useful integration and
feature set territory to explore before you start staring down the MTA's
throat.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 11:43 PM -0500 2003/10/29, J C Lawrence wrote:

 True, but most of those really end up being a meta-indexing problem.
	Fair enough.

 Not a whole lot more complexity.  You're just invalidating the
 pointed-to data as well as the key.  You're still not doing free space
 management.
	What about your backups?  And your off-site backups?  And your 
mirror sites around the world?  Any other copies of those files that 
might have been copied off somewhere else?

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 11:53 PM -0500 2003/10/29, Barry Warsaw wrote:

 Chuq, do you think it would be feasible for Mailman to try to handle
 that 90% itself, and then only hand-off to a Real MTA when it runs into
 trouble with the other 10% -- assuming it could know when it runs into
 trouble.
	Bryan Costales and Eric Allman had this debate at 
InfoBeat/Mercury Mail.  Bryan said that he could write a better 
simple MTA that could handle the easy 80% and leave the hard 20% to 
sendmail.  Eric showed that he could improve sendmail to the point 
where it would perform at or near the level of performance of Bryan's 
code without throwing everything out, and would out-perform every 
other aspect of the system in question (so that the MTA was no longer 
the bottleneck at any stage).

	I'm confident that the same sort of approach is appropriate for 
other well-respected MTAs (e.g., postfix, and exim in my personal 
experience).

 Also, there's incoming SMTP and outgoing SMTP.  It may be possible to
 build in support for one direction without providing the other.  (It
 also may not be worth it.)
	It's hard enough writing an incoming SMTP handler, and doing it 
right.  Many large service providers have seriously screwed up when 
trying to do so (bigfoot anyone?), and others have only implemented 
half of the inbound solution (AOL), leaving the harder parts to 
standard programs like sendmail.

	Even then I argued violently against this approach at AOL, and 
felt that we could do a better job by leaving all the external 
interfacing/queueing issues to sendmail, and instead make the 
in-house developed code an LMTP Local Delivery Agent.  I was 
over-ruled, primarily because we had already gone too far down the 
road that had been chosen for us.  Note that none of the original 
Internet Mail Operations team members are left at AOL (almost all 
bugged out when the new mail server software came online), and I 
don't think any of the original Internet Mail Development team 
members are left, either.

	Bad Juju, Bwana.

	I've been down this road before.  Trust me, you don't want to do this.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 20:59:56 -0800 
Chuq Von Rospach [EMAIL PROTECTED] wrote:
 On Oct 29, 2003, at 8:53 PM, Barry Warsaw wrote:

 Chuq, do you think it would be feasible for Mailman to try to handle
 that 90% itself, and then only hand-off to a Real MTA when it runs
 into trouble with the other 10% -- assuming it could know when it
 runs into trouble.

 I think you have enough on your plate to not re-invent what others
 have already done pretty well. When you run out of features to
 implement, then think about this. Not until.

Seconded, in spades.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
engineering details.

On Oct 29, 2003, at 8:59 PM, Brad Knowles wrote:

	What about your backups?  And your off-site backups?  And your mirror 
sites around the world?  Any other copies of those files that might 
have been copied off somewhere else?




Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
On Oct 29, 2003, at 9:08 PM, Brad Knowles wrote:

 Bryan Costales and Eric Allman had this debate at InfoBeat/Mercury
 Mail.  Bryan said that he could write a better simple MTA that could
 handle the easy 80% and leave the hard 20% to sendmail.

There is no such thing as a simple MTA. This gets hairy quickly. Really
quickly.

you are much better off spending money on a good fast disk RAID (since 
the chances that you'll win the lottery are on par with the chances 
that your bottleneck is NOT disk I/O in mail sending) than on a 
programmer to try to build fast MTAs.

Note that none of the original Internet Mail Operations team members are 
left at AOL (almost all bugged out when the new mail server software 
came online), and I don't think any of the original Internet Mail 
Development team members are left, either.

And boy, does it show.





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Thu, 30 Oct 2003 05:59:48 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:
 At 11:43 PM -0500 2003/10/29, J C Lawrence wrote:

 Not a whole lot more complexity.  You're just invalidating the
 pointed-to data as well as the key.  You're still not doing free
 space management.

 What about your backups?  And your off-site backups?  And your mirror
 sites around the world?  Any other copies of those files that might
 have been copied off somewhere else?

I'm not going to touch the aspects of attempting to rewrite the data in
backup sets without invalidating the backups.  Uhh uhh.  No deal.  I'm
also not going to touch the management of data that has been copied
outside of the store's purview.  It's no longer in the store's scope and
so isn't really under discussion.  I can run strings on my Oracle tables
as well, but that really doesn't make the resulting data files part of
Oracle's data-management model.

At its core this is a snapshot issue.  What you're really arguing for is
the ability to revert, recover, or synchronise (they're all the same
thing under the covers) the state of the store in a logically consistent
fashion.  As such you're interested in logical consistency for not just
one Big File, but across files, and across the meta-indexes; logical
consistency of the store as a whole.  This really isn't a storage format
problem.  It's a transaction framing problem and a snapshotting problem
(which is really just a transaction framing problem).  You need to not
only know the state of the data files, but the state of the
meta-indexes, and that they are synchronised with each other.

This is not a trivial space, but it's also not an unknown space.  File
versioning systems have been messing here for years with change keys and
signatures.  Ultimately it comes down to a shared transaction key.
The old AT&T SCCS papers are a particularly good read in this regard.
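
To make the shared transaction key idea concrete, here is a minimal
sketch in Python.  It is an illustration only, not anything taken from
SCCS or from Mailman: the FramedStore class, the data.log and
index.json file names, and the JSON record layout are all assumptions.
Every commit stamps both the data log and the meta-index with the same
key, so a reader can tell whether the two are synchronised.

```python
import json
import os
import tempfile


class FramedStore:
    """Toy store (hypothetical): an append-only data log plus one index file.

    Each commit writes the same transaction key into both files, so a
    mismatch between them shows the store is not logically consistent.
    """

    def __init__(self, root):
        self.data_path = os.path.join(root, "data.log")
        self.index_path = os.path.join(root, "index.json")

    def _read_index(self):
        if not os.path.exists(self.index_path):
            return {"txn": 0, "offsets": {}}
        with open(self.index_path) as fh:
            return json.load(fh)

    def _last_data_txn(self):
        # Transaction key of the last record in the data log (0 if empty).
        last = 0
        if os.path.exists(self.data_path):
            with open(self.data_path) as fh:
                for line in fh:
                    last = json.loads(line)["txn"]
        return last

    def commit(self, key, body):
        index = self._read_index()
        txn = index["txn"] + 1
        offset = 0
        if os.path.exists(self.data_path):
            offset = os.path.getsize(self.data_path)
        # Step 1: append the record, framed with the transaction key.
        with open(self.data_path, "a") as fh:
            fh.write(json.dumps({"txn": txn, "key": key, "body": body}) + "\n")
        # Step 2: publish the index under the same key with an atomic rename.
        index["txn"] = txn
        index["offsets"][key] = offset
        tmp = self.index_path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(index, fh)
        os.replace(tmp, self.index_path)
        return txn

    def is_consistent(self):
        # Data and meta-index agree only if they carry the same key.
        return self._read_index()["txn"] == self._last_data_txn()
```

A crash between the two steps leaves the data log one key ahead of the
index, which is_consistent() reports.  A snapshot or mirror taken at
that moment can be checked the same way before anyone trusts it.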

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Thu, 30 Oct 2003 06:08:32 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:
 At 11:53 PM -0500 2003/10/29, Barry Warsaw wrote:

 Note that none of the original Internet Mail Operations team members
 are left at AOL (almost all bugged out when the new mail server
 software came online), and I don't think any of the original Internet
 Mail Development team members are left, either.

Eeek.  Not fun.

 I've been down this road before.  Trust me, you don't want to do this.

Barry, listen to this man.  He speaks sooth.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Chuq Von Rospach
On Oct 29, 2003, at 9:21 PM, J C Lawrence wrote:
This is not a trivial space, but it's also not an unknown space.  File
versioning systems have been messing here for years with change keys 
and signatures.  Ultimately it comes down to a shared transaction key.
The old AT&T SCCS papers are a particularly good read in this regard.

How does this statement reconcile with Barry's not wanting to require 
MySQL or PostgreSQL for Mailman because he doesn't want to layer on too 
many dependencies to get Mailman running? We seem to be heading off 
into places where the answer is "if we're lucky, it'll run on that 
cluster of G5's at Uvirginia -- slowly."

Unless Barry wants to throw his simplicity requirements out the window, 
we can't expect high performance filesystems, SANs, fiber optic RAID 
connects, or for that matter, linux over windows over sgi over solaris 
2.5. This stuff that's floating around is great, if we were writing an 
enterprise-class, mega-bugger IS-supported system for a corporate data 
center.

How's that all relate to Mailman, anyway? Maybe we should refocus and 
not wander down interesting but entirely philosophical ratholes?





Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 23:41:37 -0500 
Barry Warsaw [EMAIL PROTECTED] wrote:
 On Wed, 2003-10-29 at 23:17, J C Lawrence wrote:

 Sounds great.  Bring a laptop and we'll bang out some code (anyone
 else up for a mini-Mailman-3 sprint at my house? :).  I'll probably be
 heading to Fedex Field on the 9th for a 'Skins game, so the 8th would
 be perfect.

Hopefully I know by this Sunday.  Will see.

 Eeeek!  I trust this would be for immediate handoff to a real MTA
 versus handling final delivery directly?  Quite the Pandora's box if
 not.

 Yep...

In what way would this be different from the current SMTP delivery
supports?

 BTW Whatever happened to Michel Pelletier's interfaces PEP?  I see
 the draft, and I see signs that something got done, but not what...

 Dead in the water AFAIK.  

Ahh.

 Last I heard Nigel was still running screaming into the hills.

 Hey, I love Exim -- Greg's done some very cool stuff with it on
 mail.{python,zope}.org.  But man, I find it hard to track down just
 the right knob I need to tweak. :)

Hehn.  I like Exim a lot, and compared to the competition the
documentation is superb.

I got a note after my "Mailman doesn't have enough config options" that
he'd, err, had a somewhat explosive reaction.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Wed, 29 Oct 2003 21:31:51 -0800 
Chuq Von Rospach [EMAIL PROTECTED] wrote:
 On Oct 29, 2003, at 9:21 PM, J C Lawrence wrote:

 How's that all relate to Mailman, anyway? Maybe we should refocus and
 not wander down interesting but entirely philosophical ratholes?

Agreed, but then I've said my piece several times on those scores.

We need a requirements definition for the abstractions for storage,
indexing and presentation.  I've already stated my bits there.  So far
there's been neither argument nor commentary, just a bunch of
cross-purposes violent agreement between Brad and me.

While I like a netnews model as it suits my needs, I really don't care
what the store is so long as it solves the problems I've laid out.  We
need a priori key determination, a collision policy, key handoffs to an
indexer (which could be NULL in Chuq's MySQL case), and an
improved/adapted presentation layer.  I've already said my bits there
and proposed what I see as the cheap, easy, incremental improvement
course: Twisted's NNTP supports for storage, Message IDs for keys, a
variant best-effort detection and rewriting policy for collisions, and a
MeoWWW derivative for HTML presentation/posting.
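
As a rough sketch of how those pieces fit together, in Python: the key
is determined from the message alone (its Message-ID), a collision
policy decides what happens when a different message arrives under a
used key, and the final key is handed to an optional indexer.  The
MemoryStore class, the ".dupN" rewriting scheme, and the indexer
callback signature are all assumptions for illustration; nothing here
is Twisted's or Mailman's actual API.

```python
from email import message_from_string


def _rewrite(key, n):
    """Hypothetical rewriting rule: tag the n-th colliding key with .dupN."""
    if key.endswith(">"):
        return "%s.dup%d>" % (key[:-1], n)
    return "%s.dup%d" % (key, n)


class MemoryStore:
    """Toy store: Message-ID keys, best-effort collision rewriting."""

    def __init__(self, indexer=None):
        self.messages = {}
        self.indexer = indexer  # None models the "NULL indexer" case

    def determine_key(self, msg):
        # A priori: the key depends only on the message, not arrival order.
        return (msg["Message-ID"] or "").strip()

    def add(self, raw):
        msg = message_from_string(raw)
        key = self.determine_key(msg)
        if not key:
            raise ValueError("message has no Message-ID")
        n = 0
        candidate = key
        while candidate in self.messages:
            if self.messages[candidate] == raw:
                return candidate  # exact duplicate: already stored
            # Different content under a used key: try the next rewrite.
            n += 1
            candidate = _rewrite(key, n)
        self.messages[candidate] = raw
        if self.indexer is not None:
            self.indexer(candidate, msg)  # key handoff to the indexer
        return candidate
```

Storing the same message twice returns the same key, so URLs built
from the key stay stable across re-injection.  Only a genuinely
different message that reuses a Message-ID gets a rewritten key.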

Counters?

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Thu, 30 Oct 2003 05:00:48 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:
 At 10:47 PM -0500 2003/10/29, Barry Warsaw wrote:

 SGI's XFS on Irix does a pretty good job, with hashed directory
 structures, and an extent-based journaling filesystem.  

ReiserFS also does particularly well here.  I haven't yet tested IBM's
JFS.  Last time I hit VxFS hard (back in the HP-UX 20.20 days) it really
didn't like huge directories, but that may have changed since then.
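
For stores that can't count on the filesystem coping with huge
directories, a different and common workaround is to fan the files out
across hashed subdirectories in the application itself.  The sketch
below (Python) is that swapped-in technique, not what XFS or ReiserFS
do internally; the fanout_path name, its parameters, and the use of
SHA-1 over the Message-ID are assumptions.

```python
import hashlib
import os


def fanout_path(root, message_id, levels=2, width=2):
    """Map a message ID to root/ab/cd/<digest>.

    With width=2 hex characters, each level holds at most 256
    subdirectories, so no single directory grows without bound.
    """
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, digest)
```

The same ID always maps to the same path, so lookups need no separate
index, whatever filesystem sits underneath.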

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 9:16 PM -0800 2003/10/29, Chuq Von Rospach wrote:

 There is no such thing as a simple MTA. This gets hairy quickly.
 Really quickly.
	Bryan is one of the few people I would expect to be able to do 
something that could actually handle the easy 80%.  Writing the book 
_sendmail_ (now in its fourth edition) is just one of his many 
talents.

 you are much better off spending money on a good fast disk RAID (since
 the chances that you'll win the lottery are on par with the chances
 that your bottleneck is NOT disk I/O in mail sending) than on a
 programmer to try to build fast MTAs.
	They were already using pure RAM disks for this application. 
Disk I/O was not the problem.

	Bryan and Eric were two major contributors to my invited talks 
"Sendmail Performance Tuning for Large Systems" (see 
http://www.shub-internet.org/brad/papers/sendmail-tuning/) and 
"Design and Implementation of Highly Scalable E-mail Systems" (see 
http://www.shub-internet.org/brad/papers/dihses/).  These guys are 
not lightweights in this field.

 And boy, does it show.
	Indeed.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Thu, 30 Oct 2003 05:51:32 +0100 
Brad Knowles [EMAIL PROTECTED] wrote:

 I'm sorry, I don't trust ReiserFS at all.  I'd trust XFS if it was on
 Irix, or IBM's JFS, but not ReiserFS.  Hell, on a Linux system, I'd use
 ext2fs before I'd use Reiser.

I'll simply note that I've been using ReiserFS on just over a dozen
systems ranging from million+ messages a day list servers to build, dev,
web, and desktop boxes.  I've yet to have problems.

 ... do you really want to try to force them all to use ReiserFS on
 Linux as their only supported option?

Err, want or consider reasonable?

 My understanding is that ext3fs is dead.  

I'd thought that Ted Ts'o took over some of the reins in his move to
IBM, but I haven't chatted to him in a long while.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread J C Lawrence
On Thu, 30 Oct 2003 00:56:29 -0500 
J C Lawrence wrote:
 On Thu, 30 Oct 2003 05:51:32 +0100 Brad Knowles
 [EMAIL PROTECTED] wrote:

 I'm sorry, I don't trust ReiserFS at all.  I'd trust XFS if it was on
 Irix, or IBM's JFS, but not ReiserFS.  Hell, on a Linux system, I'd
 use ext2fs before I'd use Reiser.

 I'll simply note that I've been using ReiserFS on just over a dozen
 systems ranging from million+ messages a day list servers to build,
 dev, web, and desktop boxes.  I've yet to have problems.

Err, add in just under three years.

-- 
J C Lawrence
-(*)Satan, oscillate my metallic sonatas. 
[EMAIL PROTECTED]   He lived as a devil, eh?  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.



Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 12:49 AM -0500 2003/10/30, J C Lawrence wrote:

 ReiserFS also does particularly well here.  I haven't yet tested IBM's
 JFS.  Last time I hit VxFS hard (back in the HP-UX 20.20 days) it really
 didn't like huge directories, but that may have changed since then.
	HP-UX 20.20?  I wasn't aware that they had gone much beyond HP-UX 
11.x.  Did you mean HP-UX 10.20?  Now that's a beast I remember, and 
remember loathing with a passion.

	HP-UX 9 was slow, but rock-solid -- no matter how hard you beat 
on the damn thing, it just slowed down but never stopped.  HP-UX 10.x 
was a real dog.  HP-UX 11.x looked like it was going to shape up 
better, but then I got out of AOL before we had many of those systems 
in house.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)


Re: [Mailman-Developers] Requirements for a new archiver

2003-10-29 Thread Brad Knowles
At 12:40 AM -0500 2003/10/30, J C Lawrence wrote:

 We
 need a priori key determination, a collision policy, key handoffs to an
 indexer (which could be NULL in Chuq's MySQL case), and an
 improved/adapted presentation layer.
	As far as this goes, I agree.

   I've already said my bits there
 and proposed what I see as the cheap, easy, incremental improvement
 course: Twisted's NNTP supports for storage, Message IDs for keys, a
 variant best-effort detection and rewriting policy for collisions, and a
 MeoWWW derivative for HTML presentation/posting.
	I don't know anything about Twisted or MeoWWW, so I can't say how 
they address the subjects above.

	I can say that I'm not sure about an NNTP-based storage solution, 
although certain storage techniques we've recently discussed borrow a 
lot from extant NNTP implementations, and I'm not sure how much sense 
it would make to rip out just those parts we know we need, or if we 
could actually reasonably take the whole thing, kit-n-caboodle.

	I do believe that we need an alternative solution to the 
message-id header as it was presented to us in the message, as a 
stable guaranteed unique (well, as good as MD5 or SHA-1 gets) 
message identifier that can always be used to refer to the exact same 
message no matter what.  Whether we use this message identifier as a 
replacement for the message-id header value as it was presented to us 
-- I think that's a more philosophical discussion, and I think we 
should address it by allowing both options but deciding which would 
be a reasonable default to take.
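
	A minimal sketch of what such an identifier could look like, in 
Python.  The names stable_id and archive_key are hypothetical, and 
hashing the complete raw message as received is an assumption (which 
bytes to hash is itself a design decision).  Defaulting to the digest 
rather than the header is likewise only one possible choice.

```python
import hashlib
from email import message_from_bytes


def stable_id(raw, algorithm="sha1"):
    """Return a hex digest that identifies the message by its content."""
    return hashlib.new(algorithm, raw).hexdigest()


def archive_key(raw, prefer_header=False):
    """Pick the key for a message.

    By default (an assumed default) the content digest is used.  With
    prefer_header=True the Message-ID header is used when present,
    falling back to the digest when it is missing.
    """
    if prefer_header:
        header = message_from_bytes(raw)["Message-ID"]
        if header:
            return header.strip()
    return stable_id(raw)
```

	The digest depends only on the bytes hashed, so rebuilding the 
archive from the same messages yields the same identifiers, whatever 
the message-id header says.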

	Given that the mailman UI is basically completely contained 
within the CGI, I'm inclined to leave it there and work on improving 
it internally, allowing us to continue to work with most any 
webserver the client may have.  I don't know how MeoWWW addresses 
this issue, either by replacing the webserver, or providing 
additional tools that may make it easier to present a good and 
consistent UI.

--
Brad Knowles, [EMAIL PROTECTED]
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++): a C++(+++)$ UMBSHI$ P+++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+() DI+() D+(++) G+() e++ h--- r---(+++)* z(+++)

