Re: [Mailman-Developers] Improving the archives
but if you can trust yourself to generate them, consecutive integers provide minimal, order-preserving, perfect hashing, too! Hmm this sounds pretty sensible to me. Jeff ___ Mailman-Developers mailing list Mailman-Developers@python.org http://mail.python.org/mailman/listinfo/mailman-developers Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py Searchable Archives: http://www.mail-archive.com/mailman-developers%40python.org/ Unsubscribe: http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org Security Policy: http://www.python.org/cgi-bin/faqw-mm.py?req=showfile=faq01.027.htp
Re: [Mailman-Developers] Improving the archives
Craig Loomis writes:

> Globally unique IDs, hashed IDs, etc., are very appealing from various CS-y and techie points of view, but are simply not memorable to humans or knowable by dumb external programs. I think as much, or more, effort should be put into delivering a straightforwardly usable naming scheme as goes into making an arbitrary message recoverable from anywhere. Basically, friendly URLs should be a primary requirement, not an optional afterthought for careless geeks like me to get wrong later.

Friendly URLs *are* a primary requirement. The point is that to make them *reliable* as well, either a globally unique ID is needed, or individual site admins must suffer through hard-to-document constraints on what they can do with their archives.

Note that the system you describe, based on the post_id member, demonstrates the value of a unique ID. Sufficient reliability is not a tough requirement for an individual admin to achieve, as you have demonstrated. It's much more exacting for the Mailman developers, who need to satisfy both sites with different needs *and* archivers with different features.

> As an aside on other discussions, can you get away without using Message-ID or Date?

No. Not all recipients of the messages get them through the list. Once again, Mailman developers have to consider that situation, while in your situation you may not need to worry about it.
Re: [Mailman-Developers] Improving the archives
Or Re: [Mailman-Developers 10417] Improving the archives I would like to interject and highlight some use cases for stable and predictable IDs. For us, message IDs are directly used both by people and ignorant programs. Our mailing lists serve as a permanent and concise record of our discussions, decisions, and operations, and we find it invaluable to be able to refer to individual messages in a simple and memorable way: message 1210 in the calibration list, say. Other people can then easily jot that info down or directly find the message. Some message IDs even become shorthands for a particular topic or decision. We have also added trac InterWiki templates pointing into our mail archives (as listname:number), which encourages desirable cross-referencing (PRs, wiki pages, and SVN change logs can refer to mail messages, just as wiki pages could always refer to changesets and PRs, etc, etc.) But trac InterWiki templates can only interpolate $1,$2,... arguments into strings, and could not possibly calculate anything based on the _content_ of the messages. Globally unique IDs, hashed IDs, etc., are very appealing from various CS-y and techie points of view, but are simply not memorable to humans or knowable by dumb external programs. I think as much, or more, effort should be put into delivering a straightforwardly useable naming scheme as goes into making an arbitrary message recoverable from anywhere. Basically, friendly URLs should be a primary requirement, not an optional afterthought for careless geeks like me to get wrong later We long ago added an extremely simple ID handoff between MM 2.1.8 and pipermail, and though imperfect it has served us well. Basically, we hijacked the .post_id member in mailman (otherwise basically unused, and mysteriously a floating point number); CookHeaders stuffed it into a X-Mailman-Sequence-ID header line, and AfterDelivery incremented it. 
In turn, pipermail uses the header to feed a sequence ID into make_article, and the message is squirreled away as $mailinglist/all/%d.html. There are a few other minor matters (e.g. post_id was added to Decorators, a couple of templates were changed, we lost having 'ls' sort chronologically [did we have to add .last and .prev to the HyperDatabase classes?]), but it really was a minor bit of work. And for stability, as long as the archive files aren't lost, pipermail rebuilds should yield the same URLs even if junk messages have been deleted. [Oh, we did also add a never-rotate policy to our archives, but that is finessable.]

As an aside on other discussions, can you get away without using Message-ID or Date? I.e., aren't those just more of those tokens which were standardized back before the Internet got tricky enough to invalidate the standards? Mailing lists serialize incoming messages, and so can generate their own unique and trustworthy IDs. UUIDs would work, but if you can trust yourself to generate them, consecutive integers provide minimal, order-preserving, perfect hashing, too!

Anyhow, we have found that people will enthusiastically refer by name to individual messages within mail archives if they can.

- craig
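The handoff described above can be sketched roughly as follows. This is a minimal illustration, not the actual Mailman 2.1.8 or pipermail code: `ListState`, `stamp`, `after_delivery` and `archive_path` are hypothetical names, and only the `X-Mailman-Sequence-ID` header, the `post_id` counter and the `$mailinglist/all/%d.html` layout come from the message.

```python
from dataclasses import dataclass
from email.message import EmailMessage

SEQ_HEADER = "X-Mailman-Sequence-ID"  # header name taken from the message above


@dataclass
class ListState:
    """Hypothetical stand-in for the per-list state holding post_id."""
    name: str
    post_id: int = 1


def stamp(mlist: ListState, msg: EmailMessage) -> None:
    """Copy the list's current counter into the outgoing message."""
    msg[SEQ_HEADER] = str(int(mlist.post_id))


def after_delivery(mlist: ListState) -> None:
    """Advance the counter once the message has gone out."""
    mlist.post_id += 1


def archive_path(mlist: ListState, msg: EmailMessage) -> str:
    """Derive the archive file name from the header, not from arrival order."""
    return "%s/all/%d.html" % (mlist.name, int(str(msg[SEQ_HEADER])))
```

Because the path is read back from the header stored in the message, a rebuild from the raw archive gives the same file names even when some messages have been deleted in between.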
Re: [Mailman-Developers] Improving the archives
Question: what about crossposted messages? Let's say a message gets sent to a list called mailman-developers with a CC to a list called pet-bunnies. Hypothetically, of course. Presumably, the person who got the message from pet-bunnies should end up at the pet-bunnies archive, where the message can be viewed in proper context: right before the processed carrots flamewar and after the manifesto on proper hopping technique.

To make that work, I think we need some way to, at least optionally, allow one or more of the RFC 2369 headers to influence the archival URL. Reading the wiki, I guess that's where List-Archive comes into play?

My other question is about the angle brackets. Barry, why are you inclined to include them in calculations? It's kind of arbitrary, but quoting RFC 2822, end of section 3.6.4:

    Semantically, the angle bracket characters are not part of the msg-id; the msg-id is what is contained between the two angle bracket characters.
Re: [Mailman-Developers] Improving the archives
--On 2 October 2007 22:47:35 -0400 Barry Warsaw [EMAIL PROTECTED] wrote:

> One question: should the angle brackets on the Message-ID be part of the hash or not? I think they should, or IOW, the entire value of the Message-ID header is taken as the hash, though they should be stripped off if using the Message-ID in any kind of archive query. I'm open to suggestions though... comments?

Mathematically, the two solutions are equivalent for valid headers, aren't they? OK, the hashes will be different, but only in a trivial sense.

Technically, I imagine, it's going to be easier to handle bogus headers if you just hash the entire header. For example, what do you do if some piece of crapware gives you a message with a header missing the angle brackets? Or that adds something outside angle brackets? Or that includes a right-angle bracket in the message-id itself?

You don't have to think about any of those situations if you either (A) reject the message or (B) encode the entire header.

--
Ian Eiloart
IT Services, University of Sussex
x3148
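To make the three failure cases above concrete, here is a small sketch of what a "strip the brackets" approach has to check before it can hash anything. The pattern is a deliberately simplified assumption, not the full RFC 2822 msg-id grammar, and `extract_msg_id` is a hypothetical helper name.

```python
import re
from typing import Optional

# Simplified assumption: exactly one <left@right> token, with no
# whitespace or stray angle brackets inside. Not the full RFC grammar.
_MSG_ID = re.compile(r"^<([^<>\s]+@[^<>\s]+)>$")


def extract_msg_id(header_value: str) -> Optional[str]:
    """Return the text between the angle brackets, or None if malformed."""
    match = _MSG_ID.match(header_value.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    for value in ("<a@b.example>",        # well formed
                  "a@b.example",          # brackets missing
                  "<a@b.example> extra",  # text outside the brackets
                  "<a>b@c.example>"):     # bracket inside the id
        print(repr(value), "->", extract_msg_id(value))
```

Option (A) corresponds to rejecting every value for which the helper returns None; option (B) skips the helper entirely and feeds the raw header value to the hash.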
Re: [Mailman-Developers] Improving the archives
On Aug 8, 2007, at 1:04 AM, Dale Newfield wrote:

> Jeff Breidenbach wrote:
>> 5.85 million messages
>> That's 0.03% if you count all the messages. It is 0.008% if you discard the top three offenders, all of which I have contacted.
>
> I'd say that's a strong argument for just using the Message-ID and simplifying this tremendously...
>
> ...Barry, do you disagree?

No, I'm convinced. Apologies for taking so long to respond. The code in the Mailman 3.0 branch has been updated to use only the Message-ID. I still think the base32-encoded SHA1 hash is a good, user-friendlier option, and that archivers should accept either.

One question: should the angle brackets on the Message-ID be part of the hash or not? I think they should, or IOW, the entire value of the Message-ID header is taken as the hash, though they should be stripped off if using the Message-ID in any kind of archive query. I'm open to suggestions though... comments?

> (It can still be a base32-encoded SHA hash to make it less user hostile.)
> http://wiki.list.org/display/DEV/Stable+URLs

The wiki is down at the moment (I have an issue open on the support tracker about that). When it comes up, I'll update the page.

Thanks everyone for a very good thread, and especially to Jeff for doing the analysis on real data.

-Barry
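For readers who want to see what the two choices look like in practice, the following sketch computes a base32-encoded SHA1 of a Message-ID with and without the angle brackets. It is an illustration under assumptions, not the code in the Mailman 3.0 branch: `stable_id` and its `keep_brackets` flag are hypothetical, as is the decision to trim surrounding whitespace and encode as UTF-8.

```python
import base64
import hashlib


def stable_id(message_id: str, keep_brackets: bool = True) -> str:
    """Base32-encoded SHA1 of a Message-ID header value.

    keep_brackets=True hashes the whole header value (minus surrounding
    whitespace); keep_brackets=False drops one enclosing <...> pair first.
    """
    value = message_id.strip()
    if not keep_brackets and value.startswith("<") and value.endswith(">"):
        value = value[1:-1]
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    # A SHA1 digest is 20 bytes, so the base32 form is 32 characters
    # with no '=' padding.
    return base64.b32encode(digest).decode("ascii")


if __name__ == "__main__":
    mid = "<example.123@mail.example.org>"  # made-up Message-ID
    print(stable_id(mid))
    print(stable_id(mid, keep_brackets=False))
```

Both variants are stable for a given header, but they produce different strings, so list servers and archivers would have to agree on one of them.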
Re: [Mailman-Developers] Improving the archives
> What we really want to know is how many (non-empty) Message-ID collisions are there that *don't* share a Date? This is the number of messages that only-messageid loses, and that the composite identifier method would not lose.

I took a look at a larger dataset, 5.85 million messages from several thousand lists. Of the messages that share message-id but not date, most come from a small number of web-based services.

    875 come from forums.slimdevices.com
    378 come from lists.openplans.org
    265 come from nabble.com
    164 come from egroups.com
    135 come from yahoo.com
    166 come from elsewhere

That's 0.03% if you count all the messages. It is 0.008% if you discard the top three offenders, all of which I have contacted. I didn't try contacting Yahoo/eGroups because in my past experience, talking to a brick wall is easier.

I have not analyzed how many of these messages are spam or have duplicate bodies, which further discounts the percentages.

Hope this data helps.
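The two percentages can be re-derived from the per-site counts. The short check below uses only the figures quoted in the message and treats "5.85 million" as exactly 5,850,000, which is an assumption since the message gives a rounded total.

```python
# Per-source counts of messages that share a Message-ID but not a Date,
# as listed in the message above.
counts = {
    "forums.slimdevices.com": 875,
    "lists.openplans.org": 378,
    "nabble.com": 265,
    "egroups.com": 164,
    "yahoo.com": 135,
    "elsewhere": 166,
}
TOTAL_MESSAGES = 5_850_000  # "5.85 million", taken as exact here

all_collisions = sum(counts.values())
top_three = sorted(counts.values(), reverse=True)[:3]
without_top_three = all_collisions - sum(top_three)

pct_all = 100 * all_collisions / TOTAL_MESSAGES
pct_rest = 100 * without_top_three / TOTAL_MESSAGES

print(f"{all_collisions} collisions: {pct_all:.3f}% of all messages")
print(f"{without_top_three} without the top three: {pct_rest:.4f}%")
```

The sums come to 1983 and 465 messages, which is where the rounded 0.03% and 0.008% figures come from.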
Re: [Mailman-Developers] Improving the archives
Jeff Breidenbach wrote:

> 5.85 million messages
> That's 0.03% if you count all the messages. It is 0.008% if you discard the top three offenders, all of which I have contacted.

I'd say that's a strong argument for just using the Message-ID and simplifying this tremendously...

...Barry, do you disagree?

(It can still be a base32-encoded SHA hash to make it less user hostile.)

http://wiki.list.org/display/DEV/Stable+URLs

-Dale
Re: [Mailman-Developers] Improving the archives
> 704 messages fall into this category. Of these, 596 come from a single (malfunctioning and duplicate spewing) list server. I have not yet examined the remaining 208 messages, but I'll bet anything many also have duplicate message bodies. Or are spam. So for this data set, we have an upper bound of 0.01% messages in this category, possibly significantly less.

Correction. ... remaining 108 ... 0.005% ...
Re: [Mailman-Developers] Improving the archives
> What we really want to know is how many (non-empty) Message-ID collisions are there that *don't* share a Date? This is the number of messages that only-messageid loses, and that the composite identifier method would not lose.

It took longer than expected, but I now have numbers from looking at 2,151,896 messages spread over a few thousand lists. The appended script was run over a set of MH format raw messages.

704 messages fall into this category. Of these, 596 come from a single (malfunctioning and duplicate spewing) list server. I have not yet examined the remaining 208 messages, but I'll bet anything many also have duplicate message bodies. Or are spam. So for this data set, we have an upper bound of 0.01% messages in this category, possibly significantly less.

Jeff

    #!/bin/bash
    #
    # Look for messages that
    #
    #   Do collide with message-id
    #   Don't collide with message-id + date

    DIR=/home/archive/Mail
    C1=0    # running total of messages scanned
    C2=0    # running total of collisions found

    # Print "date | message-id" lines for messages in list $1 whose
    # message-id repeats but whose date + message-id line does not.
    get_interesting_messages() {
        cd "$DIR/$1"
        for j in $(ls -U); do
            MSG_ID=$(cat "$j" | 822field message-id)
            MSG_DATE=$(cat "$j" | 822field date)
            if [ "$MSG_ID" != "" ]; then
                echo "$MSG_DATE | $MSG_ID"
            fi
        done |\
        sort |\
        uniq --separator='|' --skip-fields=1 --all-repeated |\
        uniq --uniq
    }

    for i in $(ls $DIR | grep @); do
        DUP=$(get_interesting_messages "$i")
        DUP_CNT=$(echo -n "$DUP" | wc -l)
        MSG_CNT=$(cd "$DIR/$i" && ls -U | wc -w)
        C1=$(( C1 + MSG_CNT ))
        C2=$(( C2 + DUP_CNT ))
        if [ "$DUP_CNT" != 0 ]; then
            echo
            echo "=== collisions/messages: $C2/$C1 $i"
            echo "$DUP"
        else
            # progress dot for lists with no collisions
            echo -n . 1>&2
        fi
    done

-Dale
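The same question, "which messages collide on Message-ID but not on Message-ID plus Date", can also be asked in Python. The sketch below is not a port of the script and has not been run against the data discussed here; `count_lossy_collisions` is a hypothetical name, and it takes raw message strings so that it does not depend on any particular mailbox layout.

```python
from collections import Counter
from email import message_from_string
from typing import Iterable


def count_lossy_collisions(raw_messages: Iterable[str]) -> int:
    """Count messages whose Message-ID is shared with another message
    but whose (Message-ID, Date) pair occurs only once."""
    id_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        if not msg_id:
            continue  # empty IDs are skipped, as in the script
        date = (msg.get("Date") or "").strip()
        id_counts[msg_id] += 1
        pair_counts[(msg_id, date)] += 1
    return sum(
        1
        for (msg_id, _date), n in pair_counts.items()
        if n == 1 and id_counts[msg_id] > 1
    )
```

Exact duplicates (same ID and same Date) are not counted, since both keying schemes would treat them as one message.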
Re: [Mailman-Developers] Improving the archives
Jeff Breidenbach wrote:

> So I just looked at 2 million raw messages from 2007, spread over a few thousand mailing lists (all data is from mail-archive.com). My first question was - when comparing only with messages from the same list - how many times do I see a repeated message-id? The answer was ... drumroll please ... 260 thousand. What the hell?

I think the question you were originally going to ask got sidetracked.

If we assume that all these multiple-paths-from-list-to-archive duplicates share not only a Message-ID but also a Date (they were the same message originally, so they should!), then both schemes (messageid, and messageid+date) would decide that all (but one of) these messages are redundant.

What we really want to know is how many (non-empty) Message-ID collisions are there that *don't* share a Date? This is the number of messages that only-messageid loses, and that the composite identifier method would not lose.

-Dale
Re: [Mailman-Developers] Improving the archives
> If you improve the script or find numbers that lead to different conclusions, now's the time to know!

Live and learn!

So I just looked at 2 million raw messages from 2007, spread over a few thousand mailing lists (all data is from mail-archive.com). My first question was - when comparing only with messages from the same list - how many times do I see a repeated message-id? The answer was ... drumroll please ... 260 thousand. What the hell? Time for a closer look.

In some cases, the archiver was getting two copies of every message. For example, the MLM (mailman) was sending out a message to subscriber A and subscriber B, and both paths eventually lead to the archiver. In another case, the MLM (YahooGroups) spammed 20 copies of the same message to every subscriber, and modified the body of each one. YahooGroups tends to create HTML mail and stick ads, possibly spyware, and who knows what other crap in message footers. There are probably other categories I haven't noticed yet; 260k messages is a lot of checking.

So you'd think the archives would be a complete mess. But they aren't, and I had no idea anything was remotely amiss under the hood. That's because mhonarc only archives one message per message-id. So those 19 repeats from YahooGroups get thrown away. This is actually a pretty robust strategy when you think about it; it keeps lots of annoyances out of archives and everyone who gets smited deserves it: accidental duplicates, malicious duplicates, broken mail transfer agents. Reasonable people can disagree, but I like it.

So I'm amending my request. If mailman and pipermail++ want to keep a verbatim record of everything passing through the MLM, fine. But please make it also possible to interoperate with archivers that use the looser mhonarc strategy, e.g. allow the interoperability URL to collide when message-ids collide. Currently Stephen's proposal allows this, Barry's does not.
Just to make things really concrete, here's an example from that YahooGroups collision I was describing. The 20 messages spammed to subscribers would all have an interoperability URL something like this (but perhaps not quite so enormously long) embedded in the message, in both headers and possibly a footer.

http://www.mail-archive.com/search?l=estika%40yahoogroups.comq=3578.125.161.129.196.1175036508.CBNWebMail%40webmail1.cbn.net.id

Clicking on it, the user goes to the archive server. For this particular archiver, an HTTP 302 redirect takes the user to another URL which happens to be more human friendly. But the details of what alternate URLs are available, if any, are really up to the archive server.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg01341.html

I think that's about it. I do kind of like Stephen's suggestion of allowing the archiver to supply a formula for the interoperability URL; if that's the case I'd say the RFC2369 headers could be fair game for use in the calculation. That allows cross-posted messages to easily link to their correct archive - note how I used the contents of List-Post when creating the interoperability URL above.

Jeff
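As an illustration of how a list server might assemble such a URL from List-Post and Message-ID, here is a rough sketch. It is not mail-archive.com's actual formula: `interop_url` and the `archive.example` host are hypothetical, the `l` and `q` parameter names are borrowed from the example URL above, and joining them with `&` is an assumption.

```python
from urllib.parse import urlencode


def interop_url(search_base: str, list_post: str, message_id: str) -> str:
    """Build a lookup URL from a List-Post header and a Message-ID header."""
    list_addr = list_post.strip().strip("<>")
    if list_addr.lower().startswith("mailto:"):
        list_addr = list_addr[len("mailto:"):]
    msg_id = message_id.strip().strip("<>")
    # urlencode percent-escapes '@' and other reserved characters.
    return search_base + "?" + urlencode({"l": list_addr, "q": msg_id})


if __name__ == "__main__":
    print(interop_url(
        "http://archive.example/search",      # hypothetical archive server
        "<mailto:pet-bunnies@example.org>",   # List-Post of the receiving list
        "<abc.123@mail.example.org>",         # made-up Message-ID
    ))
```

Because the list address is part of the URL, the same cross-posted message gets a different link on each list, while duplicate copies on one list still collide on purpose.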
Re: [Mailman-Developers] Improving the archives
If you are relying on the sender to do the right thing, then why not force them to create proper message-ids?

I think Barry's proposal is essentially a numbers game - e.g. he's hoping for significantly better results using Date in the calculation than not using it.

http://wiki.list.org/display/DEV/Stable+URLs

I'll try to tease out some more useful stats from some large datasets this weekend. (I can't just run the Python scripts as is because I don't have Python 2.5 in the same place as the data, I don't keep raw messages in mbox format, blah blah blah, but we'll figure it out.) My hypothesis is that Date doesn't really buy much, but that's in part because I have a vested interest in that outcome. We'll see how the data plays out.

And I still think RFC2369 headers are needed in the calculation if cross-posted messages are to be handled correctly.

Jeff
Re: [Mailman-Developers] Improving the archives
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On Jul 24, 2007, at 1:11 PM, Terri Oda wrote: On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote: So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL. It is the archive server's job to decide what is the canonical URL for a message. There's a good chance these archival URLs will be served by an HTTP redirect. So let's not use the word canonical. :) Someone already pointed out that the message ID is a bit long for a URL, so I'm guessing we're going to want some sort of shorter sequence number for messages for linking purposes. Yes, definitely. What do you think of the base32 examples I have on the wiki page? Regardless of whether we *need* to generate our own unique ID, I'm leaning towards the thought that we're going to *want* to generate our own for usability reasons. In a perfect world, i think we'd have a sequence number so I could visit http://example.com/mailman/ archives/listname/204.html and know that 205.html would be the next message to that list, but any short unique id would do if sequence numbers are too much of a pain. It seems silly to generate nice short links but then use message-id. If we can generate nice short links, we might as well use 'em throughout, unless you really think the default use of the archive will be to search it by messageid (which I sincerely doubt, from my user experiences). We'd want sequence numbers in the urls if we think people will hand edit them, say in a browser location bar. I'm not sure that's a common enough use case. Pipermail currently uses sequence numbers but there are big problems with that. First, the mbox'ing algorithm wasn't always correct so while sequence numbers were accurate when generating the html archives on the fly, they broke horribly when you try to regenerate them from an mbox file. It's also why we have tools like cleanarch which tries to unbreak earlier mboxing bugs by crufty heuristics. 
This /might/ be solved by ditching mboxes for maildir or some other canonical raw archiving format (not a bad idea in its own right), but manual surgery on the raw archives could still break it. Sometimes site admins just /have/ to remove messages, disrupting the sequencing.

-Barry
Re: [Mailman-Developers] Improving the archives
On Jul 24, 2007, at 2:03 PM, Jeff Breidenbach wrote:

>> Regardless of whether we *need* to generate our own unique ID, I'm leaning towards the thought that we're going to *want* to generate our own for usability reasons. In a perfect world, I think we'd have a sequence number so I could visit http://example.com/mailman/archives/listname/204.html and know that 205.html would be the next message to that list, but any short unique id would do if sequence numbers are too much of a pain.
>
> I agree there are a lot of usability benefits from short URLs, but perhaps this is the job of the archive server, and not the list server. Mharc (an archive server) is a great example here. Mharc's canonical message format is pretty human friendly.
>
> http://ww.mhonarc.org/archive/html/mharc-users/2002-08/msg0.html
>
> Unfortunately, there's no trivial way for the list server to know that human friendly URL when the message is sent out. Fortunately, Mharc also happily handles messages by message-id, which the list server does know about.
>
> http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=mharc-users[EMAIL PROTECTED]
>
> Had I been the implementer, I'd probably have made mharc do an HTTP 302 redirect from the longer URL to the shorter URL. But that's beside the point. The point is we have an existing, working, happy archival server, and it would be really nice if list servers (such as mailman) were compatible. And by compatible, I mean offering the capability of embedding an archival URL in the footers of messages.

I agree, I just don't think message-ids are user friendly enough to be this canonical url. Especially in this context, which is exactly where urls are thrown in users' faces. An archiving service is exactly the right place for redirecting human readable urls to the archiver's canonical url (by, I agree, 302).

-Barry
Re: [Mailman-Developers] Improving the archives
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On Jul 24, 2007, at 11:04 PM, Stephen J. Turnbull wrote: So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL. It is the archive server's job to decide what is the canonical URL for a message. There's a good chance these archival URLs will be served by an HTTP redirect. So let's not use the word canonical. :) If it's not going to be canonical (I forget if there's a standard for that word :), what is the point in writing an RFC? I completely agree. Maybe interoperable is the right word to use. Or user friendly interoperable archive url which is really what we're trying to define here (IMO). There needs to be a way to *enforce* uniqueness, and it *must* be specified by the RFC in order for archive implementations to be interoperable. Note that word specify; I do not insist that this level of robustness be *required*. But if we don't specify it now, people who want such robustness will have to do all this work again, and possibly will end up with something that some servers conforming to your RFC will not conform to. Yep. It is possible that most archivers will simply use the message ID, and do something brutal in the rare case of a collision. That's fine. But an archiver that wants to provide a canonical URL which is guaranteed to uniquely and losslessly identify a post in its archive should have a standard way to do that. Yep. The main thing that bugs me is message-ids are long, which makes them awkward to embed in a URL in the footer of a message. The footer URL is of no concern in this discussion. There is not going to be a requirement that footer URLs be canonical, not if I have any say in the matter. The canonical URL will be in (or be constructed from) the message header. Agreed in the sense that the RFC 2822 headers must contain all the information necessary to construct the canonical url (or must contain the canonical url). 
A list server /can/ decorate the message with the url in other ways, but that certainly isn't necessary. You might even imagine a mail reader extension that read the appropriate List-* headers and added a "View In Archive" button which sent the canonical url to your web browser. Once that happens, the archive service is free to redirect to its heart's content. I submit though that any good archive service (and certainly Pipermail++ if I can help it) will ensure that those urls are stable forever, otherwise people will stop relying on them.

-Barry
Re: [Mailman-Developers] Improving the archives
On Jul 25, 2007, at 12:47 AM, Jeff Breidenbach wrote:

>> What you gain from my proposal over a pure Message-ID approach is guaranteed uniqueness given the list copy

> Guarantee is a pretty strong word. A malicious person could post two messages with the same message-id, same date, but different bodies.

No question, if the archive service and the list server are not intimately connected, the communication channel between the two can be subverted. There are ways that channel's trust could be enhanced though, for example by the list server signing its headers in a DKIM-like fashion. But in situations where the two are co-located, you can trust these headers even without that enhancement.

> So that moves us to how many collisions are reduced in practice. I have a question about the numbers Barry mined from the python lists. Are the collisions really that high? One should not count messages without a message-id, because the MLM can and should create one in that case.

I've uploaded the script I used to here:

http://wiki.list.org/download/attachments/786633/scan.py?version=1

It's probably not perfect, and certainly the python.org mboxes may not be representative enough of the real world. Please grab the script, tweak it and run it over your own raw archives; it should be easily modified to handle any of the mailbox formats supported by Python 2.5's mailbox module. If you improve the script or find numbers that lead to different conclusions, now's the time to know!

>> and human friendlier urls.

> That's a very compelling point. SHA1 can't be computed inside someone's head or simply cut-n-pasted together for old messages, but I think the usability benefits of short URLs (short enough that they can comfortably fit inside message bodies) outweigh this drawback. By the way, is SHA-1 still in favor? My impression was it was fading away after the Shandong University team partially cracked it.
We're not concerned with the cryptographic security claims of SHA1. I don't see any economically beneficial attack on the archives against SHA1 here. I think SHA1 is reasonably universally available, and marginally better than MD5, so it's probably good enough for this application.

You're right that no one is going to do SHA1 in their heads, and if they could, they're probably working for some TLA in a secret gubmit basement lab somewhere. The point of course is that a /program/ could easily apply the algorithm to a very minimal existing message and come up with the same canonical url. This enables all kinds of cool applications based on REST-y principles or whatever.

The fact that the algorithm leads to short(ish), largely unambiguous (to humans), readable urls is an important benefit -- probably /the/ most important benefit.

>> Throw it away or hide [Date]? The former would be a problem, but not the latter.

> Thrown away.

Really? Wow. I'd have thought every archiving service would want to keep a record of the raw message it received on the wire. That would allow it to regenerate the html archive if necessary, provide useful forensics, and allow for exactly the kind of data mining we're doing here. I can't see /any/ reason for not saving the raw messages in their entirety, especially for a public list. Maybe for a private one, where your data retention policies require you delete things after a certain amount of time, but even there, I can't see why you'd want to trim raw messages rather than just chucking them entirely.

> My favorite archival service is based on mhonarc, and raw mail goes into offline cold storage.

What's the advantage of that? Isn't disk space cheap as dirt? Probably cheaper if you've bought any topsoil recently :). Still, the raw messages are still available, right? So if there was enough value in calculating the canonical urls so that the archive service could be seen as an interoperability good citizen, then it could be done.
I'll just reiterate that I'm not married to including the Date header in the algorithm. Until proven otherwise by more research, I think it's a good idea to use because 1) it's required by RFC 2822 and 2) it seems to reduce collisions. I think the algorithm I propose would work just as well with Message-IDs alone, although there's more of a chance that the non-sequence numbered url will return multiple matches.

-Barry
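For anyone who wants to repeat the measurement on their own archives, here is a minimal sketch of the kind of scan discussed above. It is not the scan.py linked in the message: the function name `scan`, the result keys, and the choice to count every extra copy beyond the first as a repeat are assumptions made for illustration.

```python
from collections import Counter


def scan(messages):
    """Count missing Message-IDs and repeated identifiers.

    `messages` is any iterable of objects with a dict-like .get(),
    such as email.message.Message objects or a mailbox.mbox instance.
    """
    total = 0
    missing = 0
    by_id = Counter()
    by_id_and_date = Counter()
    for msg in messages:
        total += 1
        mid = msg.get('Message-ID')
        if mid is None:
            # The list manager could create an id here, so these are
            # reported separately instead of being counted as repeats.
            missing += 1
            continue
        mid = mid.strip()
        date = (msg.get('Date') or '').strip()
        by_id[mid] += 1
        by_id_and_date[(mid, date)] += 1
    return {
        'total': total,
        'missing_id': missing,
        # Extra copies beyond the first for each repeated key.
        'repeated_id': sum(n - 1 for n in by_id.values()),
        'repeated_id_and_date': sum(n - 1 for n in by_id_and_date.values()),
    }
```

With the standard library this could be run as `scan(mailbox.mbox(path))`, where `path` points at one of your own raw mbox files. Comparing `repeated_id` with `repeated_id_and_date` shows how much adding the Date header reduces repeats in a given archive.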
Re: [Mailman-Developers] Improving the archives
> Guarantee is a pretty strong word. A malicious person could post two messages with the same message-id, same date, but different bodies.

This is my concern too. Especially since this is known information; it is trivial to be malicious. Whatever was done, I think would *have* to deal with 'dupes', in some form or another.
Re: [Mailman-Developers] Improving the archives
Hi,

I think this is the first time that I'm posting here but hopefully not the last. Thanks to everyone involved for an incredible project. I'm not much of a developer but I like practical solutions and will do everything possible to help improve in this area even if it's just to give some feedback.

I'm very excited about this project and can't wait for the next version to come out with full integration between web forum and mailing list. I like this idea very much and it seems that we're going to see it real soon. :)

On 24/07/2007 18:43, Dale Newfield wrote:

> Jeff Breidenbach wrote:
>> In addition, Barry was talking about concocting a unique identifier from the Date field and Message-ID. I'm not a big fan of this idea, because the date field comes from the mail user agent and is often wildly corrupt; e.g. coming from 100 years in the future.

> Oh--I was assuming the Date to which he was referring was the current timestamp at which mailman was processing the message. I was going to say that this guarantees uniqueness, but I guess there are parallel mailman implementations where more than one machine/processor are all serving the same list, and then two different machines/processors might wind up with identical timestamps while processing two different messages.

I also like the idea of seeing the date somewhere in the URL but IMHO we also need to see a unique sequential number. How about the following idea:

http://my.list.server/archivebase/mylist/200707240001/msg1/
http://my.list.server/archivebase/mylist/200707250001/msg2/
http://my.list.server/archivebase/mylist/200707250002/msg3/

and at the same time allow the following:

http://my.list.server/archivebase/mylist/msg1/
http://my.list.server/archivebase/mylist/msg2/
http://my.list.server/archivebase/mylist/msg3/

This way you can see exactly how many messages were sent on a day and how many messages have been sent since the start. BTW the sequential number does not, in my view, have to be a decimal value.
Anything short and sweet will do as long as you can work it out and at the same time allow for almost unlimited growth.

Just an idea.

Regards,
Gustav H Meyer
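The numbering scheme above can be sketched in a few lines. This is only an illustration under assumptions: the class name `SequenceNamer`, the in-memory counters, and the base-36 helper are hypothetical, and a real archive would need to persist the counters and serialize access to them.

```python
import datetime


class SequenceNamer:
    """Hand out per-list and per-day sequence numbers (in-memory sketch)."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip('/')
        self.total = 0       # messages since the start of the list
        self.per_day = {}    # date -> number of messages seen that day

    def next_urls(self, day):
        """Return (dated_url, short_url) for the next message on `day`."""
        self.total += 1
        self.per_day[day] = self.per_day.get(day, 0) + 1
        # Date followed by a zero-padded counter for that day.
        stamp = '%s%04d' % (day.strftime('%Y%m%d'), self.per_day[day])
        dated = '%s/%s/msg%d/' % (self.base_url, stamp, self.total)
        short = '%s/msg%d/' % (self.base_url, self.total)
        return dated, short


def to_base36(n):
    """Render a non-negative integer with digits 0-9a-z (shorter than decimal)."""
    digits = '0123456789abcdefghijklmnopqrstuvwxyz'
    if n == 0:
        return '0'
    out = ''
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out
```

Calling `next_urls` once for 24 July 2007 and twice for 25 July 2007 yields the three URL pairs listed in the message. `to_base36` shows one way a non-decimal sequence number could stay short as the list grows.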
Re: [Mailman-Developers] Improving the archives
Barry Warsaw writes:

> I agree, I just don't think message-ids are user friendly enough to be this canonical url. Especially in this context, which is exactly where urls are thrown in users' faces. An archiving service is exactly the right place for redirecting human readable urls to the archiver's canonical url (by, I agree, 302).

I'm confused (to be precise, you're confusing me). If human readable URLs are exactly right for redirection to the canonical URL, why does the canonical URL need to be user friendly?

A quick remark: the git SCM uses BASE16 SHA1s for object names, but allows you to abbreviate them to the unique prefix. A friendly archive could do the same for your BASE32 ids.

Without going much into implementation, here's how I would write the conformance section for our RFC. The point is that I don't see any need to discuss user-friendliness or the implementation of UUIDs for the RFC! This means that getting those right from the start is not that important.

0. Conformance

0.1 List managers

A conforming list manager MUST provide the List-Archive header field if the post is being archived.

A conforming list manager MAY provide the List-Archive-UUID header field. If so, the value MUST be guaranteed unique, and it MUST be present in the post as provided to the archiver. The contents of this header need not be distinct from the contents of the Message-ID header, as long as the uniqueness guarantee is maintained.

0.2 Archives

A conforming archive MUST reserve the namespaces message-id/ and list-post-id/ relative to its base URL for the uses described below.

A conforming archive MUST support retrieval by Message-ID, using the namespace message-id/$(MESSAGE-ID) relative to its base URL.

The archive specified in the List-Archive header field MUST support access using the value of that field as its base URL.

A conforming archive SHOULD support retrieval by UUID, using the namespace list-post-id/$(LIST-ARCHIVE-UUID) relative to its base URL.
If the scheme is http or https, a conforming archive that does not support retrieval by UUID SHOULD return status 501 NOT IMPLEMENTED with an entity explaining that retrieval by UUID is not implemented.

A conforming archive MAY support friendlyurls for use where space is constrained (eg, in a post's footer).

A conforming archive may support any other URIs it wants to, too. <wink>

A third party SHOULD be able to regenerate a friendlyurl from the original message contents.

0.3 Software

Conforming archive software SHOULD provide interfaces for generating UUIDs and friendlyurls, if retrieval is supported. Conforming list managers SHOULD use these interfaces.

Some comments:

The interfaces for generating URLs should be provided as command line utilities as well as callable functions.

Although the conformance level for friendlyurl support is MAY, I expect that essentially all archives will support friendlyurls.

The namespace for UUIDs and friendlyurls should probably be more restricted than any valid URI.

"List manager" denotes any source of archival content (eg, you could imagine a user storing their outbox in an archive, so that the "list manager" would actually be the user's MUA).

The namespaces suggested above are good enough, I think, but there may be better ones.

Instead of 501 NOT IMPLEMENTED, I considered 410 GONE, but that implies a request to delete the reference. Since this is implemented as a header in the post, the archive could be augmented to support it later.

In the phrase "guaranteed unique", "guaranteed" means to the level provided by uuidgen or standard Message-ID generators.

Generation of friendlyurls or unique ids based on message body content is probably a bad idea.
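To make the reserved namespaces concrete, here is a minimal sketch of how a client might build the two retrieval URLs from the List-Archive value, plus the git-style unique-prefix lookup mentioned earlier in the message. The draft does not say how identifiers are escaped or whether angle brackets are kept, so stripping the brackets and percent-encoding the rest are assumptions, as are all three function names.

```python
from urllib.parse import quote


def _join(list_archive, namespace, identifier):
    # Assumption: drop surrounding angle brackets and percent-encode
    # everything else, so '@' and '/' are not read as URL structure.
    ident = identifier.strip().strip('<>')
    return '%s/%s/%s' % (list_archive.rstrip('/'), namespace,
                         quote(ident, safe=''))


def message_id_url(list_archive, message_id):
    """URL under the message-id/ namespace of the archive."""
    return _join(list_archive, 'message-id', message_id)


def list_post_id_url(list_archive, uuid):
    """URL under the list-post-id/ namespace of the archive."""
    return _join(list_archive, 'list-post-id', uuid)


def resolve_prefix(prefix, known_ids):
    """Return the single id starting with `prefix`, or None.

    None covers both "no match" and "ambiguous prefix"; a real archive
    would probably want to tell those two cases apart.
    """
    prefix = prefix.upper()
    matches = [k for k in known_ids if k.startswith(prefix)]
    return matches[0] if len(matches) == 1 else None
```

For example, `message_id_url('http://example.org/archive/', '<abc@example.org>')` produces a URL ending in `message-id/abc%40example.org`. The example.org host is a placeholder.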
Re: [Mailman-Developers] Improving the archives
Barry Warsaw writes:

> Yes, definitely. What do you think of the base32 examples I have on the wiki page?

They're somewhat better than Message-IDs for readability, but they're not user-friendly.

> On Jul 24, 2007, at 1:11 PM, Terri Oda wrote:
>> It seems silly to generate nice short links but then use message-id.

> The use case for the message-id is not people. It's software, which doesn't much care about "nice" and "short".

But the developers debugging and maintaining the software will thank us for the ease of verifying that the URL goes to the right place.
Re: [Mailman-Developers] Improving the archives
> Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking at a total duplicate rate of 0.29%.

Message IDs are supposed to be unique. This is discussed in RFC 822: 4.6.1 and RFC 1036: 2.1.5, and probably other places. If that's not the case, the mail transfer agent is broken. I think it's better to go ahead and use the message-id, rather than concoct yet another "this time we mean it!" unique identifier. This is a cost/benefit thing; the cost is some real world collisions, the benefit is a conceptually simpler system. Conceptually simpler things are good, especially when implemented all over the place.

Which brings me to suggestion #2, which is go ahead and write an RFC on how list servers should embed archival links in messages. This sounds like an internet wide interoperability issue as much as something mailman specific. Why not come up with a scheme usable by all list servers? And also describe a specification third party archival services can comply with. Besides, I've always wanted to help write an RFC. If we go that route, it would be good to get input from a range of people - one person I'd suggest is Earl Hood, author of mhonarc.

Thoughts?

Jeff

While I'm almost tempted to ignore a hit rate that low, if you think of an archive holding 1B messages, you still get a lot of duplicates. OTOH, the rate goes down even lower if you consider the message-id and date headers. (Note, I did not consider messages missing a date header). How likely is it that two messages with the same message-id and date are /not/ duplicates? Heck, at that point, I'd feel justified in simply automatically rejecting the duplicate and chucking it from the archive.

I spent a /little/ time looking at the physical messages that ended up as true collisions. Though by no means did I look at them all, they all looked related.
For example, with strategy 2 some messages look like they'd been inadvertently sent before they were completed. I need to see if there are any similarities in MUA behind these, but again, I think we might be able to safely assume that collisions on message-id+date can be ignored.

That leads me to the following proposal, which is just an elaboration on Stephen's. First, all messages live in the same namespace; they are not divided by target mailing list. Each message has two addresses, one is the Message-ID and one is the base32 of the sha1 hash of the Message-ID + Date. As Stephen proposes, Mailman would add these headers if an incoming message is missing them, and tough luck for the non-list copy. The nice thing is that RFC 2822 requires the Date header and states that Message-ID SHOULD be present.

Why the second address? First, it provides as close to a guaranteed unique identifier as we can expect, and second because it produces a nearly human readable format. For example, Stephen's OP would have a second address of

    >>> import base64, hashlib
    >>> mid
    '[EMAIL PROTECTED]'
    >>> date
    'Wed, 04 Jul 2007 16:49:58 +0900'
    >>> # XXX perhaps strip off angle brackets
    >>> h = hashlib.sha1(mid)
    >>> h.update(date)
    >>> base64.b32encode(h.digest())
    'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'

I like base32 instead of base64 because the more limited alphabet should produce less ambiguous strings in certain fonts and I don't think the short b64 strings are short enough to justify the punctuation characters that would result. While RFC 3548 specifies the b32 alphabet as using uppercase characters, I think any service that accepts b32 ids should be case insensitive. A really Postel-y service could even accept '1' for 'I' and '0' for 'O' just to make it more resilient to human communication errors.

I'd like to come up with a good name for this second address, which would suggest the name of the X- header we stash this value in. X-B32-Message-ID isn't very sexy.
Maybe X-Message-Global-ID, since I think there's a reasonable argument to make that for well-behaved messages, that's exactly what this is.

So now, think of the interface to a message store that supports this addressing scheme. Well it's something like:

    class MessageStore(Interface):
        def store_message(message):
            """Store the message.

            :raises ValueError: when the message is missing either the
                Message-ID header or a Date header.
            :raises DuplicateMessageError: when a message in the store
                already has a matching Message-ID and Date. An archive is
                free to raise this exception for duplicate Message-IDs
                alone.
            """

        def get_message_by_global_id(key):
            """Locate and return the message from the store that matches `key`.

            :param key: The Global ID of the message to locate. This is
                the base32 encoded SHA1 hash of the message's Message-ID
                and Date headers.
            :returns: The message object matching the Global ID, or None
                if there is no such match.
            """
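A minimal sketch of the proposal above, written for current Python where hashing needs bytes. Several things here are assumptions and not part of the proposal: the names `global_id`, `normalize_key`, and `InMemoryStore`, the UTF-8 encoding of header values, hashing the Message-ID exactly as given (angle brackets included), and returning the key from `store_message` as a convenience.

```python
import base64
import hashlib


def global_id(message_id, date):
    """Base32 of the SHA1 hash of Message-ID followed by Date."""
    h = hashlib.sha1(message_id.encode('utf-8'))
    h.update(date.encode('utf-8'))
    # A SHA1 digest is 20 bytes, so the base32 form is 32 characters
    # with no '=' padding.
    return base64.b32encode(h.digest()).decode('ascii')


def normalize_key(key):
    """Accept lower case, and '0'/'1' typed in place of 'O'/'I'.

    The base32 alphabet is A-Z and 2-7, so 0 and 1 never occur in a
    valid id and can be mapped without ambiguity.
    """
    return key.strip().upper().replace('0', 'O').replace('1', 'I')


class DuplicateMessageError(Exception):
    """A message with the same Message-ID and Date is already stored."""


class InMemoryStore:
    """Dictionary-backed stand-in for the MessageStore interface."""

    def __init__(self):
        self._by_key = {}

    def store_message(self, message):
        mid = message.get('Message-ID')
        date = message.get('Date')
        if mid is None or date is None:
            raise ValueError('message needs Message-ID and Date headers')
        key = global_id(mid, date)
        if key in self._by_key:
            raise DuplicateMessageError(key)
        self._by_key[key] = message
        return key

    def get_message_by_global_id(self, key):
        return self._by_key.get(normalize_key(key))
```

Because the id depends only on two headers, any third party holding a copy of the message can recompute the same key, which is the property the proposal relies on.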
Re: [Mailman-Developers] Improving the archives
Jeff Breidenbach writes:

>> Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking at a total duplicate rate of 0.29%.

> Message ID's are supposed to be unique.

Fortunately, a rule more honored in the observance than the breach. Nonetheless, it *is* breached. The Postel Principle applies here, IMO.

> better to go ahead and use the message-id, rather than concoct yet another "this time we mean it!" unique identifier.

That's not the point. We're not going to impose this on senders; that's what Message-ID is for, as you say. If a sender won't provide a proper Message-ID, third parties who get a CC are just out of luck.

I simply think we should be prepared for applications where relying on the sender to supply a UUID is not acceptable; we need to be able to provide one ourselves. Creating UUIDs is a solved problem, after all. So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.

Then we say that an archive SHOULD provide access to the resource via Message-ID if available, and define how to construct that URL from the List-Archive and Message-ID headers.

> Which brings me to suggestion #2, which is go ahead and write an RFC on how list servers should embed archival links in messages.

I think Barry already suggested that? Anyway, +1. But remember, a standards-track RFC should have a working implementation to point to.
Re: [Mailman-Developers] Improving the archives
"st" == Stephen J Turnbull
"Re: [Mailman-Developers] Improving the archives"
Tue, 24 Jul 2007 15:56:35 +0900

    st> Jeff Breidenbach writes:

    >>> Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking at a total duplicate rate of 0.29%.

    >> Message ID's are supposed to be unique.

    st> Fortunately, a rule more honored in the observance than the breach. Nonetheless, it *is* breached. The Postel Principle applies here, IMO.

Taking "be conservative in what you do" as being at least as important as "be liberal in what you accept from others", the devil can quote this scripture to support simplicity in this instance, IMHO.

    >> better to go ahead and use the message-id, rather than concoct yet another "this time we mean it!" unique identifier.

    st> That's not the point. We're not going to impose this on senders;

I read the quote as meaning "this time we mean it" really is unique, imposing nothing on senders.

    st> that's what Message-ID is for, as you say. If a sender won't provide a proper Message-ID, third parties who get a CC are just out of luck.

Right. Maybe that will encourage compliance. The complexity of catering to brokenness in this instance may be too high a price to impose on all.

jam
Re: [Mailman-Developers] Improving the archives
John A. Martin writes:

>>> better to go ahead and use the message-id, rather than concoct yet another "this time we mean it!" unique identifier.

>> That's not the point. We're not going to impose this on senders;

> I read the quote as meaning "this time we mean it" really is unique, imposing nothing on senders.

Ah. If so, my reply is "if you want something done right, do it yourself." *All robust databases assign a unique ID to each record.* Why shouldn't a mailing list archive do so?

> Right. Maybe that will encourage compliance. The complexity of catering to brokenness in this instance may be too high a price to impose on the all.

What complexity? Mailman just does

    msg['X-List-Archive-Received-ID'] = Email.msgid()

(or however the message ID generator is spelled). After that, it's up to the archiver whether to do anything with it or not. I proposed a way that it could be used; if that's considered too complex, fine. But simply assigning one is not complex or otherwise very costly.
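In today's standard library the generator is spelled `email.utils.make_msgid`. A minimal sketch of the assignment described above, assuming a hypothetical helper name `stamp` and assuming that an existing value should be left alone so that re-processing a message does not change its id:

```python
from email.message import Message
from email.utils import make_msgid

# Header name as proposed in the message above.
HEADER = 'X-List-Archive-Received-ID'


def stamp(msg, domain=None):
    """Give `msg` a list-assigned unique id unless it already has one."""
    if msg.get(HEADER) is None:
        # make_msgid() builds an id from the time, the process id and a
        # random number, in the usual <local-part@domain> form.
        msg[HEADER] = make_msgid(domain=domain)
    return msg
```

Here `domain` would be the list server's own host name. When it is left as `None`, `make_msgid` falls back to the local host name.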
Re: [Mailman-Developers] Improving the archives
There are three different parties coming to the table. One is the mail transfer agent of the sender, another is the list server, and the third is the archive server. Ideally, all three will be happy campers.

> So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.

It is the archive server's job to decide what is the canonical URL for a message. There's a good chance these archival URLs will be served by an HTTP redirect. So let's not use the word canonical. :)

> What complexity? Mailman just does
>     msg['X-List-Archive-Received-ID'] = Email.msgid()

Easy to introduce, harder to deal with. The archival server would now keep track of both the message-id and the x-list-archive-received-id. That's two namespaces that almost do the same thing. It's easier for the archive server to keep track of one namespace than two, and - most importantly - conceptually simpler.

From the perspective of the assorted list servers, it's easier to do nothing than to do something. So if they can get by with just message-id (which is already implemented) and not have to add x-list-archive-received-id, that's a smoother implementation path.

If we base on message-id, archival servers will be able to retroactively add support for all their stored messages, even those that are ten years old. And users holding an old message will be able to figure out that URL without doing any computational gymnastics.

Put another way, there's the possibility to reduce the archive servers' implementation to "search for this message-id", which is something really useful to have anyway, and therefore likely to get wider support.

In addition, Barry was talking about concocting a unique identifier from the Date field and Message-ID. I'm not a big fan of this idea, because the date field comes from the mail user agent and is often wildly corrupt; e.g. coming from 100 years in the future. Very painful if the archive is showing most recent message first.
Therefore an archival server is very likely to determine message date from the most recent received header (generally from a trusted mail transfer agent) rather than the date field. From the archive server's perspective, the best thing to do with the date field is throw it away.

So for these reasons, I'd rather stick with message-id and risk some real world collisions, instead of introducing another identifier. If the list server receives a message with no message-id, by all means create one on the spot. To me, this feels like the sweet spot in terms of cost benefit. The main thing that bugs me is message-ids are long, which makes them awkward to embed in a URL in the footer of a message.

Jeff
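A minimal sketch of taking the date from the most recent Received header, as described above. It assumes the usual layout in which each hop prepends its own Received field and ends it with `; date-time`. The function name `received_date` is hypothetical, and whether the topmost hop is actually trustworthy is a deployment question this sketch does not address.

```python
import email
from email.utils import parsedate_to_datetime


def received_date(msg):
    """Return the timestamp of the topmost Received header, or None."""
    received = msg.get_all('Received') or []
    if not received:
        return None
    # Each hop prepends its header, so the first one is the most recent.
    _, sep, stamp = received[0].rpartition(';')
    if not sep:
        return None
    try:
        return parsedate_to_datetime(stamp.strip())
    except (TypeError, ValueError):
        # Different Python versions signal an unparsable date differently.
        return None
```

An archive sorting by this value is not affected by a Date header set a century in the future, since that header is never consulted.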
Re: [Mailman-Developers] Improving the archives
Jeff Breidenbach wrote:

> In addition, Barry was talking about concocting a unique identifier from the Date field and Message-ID. I'm not a big fan of this idea, because the date field comes from the mail user agent and is often wildly corrupt; e.g. coming from 100 years in the future.

Oh--I was assuming the Date to which he was referring was the current timestamp at which mailman was processing the message. I was going to say that this guarantees uniqueness, but I guess there are parallel mailman implementations where more than one machine/processor are all serving the same list, and then two different machines/processors might wind up with identical timestamps while processing two different messages.

-Dale
Re: [Mailman-Developers] Improving the archives
On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:

>> So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.

> It is the archive server's job to decide what is the canonical URL for a message. There's a good chance these archival URLs will be served by an HTTP redirect. So let's not use the word canonical. :)

Someone already pointed out that the message ID is a bit long for a URL, so I'm guessing we're going to want some sort of shorter sequence number for messages for linking purposes.

Regardless of whether we *need* to generate our own unique ID, I'm leaning towards the thought that we're going to *want* to generate our own for usability reasons. In a perfect world, I think we'd have a sequence number so I could visit http://example.com/mailman/archives/listname/204.html and know that 205.html would be the next message to that list, but any short unique id would do if sequence numbers are too much of a pain.

It seems silly to generate nice short links but then use message-id. If we can generate nice short links, we might as well use 'em throughout, unless you really think the default use of the archive will be to search it by message-id (which I sincerely doubt, from my user experiences).

Terri
Re: [Mailman-Developers] Improving the archives
> Regardless of whether we *need* to generate our own unique ID, I'm leaning towards the thought that we're going to *want* to generate our own for usability reasons. In a perfect world, i think we'd have a sequence number so I could visit http://example.com/mailman/archives/listname/204.html and know that 205.html would be the next message to that list, but any short unique id would do if sequence numbers are too much of a pain.

I agree there are a lot of usability benefits from short URLs, but perhaps this is the job of the archive server, and not the list server. Mharc (an archive server) is a great example here. Mharc's canonical message format is pretty human friendly.

http://ww.mhonarc.org/archive/html/mharc-users/2002-08/msg0.html

Unfortunately, there's no trivial way for the list server to know that human friendly URL when the message is sent out. Fortunately, Mharc also happily handles messages by message-id, which the list server does know about.

http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=mharc-users[EMAIL PROTECTED]

Had I been the implementer, I'd probably have made mharc do an HTTP 302 redirect from the longer URL to the shorter URL. But that's beside the point. The point is we have an existing, working, happy archival server, and it would be really nice if list servers (such as mailman) were compatible. And by compatible, I mean offering the capability of embedding an archival URL in the footers of messages.

-Jeff
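The redirect idea above amounts to a lookup from message-id to the archive's friendly URL. A minimal sketch, assuming a hypothetical in-memory index and function name; it does not describe how mharc itself works.

```python
def resolve(message_id, friendly_urls):
    """Map a message-id request to an HTTP status and response headers.

    `friendly_urls` is a dict from bare message-id (no angle brackets)
    to the archive's human friendly URL for that message.
    """
    target = friendly_urls.get(message_id.strip().strip('<>'))
    if target is None:
        return 404, {}
    # 302 sends the client on to the friendly URL while leaving the
    # message-id URL usable by list servers that only know the id.
    return 302, {'Location': target}
```

The list server only needs to know the message-id form. The archive stays free to choose, and later change, whatever friendly URL it redirects to.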
Re: [Mailman-Developers] Improving the archives
On Jul 22, 2007, at 12:33 PM, Terri Oda wrote:

> On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
>> I've looked at a few lurker archivers and I wasn't blown away by its user interface. That's apparently highly configurable though.

> I've been doing a lot of thinking about interface, and I'm coming to the conclusion that something more like a web bulletin board is probably the way to go, given that people use them all the time without much trouble and with a fairly minimal amount of whining. ;)

I like this for several reasons. I've long wanted a bridge between the traditional mailing list and a forum because to me they're related along a spectrum of emotional investment.

What I mean is this. For the subjects and projects I care deeply about, I join the mailing list. I want to be intimately involved in the day-to-day collaboration that being subscribed gives me. I care enough about that that I'm willing to put up with the pain that comes along with mailing lists, such as the overhead for subscribing, deleting topics I don't care about, the occasional spam, the overhead of going on vacation or leaving the list, etc.

But there are even more topics or projects that I have only a fleeting interest in. Say I find a bug in some X program, or wake up and decide to learn how to use setuptools, or find that some recent update broke my Linux server. In all those cases, I might want to start a thread of discussion or ask a question, and be very involved in that thread for a week or two. Then, my interest wanes, or I get my question answered, or other projects pique my interest.

Mailing lists are pretty bad at managing those kinds of fleeting involvement, but forums are quite nice. There's usually fairly low overhead (and probably even less if OpenID and such were in widespread adoption) for joining, and when I lose interest the forum doesn't fill up my inbox.
OTOH, forums seem good for short 'instant' messages, but not so good (IMO) for free ranging, detailed discussions. So there's a spectrum.

> I'm trying to use interfaces to things like comment systems (which are often threaded -- picture the slashdot stuff, maybe?) and popular boards like phpbb (which isn't threaded beyond separate topics) as guides to how people usually deal with conversations on the web. It'd actually be fairly easy, at that point, to just put a posting interface into the archives (yes, you'd have to be logged in, and yes, this means your password becomes that bit more valuable because someone having it can pose as you to the list... but they could do that by spoofing your email address so I'm not too concerned). But then people who don't like email or just want to pop by and check the list quickly could actually use mailman like a web board, which is something I'm pretty sure would get used (I know my users have asked for it in the past).

Heck, /I'd/ use it, so what more justification do we need? :)

> I've been drafting simple prototype interfaces in my head, trying to keep potential architectures in mind. I'm hoping I'll have time this week to code up some HTML and see how well they actually work when they're not just inside my head. :)

I'd love to see the prototypes once you've committed them to HTML. The one important thing is that the individual postings will need the equivalent of a stable archive URL (i.e. permalink) that could be passed around, added to web pages, etc.
-Barry
Re: [Mailman-Developers] Improving the archives
On Jul 24, 2007, at 2:02 AM, Jeff Breidenbach wrote:

> Which brings me to suggestion #2, which is go ahead and write an RFC on how list servers should embed archival links in messages. This sounds like an internet wide interoperability issue as much as something mailman specific. Why not come up with a scheme usable by all list servers? And also describe a specification third party archival services can comply with. Besides, I've always wanted to help write an RFC. If we go that route, it would be good to get input from a range of people - one person I'd suggest is Earl Hood, author of mhonarc.

I've always thought that an RFC-like spec that describes how a generic mailing list manager would interoperate with a generic archiving service is the way to go. I've written up a somewhat more formal spec of what I've currently implemented in MM3 here:

http://wiki.list.org/display/DEV/Stable+URLs

If this looks good, I'd be happy to approach some of the related communities to try to get buy-in.

-Barry
Re: [Mailman-Developers] Improving the archives
On Jul 24, 2007, at 2:56 AM, Stephen J. Turnbull wrote:

> I simply think we should be prepared for applications where relying on the sender to supply a UUID is not acceptable; we need to be able to provide one ourselves. Creating UUIDs is a solved problem, after all. So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL. Then we say that an archive SHOULD provide access to the resource via Message-ID if available, and define how to construct that URL from the List-Archive and Message-ID headers.

I think there are two approaches we could argue for. One is for the mailing list manager to craft a UUID out of whole cloth and stick that in a header. Then any downstream archiver would be obliged to use that header value as the canonical address of the message, with an alternative path to the message via the Message-ID (possibly returning a list of matching messages when there are collisions).

The second approach, and the one that I favor, is to use the Message-ID (and the Date) header on the original message as the UUID, properly handling corner cases like duplicate or missing headers. This UUID serves as the basis for the address to the message resource just like above.

I like the second approach better because in the case where you start with an off-list copy of the message, you have a decent enough chance of getting to the archived message, or at least to a resource containing a link to the message. The first alternative would require access to the list copy.

Imagine if every archiver supported my proposal: knowing just the Message-ID and Date header, you could get to that message from almost anywhere, just by using the UUID as a relative URL rooted at say http://www.mail-archive.com, http://groups.google.com, http://mail.python.org/pipermail, or whatever. That would be pretty neat.
-Barry
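A minimal sketch of the second approach, assuming the identifier is the base32-encoded SHA-1 digest of the Message-ID concatenated with the Date, as mentioned elsewhere in this thread. The function name `stable_id` and the corner-case policy (first occurrence wins for duplicated headers, empty string for a missing Date, error for a missing Message-ID) are assumptions for illustration; the thread does not settle them.

```python
import base64
import hashlib
from email.message import Message


def stable_id(msg):
    """Derive a 32-character identifier from Message-ID and Date."""
    message_ids = msg.get_all("Message-ID", [])
    dates = msg.get_all("Date", [])
    if not message_ids:
        # Assumed policy: the list manager adds a Message-ID before hashing.
        raise ValueError("message has no Message-ID header")
    # Assumed policy: use the first occurrence of a duplicated header,
    # and an empty string when Date is missing.
    message_id = message_ids[0].strip()
    date = dates[0].strip() if dates else ""
    digest = hashlib.sha1((message_id + date).encode("utf-8")).digest()
    # A SHA-1 digest is 20 bytes, which base32-encodes to exactly
    # 32 characters with no padding.
    return base64.b32encode(digest).decode("ascii")


msg = Message()
msg["Message-ID"] = "<abc123@example.com>"
msg["Date"] = "Tue, 24 Jul 2007 12:00:00 -0400"
print(stable_id(msg))
```

Anyone holding a copy of the message, on-list or off-list, can repeat this computation as long as both headers survive unchanged.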
Re: [Mailman-Developers] Improving the archives
On Jul 24, 2007, at 12:31 PM, Jeff Breidenbach wrote:

>> What complexity? Mailman just does msg['X-List-Archive-Received-ID'] = Email.msgid()
>
> Easy to introduce, harder to deal with. The archival server would now keep track of both the message-id and the x-list-archive-received-id. That's two namespaces that almost do the same thing. It's easier for the archive server to keep track of one name space than two, and - most importantly - conceptually simpler.

True, but an archiver already has to handle collisions on the Message-ID so in a sense, you have to maintain multiple paths to the same message, don't you?

So I like my proposal because it imposes nothing additional on the MUA or MTA, a tiny bit more on the MLM, and some extra work (though I think not much) on the archiving agent. What you gain from my proposal over a pure Message-ID approach is guaranteed uniqueness given the list copy, and human friendlier urls.

> From the perspective of the assorted list servers, it's easier to do nothing than to do something. So if they can get by with just message-id (which is already implemented) and not have to add x-list-archive-received-id, that's a smoother implementation path. If we base on message-id, archival servers will be able to retroactively add support for all their stored messages, even those that are ten years old. And users holding an old message will be able to figure out that URL without doing any computational gymnastics.

All these are still true with my proposal, except with the observation, as Stephen points out, that given a URL based on sender-provided headers, you must be prepared to deal with collisions, so sometimes your resources will return lists. The advantage of adding a bit of MLM-provided information is that given the list copy you can guarantee uniqueness, and given the off-list copy you can get to a resource that contains a link to the message you want.
> Put another way, there's the possibility to reduce the archive servers' implementation to "search for this message-id", which is something really useful to have anyway, and therefore likely to get wider support. In addition, Barry was talking about concocting a unique identifier from the Date field and Message-ID. I'm not a big fan of this idea, because the date field comes from the mail user agent and is often wildly corrupt; e.g. coming from 100 years in the future. Very painful if the archive is showing most recent message first. Therefore an archival server is very likely to determine message date from the most recent received header (generally from a trusted mail transfer agent) rather than the date field. From the archive server's perspective, the best thing to do with the date field is throw it away.

Throw it away or hide it? The former would be a problem, but not the latter. Does your archiver keep a canonical copy of the message as you received it? If so, then you preserve the original Date header enough for the calculation to occur, even if you hide the Date header, or display a Received header date when you render it to HTML. That doesn't matter of course.

But I should point out that I'm not married to including the Date header in the hash. I like it because it appears to reduce collisions, which I care about. But I still like using the base32 sha1 hash instead of the raw Message-ID because I think it's easier for humans to use, read, speak, and copy. Of course this doesn't mean that you need to disable your search-by-Message-ID feature!

> So for these reasons, I'd rather stick with message-id and risk some real world collisions, instead of introduce another identifier. If the list server receives a message with no message-id, by all means create one on the spot. To me, this feels like the sweet spot in terms of cost benefit. The main thing that bugs me is message-ids are long, which makes them awkward to embed in a URL in the footer of a message.
Another advantage for the URL scheme I propose. You know you're going to end up with URLs of length

len(host-prefix) + 32 + 1 + #digits-in-seqno

where
- 32 == len(base32(sha1digest(data)))
- 1 == the / divider
- #digits-in-seqno == e.g. len(str(seqno))

You should be able to keep things in the 60-70 character range, including the host name. That doesn't seem too bad.

-Barry
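The length estimate above can be checked with a few lines. This is only a sketch: `archive_url` and `url_length` are hypothetical helpers, the host prefix is an example, and the hash is a 32-character placeholder rather than a real digest.

```python
def archive_url(host_prefix, hash_id, seqno):
    """Join the pieces as host-prefix + hash + "/" + sequence number."""
    return f"{host_prefix}{hash_id}/{seqno}"


def url_length(host_prefix, seqno):
    """Length predicted by the formula in the message above."""
    return len(host_prefix) + 32 + 1 + len(str(seqno))


# Placeholder standing in for base32(sha1(...)), which is always 32 characters.
placeholder_hash = "A" * 32
url = archive_url("http://mail.python.org/", placeholder_hash, 7)
print(len(url), url_length("http://mail.python.org/", 7))
```

With this example prefix the total stays under the 60-70 character range quoted above; longer host names or deeper base paths eat into that margin.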
Re: [Mailman-Developers] Improving the archives
Jeff Breidenbach writes:

>> So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.
>
> It is the archive server's job to decide what is the canonical URL for a message. There's a good chance these archival URLs will be served by an HTTP redirect. So let's not use the word canonical. :)

If it's not going to be canonical (I forget if there's a standard for that word :), what is the point in writing an RFC?

>> What complexity? Mailman just does msg['X-List-Archive-Received-ID'] = Email.msgid()
>
> Easy to introduce, harder to deal with. The archival server would now keep track of both the message-id and the x-list-archive-received-id. That's two namespaces that almost do the same thing.

The implementations are similar, and there is nearly a one-to-one correspondence. But the semantics are very different. Message-ID is untrustworthy, the internal ID is trustworthy.

> So for these reasons, I'd rather stick with message-id and risk some real world collisions, instead of introduce another identifier.

Go ahead and stick with message-id if *you* like, but please don't tell *me* what risks I have to accept. There needs to be a way to *enforce* uniqueness, and it *must* be specified by the RFC in order for archive implementations to be interoperable. Note that word "specify"; I do not insist that this level of robustness be *required*. But if we don't specify it now, people who want such robustness will have to do all this work again, and possibly will end up with something that some servers conforming to your RFC will not conform to.

It is possible that most archivers will simply use the message ID, and do something brutal in the rare case of a collision. That's fine. But an archiver that wants to provide a canonical URL which is guaranteed to uniquely and losslessly identify a post in its archive should have a standard way to do that.
> The main thing that bugs me is message-ids are long, which makes them awkward to embed in a URL in the footer of a message.

The footer URL is of no concern in this discussion. There is not going to be a requirement that footer URLs be canonical, not if I have any say in the matter. The canonical URL will be in (or be constructed from) the message header.
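The quoted one-liner, `msg['X-List-Archive-Received-ID'] = Email.msgid()`, is shorthand rather than a real API call. A rough standard-library equivalent might look like the sketch below; the header name comes from the thread, while the `stamp` function and the `lists.example.org` domain are assumptions for illustration.

```python
from email.message import EmailMessage
from email.utils import make_msgid

HEADER = "X-List-Archive-Received-ID"


def stamp(msg, domain="lists.example.org"):
    """Attach a list-manager-generated ID unless one is already present."""
    if msg[HEADER] is None:
        # make_msgid() returns a fresh "<...@domain>" identifier each call.
        msg[HEADER] = make_msgid(domain=domain)
    return msg[HEADER]


msg = EmailMessage()
msg["Message-ID"] = "<sender-chosen@example.com>"  # untrusted, sender-supplied
print(stamp(msg))  # trusted, list-manager-supplied
```

The sender's Message-ID is left untouched, so an archiver can index both: the sender-supplied one for lookups from off-list copies, and the list manager's one where uniqueness has to be enforced.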
Re: [Mailman-Developers] Improving the archives
> What you gain from my proposal over a pure Message-ID approach is guaranteed uniqueness given the list copy

Guarantee is a pretty strong word. A malicious person could post two messages with the same message-id, same date, but different bodies. Sometimes the channel between the MLM and the archive server will be SMTP, and spurious messages can be injected. Finally, from the archive server's perspective, some of the MLMs might make mistakes - just like from the MLM's perspective, some of the MTAs might make mistakes in setting message-id. So I don't think the proposed SHA1(date, message-id) scheme buys a hard guarantee of uniqueness. Every component has to protect itself, but none can solve the world's problems.

So that moves us to how many collisions are reduced in practice. I have a question about the numbers Barry mined from the python lists. Are the collisions really that high? One should not count messages without a message-id, because the MLM can and should create one in that case. One should also not count collisions of messages going to different lists. Here's why. Let's say message M is cross posted to lists L1 and L2. Even though it is the same message, there are now two different contexts. (For example, people visiting M at archive L1 should get a completely different experience if they hit next message than people visiting M at archive L2.) So I'd be curious what the collision numbers come to with these two factors taken into account. The other takeaway is that the list name really should be part of the URL to get proper context. The earlier example from Mharc does this.

> and human friendlier urls.

That's a very compelling point. SHA1 can't be computed inside someone's head or simply cut-n-pasted together for old messages, but I think the usability benefits of short URLs (short enough that they can comfortably fit inside message bodies) outweigh this drawback. By the way, is SHA-1 still in favor?
My impression was it was fading away after the Shandong University team partially cracked it.

> Throw it away or hide [Date]? The former would be a problem, but not the latter.

Thrown away. My favorite archival service is based on mhonarc, and raw mail goes into offline cold storage. Of course this can be changed for the future messages with some pain, but there's no reasonable way for myself (or any other mhonarc users in the same predicament) to retrofit against Date based URLs. For the record, here's what mhonarc embeds in each HTML page it produces because these were considered the important headers. In this message sent from Australia, the date shows a timezone of UTC -0700, because it was pulled from the received header.

<!-- MHonArc v2.6.15 -->
<!--X-Subject: [Gossip] Re: green&#45;travel resources {webliographies} -->
<!--X-From-R13: [nephf Z. Saqvpbgg zraqvpbgNlnubb.pbz -->
<!--X-Date: Wed, 26 Apr 2006 00:27:27 &#45;0700 -->
<!--X-Message-Id: [EMAIL PROTECTED] -->
<!--X-Content-Type: text/plain -->
<!--X-Reference: [EMAIL PROTECTED] -->
<!--X-Head-End-->

So my main request is to double check the numbers, see if using Date really buys as much as one thinks. I'll keep digesting the other aspects of the wiki page.
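The recount requested above could be sketched like this, under the two stated rules: ignore messages with no Message-ID, and treat the same Message-ID on different lists as distinct. The function name and the input format (tuples of list name, Message-ID, Date) are assumptions; the sample data is invented and says nothing about the real python.org numbers.

```python
from collections import Counter


def count_collisions(posts):
    """Count surplus messages sharing a key, per list.

    posts: iterable of (list_name, message_id, date); message_id may be None.
    Returns (collisions keyed on Message-ID alone,
             collisions keyed on Message-ID plus Date).
    """
    by_msgid = Counter()
    by_msgid_date = Counter()
    for list_name, message_id, date in posts:
        if not message_id:
            # The MLM would create an ID here, so this is not a collision.
            continue
        # Including list_name keeps cross-posts from counting as collisions.
        by_msgid[(list_name, message_id)] += 1
        by_msgid_date[(list_name, message_id, date)] += 1

    def surplus(counter):
        return sum(n - 1 for n in counter.values() if n > 1)

    return surplus(by_msgid), surplus(by_msgid_date)


sample = [
    ("L1", "<a@x>", "d1"),
    ("L2", "<a@x>", "d1"),  # cross-post: different list, not a collision
    ("L1", "<a@x>", "d2"),  # same ID, different Date on the same list
    ("L1", None, "d3"),     # no Message-ID: skipped
    ("L1", "<b@x>", "d4"),
    ("L1", "<b@x>", "d4"),  # same ID and same Date on the same list
]
print(count_collisions(sample))
```

Comparing the two numbers on real archives would show how much including Date actually buys.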
Re: [Mailman-Developers] Improving the archives
On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:

> I've looked at a few lurker archivers and I wasn't blown away by its user interface. That's apparently highly configurable though.

I've been doing a lot of thinking about interface, and I'm coming to the conclusion that something more like a web bulletin board is probably the way to go, given that people use them all the time without much trouble and with a fairly minimal amount of whining. ;)

I'm trying to use interfaces to things like comment systems (which are often threaded -- picture the slashdot stuff, maybe?) and popular boards like phpbb (which isn't threaded beyond separate topics) as guides to how people usually deal with conversations on the web.

It'd actually be fairly easy, at that point, to just put a posting interface into the archives (yes, you'd have to be logged in, and yes, this means your password becomes that bit more valuable because someone having it can pose as you to the list... but they could do that by spoofing your email address so I'm not too concerned). But then people who don't like email or just want to pop by and check the list quickly could actually use mailman like a web board, which is something I'm pretty sure would get used (I know my users have asked for it in the past).

I've been drafting simple prototype interfaces in my head, trying to keep potential architectures in mind. I'm hoping I'll have time this week to code up some HTML and see how well they actually work when they're not just inside my head. :)

Terri
Re: [Mailman-Developers] Improving the archives
Terri Oda wrote:

> I've been doing a lot of thinking about interface, and I'm coming to the conclusion that something more like a web bulletin board is probably the way to go

For public lists, the answer may lie in external tools like nabble.com or mailinglistarchive.com

Of course, that doesn't help for lists wishing to keep their content private.

-Dale
Re: [Mailman-Developers] Improving the archives
On Fri, Jul 20, 2007 at 11:16:19AM -0400, Barry Warsaw wrote:

> Cool. I wonder if lurker is compatible with Python 2.5's mailbox.Maildir implementation and whether the two could share the maildirs. Thanks for the information!

It had better be -- Maildir has a published specification. If there's an incompatibility, that would be a bug in either mailbox.py or lurker.

--amk
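For reference, reading a Maildir with the standard library's `mailbox.Maildir` looks roughly like the sketch below. It uses the current Python 3 module rather than the 2.5 version discussed above, the helper names are made up for the example, and it says nothing about whether lurker actually reads the same directories.

```python
import mailbox
import os
import tempfile


def message_ids(maildir_path):
    """Return the sorted Message-IDs found in an existing Maildir."""
    box = mailbox.Maildir(maildir_path, create=False)
    try:
        return sorted(msg["Message-ID"] for msg in box if msg["Message-ID"])
    finally:
        box.close()


def demo():
    """Build a throwaway Maildir with two messages and read it back."""
    path = os.path.join(tempfile.mkdtemp(), "archive")
    box = mailbox.Maildir(path, create=True)
    box.add(b"Message-ID: <one@example.com>\n\nfirst post\n")
    box.add(b"Message-ID: <two@example.com>\n\nsecond post\n")
    box.close()
    return message_ids(path)


print(demo())
```

Any other program that follows the Maildir layout should, in principle, be able to read what this writes, which is the point made in the reply above.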
Re: [Mailman-Developers] Improving the archives
On Jul 4, 2007, at 3:30 PM, Jeff Breidenbach wrote:

>> Maybe a way to think about this is that the canonical url is based on the message-id, but then there's some way to distill even this down to a tinyurl or simple integer that would be stable in the face of full archive regenerations.
>
> I'd suggest the reverse. Keep the canonical archive URL short and sweet, and then use a URL redirection service to map message-id's to those URLs. It is the archiver's job to make it all work. For example, the canonical archive URL might stay exactly the way it is in pipermail. But the archival link embedded in the message would instead go to a redirection service.

I agree. My proposed global message id is exactly the canonical archive URL, although it's relative to the archiver's base url, as given in the List-Archive header.

> http://mail.codeit.com/pipermail/zcommerce/2002-February/000523.html
> http://mail.codeit.com/[EMAIL PROTECTED]
>
> The one other thing I'd like to revisit is integration with third party archival services. There are two obvious integration points; one is a button in the Mailman list admin user interface that says "archive with service X", not unlike the setting in Firefox that basically says "search with service X".

I think we could define an interface that archive services would have to meet in order to be available to list admins. The site admin would of course have to enable them site-wide first. What kinds of information would be required?

- List-Archive base url
- Message injection procedure
- Additional subscription procedures

The nice thing is that if my global id idea works, the injection process can be completely asynchronous.

> The other integration point is the archival link discussed above. In which case it would be set to something like.
>
> http://third-party-service/[EMAIL PROTECTED]

All we'd need to know is the third party's List-Archive header value. The last part of the path would always be the global message id.
-Barry
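The three pieces of information listed in the message above suggest a small plug-in interface. The sketch below is purely hypothetical: the class and method names are invented here and do not describe Mailman's actual archiver API.

```python
from abc import ABC, abstractmethod


class Archiver(ABC):
    """Hypothetical interface an archive service would have to meet."""

    @property
    @abstractmethod
    def list_archive_base(self):
        """Base URL, i.e. the value advertised in the List-Archive header."""

    @abstractmethod
    def inject(self, raw_message):
        """Hand a message to the archive; may be queued asynchronously."""

    def subscribe(self, list_address):
        """Optional extra subscription step; most services need none."""
        return None

    def permalink(self, global_id):
        """Base URL plus the global message id as the last path component."""
        return self.list_archive_base.rstrip("/") + "/" + global_id


class InMemoryArchiver(Archiver):
    """Toy implementation that only queues messages in memory."""

    def __init__(self, base):
        self._base = base
        self.queue = []

    @property
    def list_archive_base(self):
        return self._base

    def inject(self, raw_message):
        self.queue.append(raw_message)


archiver = InMemoryArchiver("http://third-party-service/")
print(archiver.permalink("GLOBALID"))
```

Because `permalink` needs only the base URL and the global id, the list manager can compute the link before the archiver has seen the message, which is what allows injection to be asynchronous.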
Re: [Mailman-Developers] Improving the archives
On Jul 4, 2007, at 1:16 PM, Dale Newfield wrote:

> Barry Warsaw wrote:
>> Maybe a way to think about this is that the canonical url is based on the message-id, but then there's some way to distill even this down to a tinyurl or simple integer that would be stable in the face of full archive regenerations.
>
> The resistance to basing this on message-id has always been that there's no guarantee of uniqueness...
>
> ...but I believe each list has some sort of counter for how many messages it's seen, so we could add another header with that number, and use as a unique id the two concatenated together...
>
> (That way the archiver can know from the content of the header exactly how to generate the same unique id as mailman, which would allow for the url-in-the-footer to happen w/o first hitting the archiver.)

I'm not crazy about this idea for a couple of reasons. First, it means that someone who has a copy of the message that didn't come from the list (e.g. one of the two you will get of this message), cannot calculate this unique ID.

Second, things can happen to a list that might cause this sequence number to get corrupted. Maybe a list will get deleted and then recreated. Maybe it will get moved and the sequence number will get reset in the move. Maybe the list will be upgraded to a new version of Mailman.

I think we can do just as well by using Message-ID + Date and get very low collision rates.
-Barry
Re: [Mailman-Developers] Improving the archives
On Jul 8, 2007, at 1:06 AM, Paul Wise wrote:

> My personal opinion is that pipermail should be removed and mailman should not contain a default archiver since there are plenty of good archivers already (lurker, mhonarc etc). Adding wrappers around them would be simpler than reimplementing them.

My hesitation to this has always been the turnkey question. Pipermail has its problems but it /does/ allow small sites to get going very quickly with a full(-ish) solution. It may be that most people get their Mailman installation from their distro or hosting service and this is no longer as important. In that case, I still wouldn't chuck Pipermail, but I would try to see if we can adopt Jeff's goal of making the archive selection pluggable and easily selectable by list admins.

-Barry
Re: [Mailman-Developers] Improving the archives
On Jul 9, 2007, at 11:09 PM, Stephen J. Turnbull wrote:

> John A. Martin writes:
>> In the absence of a Message-ID on an outgoing mail message many if not most MTAs will add one. Why not let Mailman anticipate the need to add a Message-ID when archiving the message rather than leaving it to the outgoing MTA?
>
> Quite. My reason for saying "last resort" is simply that this is not predictable to third parties. Eg, I send you (a non-subscriber) a message with CC and no Message-ID. You'd like to find the thread in the archives. You may as well just do a linear search on that month's threads.

Yep, and I say tough. Let John complain to Stephen to fix his MTA to add those Message-IDs so Mailman doesn't have to. ;)

> An URL based on an MD5 of the message body in theory would work, but in the presence of non-ASCII bodies, structured MIME, ML digests, and various MTA autoconversions, that seems fragile.

Agreed, and it would do no better, in fact worse, than base32(sha1(message-id + date))

-Barry
Re: [Mailman-Developers] Improving the archives
On Jul 5, 2007, at 12:09 PM, John Dennis wrote:

> A little over a year ago I went on a search to find the best open source archiver and at that time I came up with Lurker (http://lurker.sourceforge.net). Since then I believe Lurker has seen a major new revision. I also believe Lurker is the archiver used by Debian. So if you want to leverage existing open source archiving or at least look at an example of what would be necessary to allow easy external archiving integration with Mailman you might want to look at Lurker.

I've looked at a few lurker archivers and I wasn't blown away by its user interface. That's apparently highly configurable though. Lurker's GPL2 so that's fine.

I'd be quite hesitant about shipping Mailman with Lurker because it's something we don't control and it's not Python. But I would be totally open to working with the Lurker developers on creating an easy bridge between the two systems. Perhaps this dovetails with Jeff's suggestion of easier integration with external archiving systems.

Does anybody have contacts with the Lurker community that could cross-post a new thread to get the discussion going? (The same goes for any other archiver out there too.)

-Barry
Re: [Mailman-Developers] Improving the archives
Barry Warsaw writes:

> First, I want to avoid talking about file system layout. To me, that's an implementation detail we needn't worry about right now.

Agreed.

> How likely is it that two messages with the same message-id and date are /not/ duplicates?

For message id generators that include a time-stamp in the generated id, approximately the same as the probability that two messages with the same message-id are not duplicates, no?

> Heck, at that point, I'd feel justified in simply automatically rejecting the duplicate and chucking it from the archive.

I'd rather not go there. There may be applications for the archiver that require that all mail received be filed.

Counterproposal: have a collisions namespace, and provide an interface for the list owner to decide what to do with them. They could be thrown away, they could be given an alternative global ID somehow and added (e.g., the archive page could add a "See probable duplicates too" link), or they could be put into a moderation-like queue for list admins to decide about.

> So now, think of the interface to a message store that supports this addressing scheme. Well it's something like:

I don't understand how the calling application is supposed to deal with a DuplicateMessageError exception since it should not change either the Message-ID or the Date if present. I see this as a major problem with any proposal to use only author headers in computing the global id.

> Or by using the global id, or by rejecting messages with duplicate message ids.

Er, the MTA has already accepted it. Do you plan to generate a list manager bounce to the poster? This has the unpleasant misfeature that it could be used to bounce spam off the list manager, since the poster needs to see content to determine whether this is a multiple send or actually the intended version after a fat-finger send; we already know the message-id isn't good enough.
Re: [Mailman-Developers] Improving the archives
Barry Warsaw writes:

> Second, things can happen to a list that might cause this sequence number to get corrupted.

Add an X-Mailman-Sequence-Number header if not already present.

That doesn't deal with your other comments, but as I point out elsewhere, if you don't use *any* Mailman-specific information in the global ID, you have no sane way to handle collisions except throw them away (or make the global ID refer to a collection resource, but that's kinda unintuitive).
Re: [Mailman-Developers] Improving the archives
On Jul 20, 2007, at 9:21 AM, Stephen J. Turnbull wrote:

>> How likely is it that two messages with the same message-id and date are /not/ duplicates?
>
> For message id generators that include a time-stamp in the generated id, approximately the same as the probability that two messages with the same message-id are not duplicates, no?

Good point, though clearly not all message-ids have timestamp information in them. It does help explain why I see 600-odd more collisions when taking other data into account too. I've modified my script to sort collisions and dupes into maildir folders, so I'll take a closer look when that finishes running (it takes a long time to slog through all 5 mboxes, even on a fairly zippy dual-G5).

>> Heck, at that point, I'd feel justified in simply automatically rejecting the duplicate and chucking it from the archive.
>
> I'd rather not go there. There may be applications for the archiver that require that all mail received be filed.

True. It would ultimately be an archiver policy though.

> Counterproposal: have a collisions namespace, and provide an interface for the list owner to decide what to do with them. They could be thrown away, they could be given an alternative global ID somehow and added (e.g., the archive page could add a "See probable duplicates too" link), or they could be put into a moderation-like queue for list admins to decide about.

I like this.

>> So now, think of the interface to a message store that supports this addressing scheme. Well it's something like:
>
> I don't understand how the calling application is supposed to deal with a DuplicateMessageError exception since it should not change either the Message-ID or the Date if present. I see this as a major problem with any proposal to use only author headers in computing the global id.

Mailman would probably log and ignore DuplicateMessageErrors. It wouldn't be Mailman's responsibility to ensure the message gets archived, although I concede that as currently defined, you could end up with list copies that had a global id header that wasn't unique. OTOH, if the archiver implements a collision resolution policy such as a 'collisions' namespace, it wouldn't ever raise DuplicateMessageError.

>> Or by using the global id, or by rejecting messages with duplicate message ids.
>
> Er, the MTA has already accepted it. Do you plan to generate a list manager bounce to the poster? This has the unpleasant misfeature that it could be used to bounce spam off the list manager, since the poster needs to see content to determine whether this is a multiple send or actually the intended version after a fat-finger send; we already know the message-id isn't good enough.

Yes, this wouldn't be an MTA bounce, it would be a Mailman bounce. But it would have to be subject to the same bounce rules as any other auto-response which could be used as a spam vector, e.g. limit the number of bounces per time period and don't include the entire original message in the bounce (as both can be, and are, used as spam vectors).

-Barry
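To make the three collision policies discussed above concrete, here is a minimal in-memory sketch. Only the name DuplicateMessageError comes from the thread; the MessageStore class, its add() method, and the policy names are hypothetical and are not the interface Barry posted.

```python
class DuplicateMessageError(Exception):
    """Raised by a store whose policy is to reject colliding global ids."""


class MessageStore:
    """Toy in-memory store keyed by global id (hypothetical interface)."""

    def __init__(self, policy="collisions"):
        self.policy = policy      # "reject", "discard", or "collisions"
        self.messages = {}        # gid -> first message filed under it
        self.collisions = {}      # gid -> later messages with the same gid

    def add(self, gid, message):
        """File a message and report what happened to it."""
        if gid not in self.messages:
            self.messages[gid] = message
            return "stored"
        if self.policy == "reject":
            raise DuplicateMessageError(gid)
        if self.policy == "discard":
            return "discarded"
        # Default: keep everything and let the list owner decide later.
        self.collisions.setdefault(gid, []).append(message)
        return "collision"
```

With the "collisions" policy the store never raises, which matches the observation that an archiver with a collision resolution policy would not need DuplicateMessageError at all.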
Re: [Mailman-Developers] Improving the archives
On 20 Jul 2007, at 13:39, Barry Warsaw wrote:

> I've looked at a few Lurker archives and I wasn't blown away by the user interface. That's apparently highly configurable though.

I'd be inclined to agree wrt user interface. Documentation regarding this, and anything else to do with lurker, appears somewhat scarce - speaking as someone who has just migrated the exim.org lists to using lurker archiving. [previously we used mailman with the MHonArc/pipermail hybrid]

I am considering starting a set of pages within our wiki about use of lurker (we tend to cover almost everything else about mail so why not that).

> Lurker's GPL2 so that's fine. I'd be quite hesitant about shipping Mailman with Lurker because it's something we don't control and it's not Python. But I would be totally open to working with the Lurker developers on creating an easy bridge between the two systems. Perhaps this dovetails with Jeff's suggestion of easier integration with external archiving systems.

Integration with externals feels like a good way to go.

> Does anybody have contacts with the Lurker community who could cross-post a new thread to get the discussion going?

The ML appears... lacking in vigor..

BTW lurker gives all messages an ID which is 3 parts separated by periods. The first part is a date field - i.e. 20070720, the second part is the receive time, UTC, as 6 digits, and the final part is some form of hex id. The nice part is if you quote just the first (or first 2) parts of the message ID you get messages around that time...

Nigel.
--
[ Nigel Metheringham [EMAIL PROTECTED] ]
[ - Comments in this message are my own and not ITO opinion/policy - ]
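The three-part ID layout described above can be illustrated with a short sketch. It follows only the description in this message (date, six-digit UTC receive time, hex id); the function names are hypothetical, the hex part is treated as an opaque string, and none of this is taken from Lurker's own code.

```python
from datetime import datetime, timezone


def lurker_style_id(received, hex_part):
    """Build 'YYYYMMDD.HHMMSS.hex' from a timezone-aware receive time."""
    utc = received.astimezone(timezone.utc)
    return "%s.%s.%s" % (utc.strftime("%Y%m%d"), utc.strftime("%H%M%S"), hex_part)


def near(ids, prefix):
    """Return the ids that begin with a date (or date.time) prefix, sorted."""
    return sorted(i for i in ids if i.startswith(prefix))
```

Because the date and time come first, a plain string sort is also a chronological sort, which is what makes quoting only the first one or two parts useful for finding neighbouring messages.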
Re: [Mailman-Developers] Improving the archives
On Jul 20, 2007, at 9:17 AM, Nigel Metheringham wrote:

> On 20 Jul 2007, at 13:39, Barry Warsaw wrote:
>
>> I've looked at a few Lurker archives and I wasn't blown away by the user interface. That's apparently highly configurable though.
>
> I'd be inclined to agree wrt user interface. Documentation regarding this, and anything else to do with lurker, appears somewhat scarce - speaking as someone who has just migrated the exim.org lists to using lurker archiving. [previously we used mailman with the MHonArc/pipermail hybrid]

I noticed that! There's no documentation link on the site. I also saw your question regarding getting a message out of lurker given its message-id. When I checked yesterday I didn't see a response.

> I am considering starting a set of pages within our wiki about use of lurker (we tend to cover almost everything else about mail so why not that).

That would be cool. Feel free to add a link to your pages on the Mailman wiki, perhaps here: http://wiki.list.org/display/DOC/Home

>> Does anybody have contacts with the Lurker community who could cross-post a new thread to get the discussion going?
>
> The ML appears... lacking in vigor..
>
> BTW lurker gives all messages an ID which is 3 parts separated by periods. The first part is a date field - i.e. 20070720, the second part is the receive time, UTC, as 6 digits, and the final part is some form of hex id. The nice part is if you quote just the first (or first 2) parts of the message ID you get messages around that time...

Obviously Mailman can't know the second and third parts so it can't use them in its list copies. I dislike using YYYYMMDD because of the high number of collisions.

I should make clear that what I'm really proposing is not specific to Mailman or any particular archiver. It's really an interface to a generic message store. We succeed by convincing other mailing list software and archivers to adopt the same standard so that they can interoperate seamlessly. We can perhaps have the first implementations of this de facto standard (any latent RFC shepherds out there? :). We get everyone else to adopt it when we take over the world.

-Barry
Re: [Mailman-Developers] Improving the archives
On Jul 20, 2007, at 9:31 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>> Second, things can happen to a list that might cause this sequence number to get corrupted.
>
> Add an X-Mailman-Sequence-Number header if not already present.
>
> That doesn't deal with your other comments, but as I point out elsewhere, if you don't use *any* Mailman-specific information in the global ID, you have no sane way to handle collisions except throw them away (or make the global ID refer to a collection resource, but that's kinda unintuitive).

I'd probably call it X-List-Sequence-Number and I'd have to ensure that the archive copy had that header in it. OTOH, if I'm going to go to the trouble of adding this sequence number, why not just calculate a (more likely) gid for the message myself? If I did that, I could use a tinyurl scheme and get much shorter urls. The archiver would then be obliged to use my X-List-GID header verbatim.

I've been pushing for calculating this using non-Mailman headers because I'd /like/ for a client receiving the non-list copy to be able to make the same calculation. OTOH, maybe we can have it both ways.

So, we calculate the sequence number and generate the following headers:

    X-List-Sequence-Number: 801
    X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

The latter is composed of purely author generated data, the former is supplied by Mailman. Assuming we also had this header:

    List-Archive: http://archive.example.com/gid/

then the following urls would point to the exact same resource:

    http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
    http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801

If however we subsequently got a collision, then these two urls would address different resources. E.g.:

    X-List-Sequence-Number: 2112
    X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

Now the two messages would still be addressable by their respective urls:

    http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801
    http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/2112

but

    http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

would be a disambiguation page. For a web u/i it would be an HTML list containing relative links to '801' and '2112'. A RESTful XML document would contain the set of links to the subordinate pages.

A client of the archive.example.com service would have to be prepared to handle disambiguation pages if it used only the author generated GID, but it would be guaranteed that the full url would lead directly to one and only one email message.

Archivers would have to recognize the X-List-Sequence-Number and honor it whenever they regenerated their archives so that the urls would remain stable.

Thinking about this more (and I've been up since about 3:30am so I'm a little foggy right now ;), we may want to optimize for fewer dupes rather than fewer collisions, or maybe it doesn't matter. It would be interesting to see how big the message-id buckets are when only using the Message-ID header.

-Barry
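The two-level addressing scheme above can be sketched in a few lines. The hashing recipe here is an assumption: the thread only says the GID is built from author-generated data, and the 32-character example is consistent with a base32-encoded SHA-1 digest, so this sketch hashes the Message-ID and Date headers together. The function names and the in-memory index are hypothetical.

```python
import base64
import hashlib


def message_gid(message_id, date):
    """Base32-encoded SHA-1 of author-supplied headers (assumed recipe)."""
    digest = hashlib.sha1((message_id + date).encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii")


def archive_urls(base, gid, seqno):
    """Return (short url, fully qualified url) under the List-Archive base."""
    base = base.rstrip("/")
    return "%s/%s" % (base, gid), "%s/%s/%d" % (base, gid, seqno)


def resolve(index, gid, seqno=None):
    """Look up a message in index, a dict of gid -> {seqno: message}.

    A full (gid, seqno) address always returns exactly one message.
    A bare gid returns the message if it is unique, otherwise the sorted
    sequence numbers, standing in for the disambiguation page.
    """
    bucket = index[gid]
    if seqno is not None:
        return bucket[seqno]
    if len(bucket) == 1:
        return next(iter(bucket.values()))
    return sorted(bucket)
```

A recipient of the non-list copy can compute the short url alone, since it needs nothing from Mailman; only the sequence-numbered url requires the list copy's headers.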
Re: [Mailman-Developers] Improving the archives
On 20 Jul 2007, at 15:26, Barry Warsaw wrote:

>> BTW lurker gives all messages an ID which is 3 parts separated by periods. The first part is a date field - i.e. 20070720, the second part is the receive time, UTC, as 6 digits, and the final part is some form of hex id. The nice part is if you quote just the first (or first 2) parts of the message ID you get messages around that time...
>
> Obviously Mailman can't know the second and third parts so it can't use them in its list copies. I dislike using YYYYMMDD because of the high number of collisions.

It's used as part of a UID, but has the nice feature of allowing easy queries as to other messages at that time.

If the archiver is local you also have the information for part 2 of the UID - lurker takes it from the From_ line.

Nigel.
--
[ Nigel Metheringham [EMAIL PROTECTED] ]
[ - Comments in this message are my own and not ITO opinion/policy - ]
Re: [Mailman-Developers] Improving the archives
Hi Nigel,

On Jul 20, 2007, at 10:38 AM, Nigel Metheringham wrote:

> On 20 Jul 2007, at 15:26, Barry Warsaw wrote:
>
>>> BTW lurker gives all messages an ID which is 3 parts separated by periods. The first part is a date field - i.e. 20070720, the second part is the receive time, UTC, as 6 digits, and the final part is some form of hex id. The nice part is if you quote just the first (or first 2) parts of the message ID you get messages around that time...
>>
>> Obviously Mailman can't know the second and third parts so it can't use them in its list copies. I dislike using YYYYMMDD because of the high number of collisions.
>
> It's used as part of a UID, but has the nice feature of allowing easy queries as to other messages at that time.

That should definitely be a way to traverse to the message, but it's not the message's global id (a.k.a. canonical address relative to the base url of the message store). An archiver could provide other ways to traverse to the message, such as:

    /[EMAIL PROTECTED]/
        to see all messages by me
    /[EMAIL PROTECTED]/mailman-developers/20070720
        to see all messages by me today to this mailing list
    /Subject?Improving%20the%20archives&sort=thread
        to find all the messages in this thread regardless of when they were posted

etc.

> If the archiver is local you also have the information for part 2 of the UID - lurker takes it from the From_ line.

Mailman gets the From_ line before passing off to the archiver. But that's interesting, does lurker /require/ the From_ line?

-Barry
Re: [Mailman-Developers] Improving the archives
On 20 Jul 2007, at 15:52, Barry Warsaw wrote:

> Mailman gets the From_ line before passing off to the archiver. But that's interesting, does lurker /require/ the From_ line?

Well lurker handles Maildir - no From_ but the same info is in the filename, and it can take messages on stdin without a From_ - at which point I guess it's either faking it (from the headers) or making things up.

Nigel.
--
[ Nigel Metheringham [EMAIL PROTECTED] ]
[ - Comments in this message are my own and not ITO opinion/policy - ]
Re: [Mailman-Developers] Improving the archives
On Jul 20, 2007, at 10:59 AM, Nigel Metheringham wrote:

> On 20 Jul 2007, at 15:52, Barry Warsaw wrote:
>
>> Mailman gets the From_ line before passing off to the archiver. But that's interesting, does lurker /require/ the From_ line?
>
> Well lurker handles Maildir - no From_ but the same info is in the filename, and it can take messages on stdin without a From_ - at which point I guess it's either faking it (from the headers) or making things up.

Cool. I wonder if lurker is compatible with Python 2.5's mailbox.Maildir implementation and whether the two could share the maildirs. Thanks for the information!

-Barry
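For readers unfamiliar with the Python side of that question, this is roughly what filing and reading messages through the standard library's mailbox.Maildir looks like. The sketch uses the current Python 3 module rather than the 2.5 version mentioned above, the helper names are hypothetical, and it says nothing about whether Lurker can actually share such a directory.

```python
import mailbox


def file_message(maildir_path, msg):
    """Deliver msg into a Maildir and return the key it was stored under."""
    box = mailbox.Maildir(maildir_path, create=True)
    try:
        return box.add(msg)
    finally:
        box.close()


def message_ids(maildir_path):
    """List the Message-ID header of every message in the Maildir."""
    box = mailbox.Maildir(maildir_path, create=False)
    try:
        return [m["Message-ID"] for m in box]
    finally:
        box.close()
```

The key returned by add() is the unique part of the file name that the Maildir format assigns on delivery.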
Re: [Mailman-Developers] Improving the archives
Barry Warsaw writes:

> But it would have to be subject to the same bounce rules as any other auto-response which could be used as a spam vector, e.g. limit the number of bounces per time period and don't include the entire original message in the bounce

But that prevents detecting a prematurely sent message, which is presumably a common use case for genuine collisions. I just don't think bouncing back is going to be very useful; either you don't give the user the information he needs to figure out what happened, or you give the spammers a vector.
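The two bounce rules under discussion (a cap on bounces per time period, and no full copy of the original) can be sketched as follows. The class and function names, the sliding-window approach, and the line limit are all illustrative assumptions, not Mailman's actual auto-response code.

```python
import time
from collections import defaultdict, deque


class BounceLimiter:
    """Allow at most max_bounces per address within any period seconds."""

    def __init__(self, max_bounces, period, clock=time.monotonic):
        self.max_bounces = max_bounces
        self.period = period
        self.clock = clock
        self.sent = defaultdict(deque)   # address -> times of recent bounces

    def allow(self, address):
        now = self.clock()
        recent = self.sent[address]
        # Forget bounces that have aged out of the window.
        while recent and now - recent[0] >= self.period:
            recent.popleft()
        if len(recent) >= self.max_bounces:
            return False
        recent.append(now)
        return True


def excerpt(body, max_lines=10):
    """Return only the first max_lines lines of the original message body."""
    return "\n".join(body.splitlines()[:max_lines])
```

The trade-off raised above shows up directly in excerpt(): the smaller max_lines is, the less useful the bounce is to a poster trying to tell a double send from a fat-finger send.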
Re: [Mailman-Developers] Improving the archives
John A. Martin writes:

> In the absence of a Message-ID on an outgoing mail message many if not most MTAs will add one. Why not let Mailman anticipate the need to add a Message-ID when archiving the message rather than leaving it to the outgoing MTA?

Quite. My reason for saying "last resort" is simply that this is not predictable to third parties. E.g., I send you (a non-subscriber) a message with CC and no Message-ID. You'd like to find the thread in the archives. You may as well just do a linear search on that month's threads.

A URL based on an MD5 of the message body in theory would work, but in the presence of non-ASCII bodies, structured MIME, ML digests, and various MTA autoconversions, that seems fragile.
Re: [Mailman-Developers] Improving the archives
On 7/3/07, Terri Oda [EMAIL PROTECTED] wrote:

> I'm trying to remember all the things people have suggested for the archives in the past so I can figure out what needs to be done and what might be nice to have, and see if this is doable in the time I have in the foreseeable future.

At lists.indymedia.org, we use a patch that provides these:

* stable URLs based on a generated message id
* URLs to the archived message in the message headers
* message hiding

http://lists.indymedia.org/patches/imc-10-mmid_hide_posts.patch

It poses a bit of a migration issue since all the existing mboxes may or may not have the mmid header in them. We worked around that by having a special place for the old archives.

We've been meaning to move to lurker for years, but haven't had the human resources and also there were some showstoppers:

* public/private lists - lurker couldn't do that properly when we looked
* lack of date-based index to the archives
* general navigation issues; stuff like linking between current thread and nearby ones
* mailto links (has now been fixed)
* the migration nightmare

My personal opinion is that pipermail should be removed and mailman should not contain a default archiver since there are plenty of good archivers already (lurker, mhonarc etc). Adding wrappers around them would be simpler than reimplementing them.

--
bye, pabs
https://docs.indymedia.org/view/Main/PaulWise
Re: [Mailman-Developers] Improving the archives
On Tue, 2007-07-03 at 20:05 -0400, Barry Warsaw wrote:

> On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:
>
>> Since I've largely finished up the coding contract that was eating up a lot of my time, I'm thinking that I'd like to do some coding for fun. And nothing says fun like trying to fix the Mailman archives! ;)
>
> That would be awesome Terri! It's an aspect of Mailman that sorely needs attention, and you will gain (even more) fame and fortune by working on it. :) I totally support this effort.

A little over a year ago I went on a search to find the best open source archiver and at that time I came up with Lurker (http://lurker.sourceforge.net). Since then I believe Lurker has seen a major new revision. I also believe Lurker is the archiver used by Debian. So if you want to leverage existing open source archiving or at least look at an example of what would be necessary to allow easy external archiving integration with Mailman you might want to look at Lurker.

--
John Dennis [EMAIL PROTECTED]
Re: [Mailman-Developers] Improving the archives
On 5-Jul-07, at 12:09 PM, John Dennis wrote:

> A little over a year ago I went on a search to find the best open source archiver and at that time I came up with Lurker (http://lurker.sourceforge.net). Since then I believe Lurker has seen a major new revision. I also believe Lurker is the archiver used by Debian.

I was hoping someone would post that link! Lurker was best of breed last time I was looking, and I'd definitely like to see what we can leverage there.

Terri
Re: [Mailman-Developers] Improving the archives
Barry Warsaw writes:

>> - archive links that won't break if the archive is rebuilt
>
> Yes, this is absolutely critical, in fact, I'd put it right at the top of the list, even more so than a u/i overhaul. Stable urls, with backward compatible redirecting links if at all possible, would be fantastic.

+1. I've been wanting to do something about this, and have made proposals (not backed with code, mea maxima culpa) for design. I would definitely be happy to help with this, but given time constraints, it would be nice if somebody else could take the lead.

> Along with that, I would really like to come up with an algorithm for calculating those urls without talking to the archiver.

Brad didn't like this when I suggested it before, but I didn't really understand why not. Anyway, FWIW:

I suggest adding an X-List-Received-ID header to all messages. I haven't really thought through whether the UUID in that field should be at least partly human-readable or not, but that doesn't matter for the basic idea.[1] The on-disk directory format would be

    /path-to-archive/private/my-list/Message-ID

for singletons (Message-ID is the author-supplied ID) and

    /path-to-archive/private/my-list/Message-ID/List-Received-ID

for multiples. These would be created on-the-fly when they occur. They can be served as static pages.

For almost all messages, the bare URL

    http://archives.example.com/my-list/Message-ID

should Just Work (i.e., return a no-such-object result or a single message). Where it does not, you get an index of all pages with that message ID.

The main drawback to using Message IDs that I can see is that broken MUAs may supply no Message-ID, or the same one repeatedly. In the former case, as a last resort Mailman can supply one, but that won't help people who get a personal copy and want to find the thread. However, I see no way to help them, anyway, beyond a generic archive search engine. In the latter, you get lots of messages matching the Message-ID, and while most lists should have *zero* problems, a list that has any instances of this problem would have many. Again I can't see a good way to deal with this other than a general search facility, as computing a digest of headers or content is hard to do reliably. Providing an index of matching posts seems like a reasonable approach, which can be efficiently implemented (e.g., as static pages). Furthermore, the examples I've seen of both in the last few years have all been either spam or (in the case of duplicate Message-IDs) actual duplicates due to some mail system problem or itchy user fingers.

A minor drawback to my proposal is that if a message gets archived as a singleton for that Message-ID, then a duplicate arrives, previously created references in the archive will of course now return an index rather than the desired message. I.e., there is data corruption. This can be dealt with in several ways; the easiest would be to provide an "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-for-me" link when creating the directory for multiple instances.

There's also a *very* minor benefit: repeat sends will be immediately recognizable without checking Message-ID.

Footnotes:
[1] By partly human-readable I mean containing list-id and date information. The idea would be to have the date come first, so that users would have a shot at identifying which of several messages is most likely, and this would be searchable by eye with simply an ordinary sorted index.
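The on-disk layout proposed above (a plain file per Message-ID, promoted to a directory of List-Received-IDs on the first collision) can be sketched like this. The function names and the percent-encoding of header values into path components are assumptions for illustration; the sketch is neither atomic nor hardened against hostile header values such as "..".

```python
import os
from email import message_from_bytes
from urllib.parse import quote


def _safe(header_value):
    """Percent-encode a header value so it fits in one path component."""
    return quote(header_value.strip("<>"), safe="")


def file_message(list_dir, raw):
    """Store raw message bytes and return the path relative to list_dir."""
    msg = message_from_bytes(raw)
    mid = _safe(msg["Message-ID"])
    rid = _safe(msg["X-List-Received-ID"])
    target = os.path.join(list_dir, mid)
    if not os.path.exists(target):
        with open(target, "wb") as f:        # singleton: a plain file
            f.write(raw)
        return mid
    if os.path.isfile(target):               # first collision: promote to a directory
        with open(target, "rb") as f:
            first = f.read()
        first_rid = _safe(message_from_bytes(first)["X-List-Received-ID"])
        os.remove(target)
        os.mkdir(target)
        with open(os.path.join(target, first_rid), "wb") as f:
            f.write(first)
    with open(os.path.join(target, rid), "wb") as f:
        f.write(raw)
    return os.path.join(mid, rid)
```

The promotion step is exactly where the "minor drawback" above appears: a reference created while the message was a singleton now points at a directory, so the index page has to say which entry was the original.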
Re: [Mailman-Developers] Improving the archives
st == Stephen J Turnbull
      Re: [Mailman-Developers] Improving the archives
      Wed, 04 Jul 2007 16:49:58 +0900

st> The main drawback to using Message IDs that I can see is that
st> broken MUAs may supply no Message-ID, or the same one
st> repeatedly. In the former case, as a last resort Mailman can
st> supply one,

If the archive is considered to be a reflection of what Mailman _put_ on the wire, as distinct from what was received from the wire, then adding a Message-ID in the absence of one already present is a reflection of a SHOULD requirement of rfc(2)822.

In the absence of a Message-ID on an outgoing mail message many if not most MTAs will add one. Why not let Mailman anticipate the need to add a Message-ID when archiving the message rather than leaving it to the outgoing MTA?

jam
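As a rough illustration of the suggestion above, the following sketch adds a Message-ID only when the incoming message has none, using the standard library's `email.utils.make_msgid`. The function name `ensure_message_id` and the domain `lists.example.com` are assumptions made for the example; this is not how Mailman itself is known to do it.

```python
from email import message_from_string
from email.utils import make_msgid


def ensure_message_id(msg, domain="lists.example.com"):
    """Return the message's Message-ID, generating one if it is missing."""
    if not msg.get("Message-ID", "").strip():
        # Remove an empty header if present; deleting an absent header
        # is a no-op for email.message.Message.
        del msg["Message-ID"]
        msg["Message-ID"] = make_msgid(domain=domain)
    return msg["Message-ID"]


# A message with no Message-ID gets one; an existing ID is left alone.
bare = message_from_string("From: a@example.com\n\nbody\n")
print(ensure_message_id(bare))
```

Doing this before the message reaches both the archiver and the outgoing MTA means every copy carries the same ID, which is the point of anticipating the MTA.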
Re: [Mailman-Developers] Improving the archives
I'm all for someone taking ownership of this long-neglected component -- thank you for doing so!

Barry Warsaw wrote:

> Maybe a way to think about this is that the canonical url is based on the message-id, but then there's some way to distill even this down to a tinyurl or simple integer that would be stable in the face of full archive regenerations.

The resistance to basing this on message-id has always been that there's no guarantee of uniqueness...

...but I believe each list has some sort of counter for how many messages it's seen, so we could add another header with that number, and use as a unique id the two concatenated together...

(That way the archiver can know from the content of the header exactly how to generate the same unique id as mailman, which would allow for the url-in-the-footer to happen w/o first hitting the archiver.)

Just throwing out ideas,

-Dale
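A small sketch of the idea above: Mailman stamps the per-list counter into a header, and both Mailman and the archiver derive the same unique id from the headers alone. The header name `X-List-Sequence`, the function names, and the `-` separator are hypothetical choices for illustration; the post does not specify them.

```python
from email import message_from_string

SEQ_HEADER = "X-List-Sequence"  # hypothetical header name


def stamp(msg, counter):
    """Record the list's running message count in the message headers."""
    del msg[SEQ_HEADER]  # no-op if absent; avoids duplicate headers
    msg[SEQ_HEADER] = str(counter)


def unique_id(msg):
    """Derive the id from headers only, so any party computes the same value."""
    message_id = msg["Message-ID"].strip().strip("<>")
    return f"{message_id}-{msg[SEQ_HEADER].strip()}"


raw = "Message-ID: <dup@example.com>\nSubject: hi\n\nbody\n"
first, second = message_from_string(raw), message_from_string(raw)
stamp(first, 41)
stamp(second, 42)
# Same Message-ID, different ids:
print(unique_id(first), unique_id(second))
```

Because `unique_id` reads nothing but the message itself, the archiver can recompute it after the message has gone over the wire, without any shared state.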
Re: [Mailman-Developers] Improving the archives
> Maybe a way to think about this is that the canonical url is based on the message-id, but then there's some way to distill even this down to a tinyurl or simple integer that would be stable in the face of full archive regenerations.

I'd suggest the reverse. Keep the canonical archive URL short and sweet, and then use a URL redirection service to map message-id's to those URLs. It is the archiver's job to make it all work.

For example, the canonical archive URL might stay exactly the way it is in pipermail. But the archival link embedded in the message would instead go to a redirection service.

    http://mail.codeit.com/pipermail/zcommerce/2002-February/000523.html
    http://mail.codeit.com/[EMAIL PROTECTED]

The one other thing I'd like to revisit is integration with third party archival services. There are two obvious integration points; one is a button in the Mailman list admin user interface that says "archive with service X", not unlike the setting in Firefox that basically says "search with service X". The other integration point is the archival link discussed above. In which case it would be set to something like:

    http://third-party-service/[EMAIL PROTECTED]

Disclosure: I help run a third party archiving service, and this topic was discussed quite a bit previously. [1] Nonetheless it seems like a good time to revisit given the current discussion about archive wishlists.

[1] http://www.mail-archive.com/mailman-developers@python.org/msg08772.html
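The redirection idea can be sketched as a lookup table owned by the archiver: it records the canonical URL for each Message-ID when it files the message, and answers requests for the Message-ID form with a redirect. The class name, the `(status, location)` return shape, and the example Message-ID are hypothetical; the exact URL form of the redirect link is elided in the post and is not reconstructed here.

```python
from urllib.parse import unquote


class Redirector:
    """Maps Message-IDs to canonical archive URLs (in-memory sketch)."""

    def __init__(self):
        self._canonical = {}

    @staticmethod
    def _key(message_id):
        return message_id.strip().strip("<>")

    def register(self, message_id, canonical_url):
        """Called by the archiver once it knows where the message landed."""
        self._canonical[self._key(message_id)] = canonical_url

    def resolve(self, request_path):
        """Return (302, url) for a known Message-ID, else (404, None)."""
        key = self._key(unquote(request_path.lstrip("/")))
        url = self._canonical.get(key)
        return (302, url) if url else (404, None)


r = Redirector()
r.register(
    "<20020214.1234@example.org>",  # hypothetical Message-ID
    "http://mail.codeit.com/pipermail/zcommerce/2002-February/000523.html",
)
print(r.resolve("/20020214.1234%40example.org"))
```

The canonical URL can change on a rebuild as long as the archiver re-registers it; links embedded in delivered messages keep pointing at the Message-ID form.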
Re: [Mailman-Developers] Improving the archives
> In which case [the message body link] would be set to something like.
> http://third-party-service/[EMAIL PROTECTED]

Just for fun, I did a trial implementation. It works, but the URLs are too long. For example, the URL below spends 59 characters on the message-id, and 27 characters on the listname. We're already over my comfort level (of about 72 characters) and haven't even started to count the hostname, and other URL-lengthening overhead. Maybe this was a bad idea after all.

    http://www.mail-archive.com/search?l=mailman-developers%40python.org[EMAIL PROTECTED]

Jeff
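One way to bound the length cost described above, offered only as a sketch and not as something proposed in this message, is to put a fixed-length digest of the Message-ID in the URL instead of the Message-ID itself (hashed IDs come up elsewhere in this thread). The function name and the choice of SHA-1 with base32 encoding are assumptions for illustration.

```python
import base64
import hashlib


def short_token(message_id):
    """Return a fixed-length, URL-safe token derived from a Message-ID."""
    normalized = message_id.strip().strip("<>")
    digest = hashlib.sha1(normalized.encode("utf-8")).digest()  # 20 bytes
    # 20 bytes encode to exactly 32 base32 characters, with no padding.
    return base64.b32encode(digest).decode("ascii")


# Hypothetical long Message-ID; the token length does not depend on it.
example_id = "<46A1B2C3.1234567890abcdef@very-long-hostname.example.org>"
print(len(example_id), len(short_token(example_id)))
```

The trade-off is that the token is no longer human-readable, and the service still needs a table mapping tokens back to messages.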
Re: [Mailman-Developers] Improving the archives
I'll admit to not having read previous discussions on this topic, but I'll also add my 2 insert-lowest-denomination-coin here:

On 7/2/07 11:06 PM, Terri Oda wrote:

> - better address obfuscation (maybe by generating pages through cgi)

I run a few Wordpress sites, and there's a plugin I use called PHPEnkoder which does a good job of this. It basically wraps the address around a little bit of Javascript; if you have Javascript turned on in the browser, it's seamless, and if not you see "Javascript required to view address" or something like that. The theory is that bots and such don't run JS, so it's safe from harvesting. I'll leave it to the list as to how true an assessment this is, but it Works For Me.

> * Add a search option

I know there's been patches around forever that integrate ht://Dig with Pipermail; maybe some way to do this, while still making it an option that can be tuned? If ht://Dig is there and you turn on the option, it works, but if it's not then it's not required? This would satisfy the "not adding a billion dependencies" item, but may be overkill as well. I'll also happily admit to not knowing much about the cost of search engines to a system.

> * MUAs usually make URLs clickable. A new Archive could be used when posts are distributed, in the footer, so that each message has a link to the whole thread in the Archive.

This would be a Godsend. A group at work here runs an old homebrewed exploder, and a few years ago I tried to convert them to Mailman. They liked everything they saw, up until the point where they couldn't refer to some kind of short and simple message number, and get right to that message in the archive.

The current system generates a number based on a simple incrementing index of the list, and many months after a mailing people will refer to message #483, and know they can view it at http://hostname/foo/listname/483.html - which is also posted in the footer of the message sent out. Of course, if the archives were based on Message-ID headers, this may make such a number a bit unwieldy, but if it were some kind of simple-ish system I might finally get rid of those old lists.

--
Steve Huston - W2SRH - Unix Sysadmin, Dept. of Astrophysical Sciences
  Princeton University  | ICBM Address: 40.346525 -74.651285
  126 Peyton Hall       | On my ship, the Rocinante, wheeling through
  Princeton, NJ 08544   | the galaxies; headed for the heart of Cygnus,
  (609) 258-7375        | headlong into mystery. -Rush, 'Cygnus X-1'
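For reference, the numbering scheme of the old exploder described above amounts to something like the sketch below: a per-list counter, a URL built from the number, and the same URL appended to the outgoing message. The class and function names and the footer wording are hypothetical; only the URL shape `http://hostname/foo/listname/483.html` comes from the message.

```python
class ListCounter:
    """Simple incrementing per-list message index."""

    def __init__(self, start=0):
        self.last = start

    def next_number(self):
        self.last += 1
        return self.last


def message_url(host, listname, number):
    """URL at which message `number` of the list can be viewed."""
    return f"http://{host}/foo/{listname}/{number}.html"


def with_footer(body, host, listname, number):
    """Append a footer naming the message number and its archive URL."""
    url = message_url(host, listname, number)
    return f"{body.rstrip()}\n\n--\nMessage #{number}: {url}\n"


counter = ListCounter(start=482)
n = counter.next_number()
print(with_footer("Hello list\n", "hostname", "listname", n))
```

The appeal is that the number is known at send time and is short enough to quote in conversation; it only stays valid while the counter is never reset or renumbered.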
Re: [Mailman-Developers] Improving the archives
On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:

> Since I've largely finished up the coding contract that was eating up a lot of my time, I'm thinking that I'd like to do some coding for fun. And nothing says fun like trying to fix the Mailman archives! ;)

That would be awesome Terri! It's an aspect of Mailman that sorely needs attention, and you will gain (even more) fame and fortune by working on it. :) I totally support this effort.

> I'm trying to remember all the things people have suggested for the archives in the past so I can figure out what needs to be done and what might be nice to have, and see if this is doable in the time I have in the foreseeable future. The big things people wanted most, if I recall correctly, included:
>
> - modernized HTML/CSS/Themes (preferably to match a modernized web interface... is that all set up now?)

It's not, but Andrew Kuchling will be working on this. I haven't yet revealed detailed plans, though I'm working on an email about this over the U.S. July 4th holiday. But I suppose it's time for a quick summary:

I'd like to get a Mailman 2.2 out with an updated u/i sooner rather than later, and if possible an updated archiver would be one of those few other new features that I think could go into a 2.2. OTOH, it would be fine if we pushed that off to Mailman 3 too, as long as it leveraged all the u/i work to be done in 2.2.

> - archive links that won't break if the archive is rebuilt

Yes, this is absolutely critical, in fact, I'd put it right at the top of the list, even more so than a u/i overhaul. Stable urls, with backward compatible redirecting links if at all possible, would be fantastic.

Along with that, I would really like to come up with an algorithm for calculating those urls without talking to the archiver. This would allow the list delivery queue to calculate the List-Archive: header value and any message header/footer substitutions before the message hits the archiver.

> - better address obfuscation (maybe by generating pages through cgi)

I'd still love to do this, and I think were it not for crawlers, we could get a lot of mileage out of creation on demand and caching. But how do you handle Google crawling your archive?

> - search

Another huge huge feature.

> - not adding a billion dependencies to Mailman

Definitely. I'm also not opposed to changing the interface between Mailman and the archivers if necessary.

> Here's the list from the wiki's Mailman 2.2 page:
> http://wiki.list.org/display/DEV/Mailman+2.2

We should probably start a separate archiver wiki page. I plan on re-organizing the 2.2 page anyway, so I'll probably end up doing that if you don't get around to it before me <wink>.

> (1) Is anyone working on this already?

Not that I know of.

> (2) What else is on people's wish lists for a pipermail replacement?

Other things high on my list are ditching the crufty storage currently being used (pickles begone!), an RSS feed, and a 'message storage' which could be used to vend archived messages through other delivery transports, such as imap or nntp. But I'd be willing to put all that off for stable urls, an updated u/i, and searching.

Anything I can do to help, please let me know.

-Barry
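The property asked for above, calculating the URL without talking to the archiver, comes down to making the URL a pure function of data the delivery queue already has. The sketch below assumes the URL is derived from the list name and the Message-ID; the base URL, the function names, and the `$archive_url` footer placeholder are hypothetical and chosen only to show the shape of the computation.

```python
from email import message_from_string
from string import Template
from urllib.parse import quote

ARCHIVE_BASE = "http://archives.example.com"  # placeholder base URL


def archive_url(list_name, message_id):
    """Compute the archive URL from message data alone (no archiver call)."""
    name = quote(list_name, safe="")
    mid = quote(message_id.strip().strip("<>"), safe="")
    return f"{ARCHIVE_BASE}/{name}/{mid}"


def decorate(msg, list_name, footer_template):
    """Set the header and return the substituted footer before archiving."""
    url = archive_url(list_name, msg["Message-ID"])
    del msg["List-Archive"]  # no-op if absent; avoids duplicate headers
    msg["List-Archive"] = f"<{url}>"
    return Template(footer_template).substitute(archive_url=url)


msg = message_from_string("Message-ID: <abc@example.com>\n\nhi\n")
print(decorate(msg, "my-list", "Archived at: $archive_url"))
print(msg["List-Archive"])
```

Any archiver that applies the same function to the same headers will file the message where the footer says it is, so the two sides never need to exchange the URL.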
Re: [Mailman-Developers] Improving the archives
Steve makes me think of a couple of other wish list items.

On Jul 3, 2007, at 7:36 AM, Steve Huston wrote:

> On 7/2/07 11:06 PM, Terri Oda wrote:
>> - better address obfuscation (maybe by generating pages through cgi)
>
> I run a few Wordpress sites, and there's a plugin I use called PHPEnkoder which does a good job of this.

I have this idea that you could gateway messages from an archive or mailing list to and from a bulletin board forum. Maybe this doesn't fall within the scope of the archiver because I could see a 'forum queue' like we have an nntp queue, but in that case, being able to calculate an archive url without talking to the archiver becomes important again. It would be nice in that case to put a link to the archive message in the forum post.

>> * MUAs usually make URLs clickable. A new Archive could be used when posts are distributed, in the footer, so that each message has a link to the whole thread in the Archive.
>
> This would be a Godsend. A group at work here runs an old homebrewed exploder, and a few years ago I tried to convert them to Mailman. They liked everything they saw, up until the point where they couldn't refer to some kind of short and simple message number, and get right to that message in the archive.

This reminds me, I would love to have a link in an archive message that I could click to get the message sent to me, as it originally appeared on the mailing list. If I had that, I'd never need to locally save another mailing list post. I'd just search for the one I wanted, go to the archive, click on the "send it to me" link, then do a normal reply in my mail reader.

> The current system generates a number based on a simple incrementing index of the list, and many months after a mailing people will refer to message #483, and know they can view it at http://hostname/foo/listname/483.html - which is also posted in the footer of the message sent out. Of course, if the archives were based on Message-ID headers, this may make such a number a bit unwieldy, but if it were some kind of simple-ish system I might finally get rid of those old lists.

This would be possible with today's system, but it leads to unstable urls, especially when you consider archive scrubbing (which, come to think of it, is another wish list item ;). We'd like for an admin to be able to easily pull an archive message, but it's even worse than that. Sometimes an admin has to scrub the actual backing message store (e.g. today's mbox file). This will change the message counts and thus the incremental indexes.

Maybe a way to think about this is that the canonical url is based on the message-id, but then there's some way to distill even this down to a tinyurl or simple integer that would be stable in the face of full archive regenerations.

-Barry
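The instability described above is easy to reproduce in miniature. In the sketch below, numbers derived from a message's position in the store shift when an earlier message is scrubbed, while numbers assigned once and stored alongside the Message-ID do not. All names are hypothetical, the Message-IDs are made up, and the second approach assumes Message-IDs are unique within the list, which this thread notes is not guaranteed.

```python
def positional_numbers(store):
    """Number messages by position, as a rebuild from the store would."""
    return {mid: n for n, mid in enumerate(store, start=1)}


class AssignedNumbers:
    """Numbers handed out once and never reused or shifted."""

    def __init__(self):
        self._next = 1
        self._by_id = {}

    def add(self, message_id):
        if message_id not in self._by_id:
            self._by_id[message_id] = self._next
            self._next += 1
        return self._by_id[message_id]

    def scrub(self, message_id):
        # Removing one entry leaves every other number untouched.
        self._by_id.pop(message_id, None)

    def number_of(self, message_id):
        return self._by_id.get(message_id)


store = ["<a@example.com>", "<spam@example.com>", "<b@example.com>"]
assigned = AssignedNumbers()
for mid in store:
    assigned.add(mid)

before = positional_numbers(store)["<b@example.com>"]
store.remove("<spam@example.com>")
assigned.scrub("<spam@example.com>")
after = positional_numbers(store)["<b@example.com>"]

print(before, after, assigned.number_of("<b@example.com>"))
```

The cost of the stable variant is that the mapping has to be persisted somewhere that survives a full archive regeneration, rather than being recomputed from the store.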