Re: [Mailman-Developers] Improving the archives

2007-11-03 Thread Jeff Breidenbach
but if you can trust yourself to generate them, consecutive
integers provide minimal, order-preserving, perfect hashing, too!

Hmm this sounds pretty sensible to me.

Jeff
___
Mailman-Developers mailing list
Mailman-Developers@python.org
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py
Searchable Archives: 
http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe: 
http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org

Security Policy: 
http://www.python.org/cgi-bin/faqw-mm.py?req=showfile=faq01.027.htp


Re: [Mailman-Developers] Improving the archives

2007-11-03 Thread Stephen J. Turnbull
Craig Loomis writes:

 Globally unique IDs, hashed IDs, etc., are very appealing from  
  various CS-y and techie points of view, but are simply not memorable  
  to humans or knowable by dumb external programs. I think as much, or  
  more, effort should be put into delivering a straightforwardly useable  
  naming scheme as goes into making an arbitrary message recoverable  
  from anywhere.  Basically, friendly URLs should be a primary  
  requirement, not an optional afterthought for careless geeks like me  
  to get wrong later

Friendly URLs *are* a primary requirement.  The point is that to make
them *reliable* as well, either a globally unique ID is needed, or
individual site admins must suffer through hard-to-document
constraints on what they can do with their archives.  Note that the
system you describe based on the post_id member demonstrates the value
of a unique ID.

Sufficient reliability is not a tough requirement for an individual
admin to achieve, as you have demonstrated.  It's much more exacting
for the Mailman developers, who need to satisfy both sites with
different needs *and* archivers with different features.

 As an aside on other discussions, can you get away without using  
  Message-ID or Date?

No.  Not all recipients of the messages get them through the list.
Once again, Mailman developers have to consider that situation, while
in your situation you may not need to worry about it.


Re: [Mailman-Developers] Improving the archives

2007-10-30 Thread Craig Loomis

   Or Re: [Mailman-Developers 10417] Improving the archives

   I would like to interject and highlight some use cases for stable  
and predictable IDs. For us, message IDs are directly used both by  
people and ignorant programs. Our mailing lists serve as a permanent  
and concise record of our discussions, decisions, and operations, and  
we find it invaluable to be able to refer to individual messages in a  
simple and memorable way: message 1210 in the calibration list, say.  
Other people can then easily jot that info down or directly find the  
message. Some message IDs even become shorthands for a particular  
topic or decision. We have also added trac InterWiki templates  
pointing into our mail archives (as listname:number), which encourages  
desirable cross-referencing (PRs, wiki pages, and SVN change logs can  
refer to mail messages, just as wiki pages could always refer to  
changesets and PRs, etc, etc.)  But trac InterWiki templates can only  
interpolate $1,$2,... arguments into strings, and could not possibly  
calculate anything based on the _content_ of the messages.
   Globally unique IDs, hashed IDs, etc., are very appealing from  
various CS-y and techie points of view, but are simply not memorable  
to humans or knowable by dumb external programs. I think as much, or  
more, effort should be put into delivering a straightforwardly useable  
naming scheme as goes into making an arbitrary message recoverable  
from anywhere.  Basically, friendly URLs should be a primary  
requirement, not an optional afterthought for careless geeks like me  
to get wrong later

   We long ago added an extremely simple ID handoff between MM 2.1.8
and pipermail, and though imperfect it has served us well. Basically,
we hijacked the .post_id member in mailman (otherwise basically
unused, and mysteriously a floating point number); CookHeaders stuffed
it into an X-Mailman-Sequence-ID header line, and AfterDelivery
incremented it. In turn, pipermail uses the header to feed a sequence
ID into make_article, and the message is squirreled away as
$mailinglist/all/%d.html. There are a few other minor matters (e.g.
post_id was added to Decorators, a couple of templates were changed,
we lost having 'ls' sort chronologically [did we have to add .last
and .prev to the HyperDatabase classes?]), but it really was a minor
bit of work. And for stability, as long as the archive files aren't
lost, pipermail rebuilds should yield the same URLs even if junk
messages have been deleted. [Oh, we did also add a "never rotate"
policy to our archives, but that is finesseable.]
   As an aside on other discussions, can you get away without using  
Message-ID or Date? I.e., aren't those just more of those tokens which  
were standardized back before the Internet got tricky enough to  
invalidate the standards? Mailing lists serialize incoming messages,  
and so can generate their own unique and trustworthy IDs. UUIDs  
would work, but if you can trust yourself to generate them,  
consecutive integers provide minimal, order-preserving, perfect  
hashing, too!

   Anyhow, we have found that people will enthusiastically refer by  
name to individual messages within mail archives if they can.

  - craig
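
For readers who want to picture the handoff described above, here is a minimal Python sketch. It is not the actual MM 2.1.8 or pipermail patch: the names SequenceCounter, stamp and archive_path are invented for illustration, and only the X-Mailman-Sequence-ID header name and the listname/all/N.html layout are taken from the message.

```python
from email.message import Message


class SequenceCounter:
    """Hypothetical stand-in for the list's post_id member."""

    def __init__(self, start=1):
        self.post_id = int(start)  # kept as an int here, not a float

    def stamp(self, msg):
        """Write the current ID into the header, then advance the counter."""
        del msg["X-Mailman-Sequence-ID"]  # drop any value a sender supplied
        msg["X-Mailman-Sequence-ID"] = str(self.post_id)
        self.post_id += 1
        return msg


def archive_path(listname, msg):
    """Archiver side: turn the header back into listname/all/N.html."""
    seq = int(msg["X-Mailman-Sequence-ID"])
    return "%s/all/%d.html" % (listname, seq)


counter = SequenceCounter(start=1210)
first = counter.stamp(Message())
second = counter.stamp(Message())
print(archive_path("calibration", first))   # calibration/all/1210.html
print(archive_path("calibration", second))  # calibration/all/1211.html
```

Because the number travels inside the message itself, a rebuild that reads the header again should arrive at the same file name, which is the stability property claimed above.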



Re: [Mailman-Developers] Improving the archives

2007-10-03 Thread Jeff Breidenbach
Question: what about crossposted messages?

Let's say a message gets sent to a list called mailman-developers
with a CC to a list called pet-bunnies. Hypothetically, of course.
Presumably, the person who got the message from pet-bunnies
should probably end up at the pet-bunnies archive, where the
message can be viewed in proper context; right before the
processed carrots flamewar and after the manifesto on proper
hopping technique. To make that work, I think we need some
way to - at least optionally - allow one or more of the RFC 2369
headers to influence the archival URL. Reading the wiki, I guess
that's where List-Archive comes into play?

My other question is about the angle brackets. Barry, why are
you inclined to include them in calculations? It's kind of arbitrary,
but quoting RFC 2822, end of section 3.6.4:

   Semantically, the angle bracket characters are not part of the
   msg-id; the msg-id is what is contained between the two angle bracket
   characters.
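
One way the cross-posting case might work is sketched below in Python: each list copy carries its own List-Archive header, so the same Message-ID leads to a different archive for each list. The URL layout (archive base followed by a base32-encoded SHA-1 of the bare msg-id) and the example.org addresses are assumptions for illustration, not something the thread has settled on.

```python
import base64
import hashlib
from email import message_from_string


def archival_url(raw_message):
    """Join the List-Archive base URL with a hash of the bare msg-id.

    The URL layout is an assumption made for this sketch only.
    """
    msg = message_from_string(raw_message)
    base = msg["List-Archive"].strip().strip("<>").rstrip("/")
    msg_id = msg["Message-ID"].strip().strip("<>")
    digest = hashlib.sha1(msg_id.encode("utf-8")).digest()
    return base + "/" + base64.b32encode(digest).decode("ascii")


# The same post as delivered by two different lists.
TEMPLATE = (
    "Message-ID: <carrots.1@example.org>\n"
    "List-Archive: <http://archive.example.org/%s/>\n"
    "\n"
    "Body\n"
)

dev_copy = archival_url(TEMPLATE % "mailman-developers")
bunny_copy = archival_url(TEMPLATE % "pet-bunnies")
print(dev_copy)
print(bunny_copy)
```

The two URLs share the final path component and differ only in the archive they point to, so each recipient lands in the context of the list the copy came from.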


Re: [Mailman-Developers] Improving the archives

2007-10-03 Thread Ian Eiloart


--On 2 October 2007 22:47:35 -0400 Barry Warsaw [EMAIL PROTECTED] wrote:

 One question: should the angle brackets on the Message-ID  be part of
 the hash or not?  I think they should, or IOW, the entire value of
 the Message-ID header is taken as the hash, though they should be
 stripped off if using the Message-ID in any kind of archive query.
 I'm open to suggestions though... comments?

Mathematically, the two solutions are equivalent for valid headers, aren't 
they? OK, the hashes will be different, but only in a trivial sense.

Technically, I imagine, it's going to be easier to handle bogus headers if 
you just hash the entire header. For example, what do you do if some piece 
of crapware gives you a message with a header missing the angle brackets? 
Or that adds something outside angle brackets? Or that includes a 
right-angle bracket in the message-id itself?

You don't have to think about any of those situations if you either (A) 
reject the message or (B) encode the entire header.

-- 
Ian Eiloart
IT Services, University of Sussex
x3148
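
The trade-off above can be made concrete with a short Python sketch. It assumes the hash is a base32-encoded SHA-1, as discussed elsewhere in the thread, and that surrounding whitespace is trimmed before hashing; the function names are hypothetical.

```python
import base64
import hashlib


def b32_sha1(text):
    """Base32-encoded SHA-1 digest of a string."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii")


def hash_whole_header(value):
    """Option (B): hash the header value as received, brackets and all."""
    return b32_sha1(value.strip())


def hash_between_brackets(value):
    """Hash only the msg-id inside <...>; refuse anything malformed."""
    value = value.strip()
    if not (value.startswith("<") and value.endswith(">")):
        raise ValueError("no enclosing angle brackets: %r" % value)
    inner = value[1:-1]
    if "<" in inner or ">" in inner:
        raise ValueError("stray angle bracket in msg-id: %r" % value)
    return b32_sha1(inner)


for header in ("<20071002.abc@example.org>",
               "no-brackets@example.org",
               "<odd>id@example.org>"):
    print(header, hash_whole_header(header))
```

For a well-formed header both functions succeed and merely give different digests; for the two bogus headers only hash_whole_header returns a value, while hash_between_brackets has to decide what to reject.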


Re: [Mailman-Developers] Improving the archives

2007-10-02 Thread Barry Warsaw
On Aug 8, 2007, at 1:04 AM, Dale Newfield wrote:

 Jeff Breidenbach wrote:
 5.85 million messages

 That's 0.03% if you count all the messages. It is 0.008% if you
 discard the top three offenders, all of which I have contacted.

 I'd say that's a strong argument for just using the Message-ID and
 simplifying this tremendously...

 ...Barry, do you disagree?

No, I'm convinced.  Apologies for taking so long to respond.  The
code in the Mailman 3.0 branch has been updated to use only the
Message-ID.  I still think the base32-encoded sha1 hash is a good,
user-friendlier option, and that archivers should accept either.

One question: should the angle brackets on the Message-ID  be part of  
the hash or not?  I think they should, or IOW, the entire value of  
the Message-ID header is taken as the hash, though they should be  
stripped off if using the Message-ID in any kind of archive query.   
I'm open to suggestions though... comments?

 (It can still be a base32 encoded SHA hash it to make it less user  
 hostile.)
 http://wiki.list.org/display/DEV/Stable+URLs

The wiki is down at the moment (I have an issue opened on the support
tracker about that).  When it comes up, I'll update the page.

Thanks everyone for a very good thread, and especially to Jeff for
doing the analysis on real data.

-Barry



Re: [Mailman-Developers] Improving the archives

2007-08-07 Thread Jeff Breidenbach
 What we really want to know is how many (non-empty) Message-ID
 collisions are there that *don't* share a Date?  This is the number of
 messages that only-messageid loses, and that the composite identifier
 method would not lose.

I took a look at a larger dataset, 5.85 million messages from several
thousand lists. Of the messages that share message-id but not date,
most come from a small number of web-based services.

  875 come from forums.slimdevices.com
  378 come from lists.openplans.org
  265 come from nabble.com
  164 come from egroups.com
  135 come from yahoo.com
  166 come from elsewhere

That's 0.03% if you count all the messages. It is 0.008% if you
discard the top three offenders, all of which I have contacted.
I didn't try contacting Yahoo/eGroups because in my past
experience, talking to a brick wall is easier. I have not analyzed
how many of these messages are spam or have duplicate bodies,
which further discounts the percentages.

Hope this data helps.
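
The percentages quoted above can be rechecked with a few lines of Python. The counts are copied from the message; the total is taken as exactly 5,850,000, which is an assumption, since the message only says "5.85 million".

```python
counts = {
    "forums.slimdevices.com": 875,
    "lists.openplans.org": 378,
    "nabble.com": 265,
    "egroups.com": 164,
    "yahoo.com": 135,
    "elsewhere": 166,
}
TOTAL_MESSAGES = 5850000  # "5.85 million", so the results are approximate

all_collisions = sum(counts.values())
top_three = sorted(counts.values(), reverse=True)[:3]
without_top_three = all_collisions - sum(top_three)

pct_all = 100.0 * all_collisions / TOTAL_MESSAGES
pct_rest = 100.0 * without_top_three / TOTAL_MESSAGES
print("%d -> %.3f%%" % (all_collisions, pct_all))       # 1983 -> 0.034%
print("%d -> %.3f%%" % (without_top_three, pct_rest))   # 465 -> 0.008%
```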


Re: [Mailman-Developers] Improving the archives

2007-08-07 Thread Dale Newfield
Jeff Breidenbach wrote:
 5.85 million messages

 That's 0.03% if you count all the messages. It is 0.008% if you
 discard the top three offenders, all of which I have contacted.

I'd say that's a strong argument for just using the Message-ID and 
simplifying this tremendously...

...Barry, do you disagree?

(It can still be a base32-encoded SHA hash, to make it less user-hostile.)
http://wiki.list.org/display/DEV/Stable+URLs

-Dale


Re: [Mailman-Developers] Improving the archives

2007-08-01 Thread Jeff Breidenbach
 704 messages fall into this category. Of these, 596 come from a
 single (malfunctioning and duplicate spewing) list server. I have
 not yet examined the remaining 208 messages, but I'll bet anything
 many also have duplicate message bodies. Or are spam. So for this
 data set, we have an upper bound of 0.01% messages in this
 category, possibly significantly less.

Correction.

... remaining 108 ... 0.005% ...
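
The corrected figures follow directly from the numbers in the quoted message, as this short Python check shows (variable names are illustrative):

```python
TOTAL = 2151896       # messages examined
collisions = 704      # share a Message-ID but not a Date
one_server = 596      # from the single malfunctioning list server

remaining = collisions - one_server
print(remaining)                               # 108
print("%.3f%%" % (100.0 * remaining / TOTAL))  # 0.005%
```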


Re: [Mailman-Developers] Improving the archives

2007-08-01 Thread Jeff Breidenbach
 What we really want to know is how many (non-empty) Message-ID
 collisions are there that *don't* share a Date?  This is the number of
 messages that only-messageid loses, and that the composite identifier
 method would not lose.

It took longer than expected, but I now have numbers from
looking at 2,151,896 messages spread over a few thousand
lists. The appended script was run over a set of MH format
raw messages.

704 messages fall into this category. Of these, 596 come from a
single (malfunctioning and duplicate spewing) list server. I have
not yet examined the remaining 208 messages, but I'll bet anything
many also have duplicate message bodies. Or are spam. So for this
data set, we have an upper bound of 0.01% messages in this
category, possibly significantly less.

Jeff


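#!/bin/bash
#
# Look for messages that
#
# Do collide with message-id
# Don't collide with message-id + date

DIR=/home/archive/Mail

C1=0   # running total of messages examined
C2=0   # running total of collisions found

# Print "date | message-id" for every message in list $1 whose message-id
# is shared with another message but whose (date, message-id) pair is unique.
get_interesting_messages() {
    cd "$DIR/$1" || return
    for j in $(ls -U); do
        MSG_ID=$(822field message-id < "$j")
        MSG_DATE=$(822field date < "$j")
        if [ "$MSG_ID" != "" ]; then
            echo "$MSG_DATE | $MSG_ID"
        fi
    done |
    # awk does the grouping here, since stock GNU uniq has no
    # --separator option for splitting fields on '|'.
    awk -F' [|] ' '
        { line[NR] = $0; id[NR] = $NF; ids[$NF]++; pairs[$0]++ }
        END {
            for (n = 1; n <= NR; n++)
                if (ids[id[n]] > 1 && pairs[line[n]] == 1) print line[n]
        }'
}


for i in $(ls "$DIR" | grep @); do
    DUP=$(get_interesting_messages "$i")
    DUP_CNT=$(printf '%s' "$DUP" | grep -c .)   # count non-empty lines
    MSG_CNT=$(cd "$DIR/$i" && ls -U | wc -w)
    C1=$(( C1 + MSG_CNT ))
    C2=$(( C2 + DUP_CNT ))
    if [ "$DUP_CNT" != 0 ]; then
        echo
        echo "=== collisions/messages: $C2/$C1 $i"
        echo "$DUP"
    else
        echo -n . 1>&2   # progress dot on stderr
    fi
done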



Re: [Mailman-Developers] Improving the archives

2007-07-26 Thread Dale Newfield
Jeff Breidenbach wrote:
 So I just looked at 2 million raw messages from 2007, spread over
 a few thousand mailing lists (all data is from mail-archive.com). My
 first question was - when comparing only with messages from the
 same list - how many times do I see a repeated message-id? The
 answer was ... drumroll please ... 260 thousand. What the hell?

I think the question you were originally going to ask got sidetracked. 
If we assume that all these multiple paths from list to archive 
duplicates not only share a Message-ID but also a Date (they were the 
same message originally, so they should!), then both schemes (messageid, 
and messageid+date) would decide that all (but one of) these messages 
are redundant.

What we really want to know is how many (non-empty) Message-ID 
collisions are there that *don't* share a Date?  This is the number of 
messages that only-messageid loses, and that the composite identifier 
method would not lose.

-Dale
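
The quantity asked for above can be written down precisely. The Python sketch below is one reading of it: for each non-empty Message-ID, an id-only scheme keeps one message while the composite scheme keeps one per distinct Date, so the difference is the number of distinct dates minus one. The function name and the sample data are made up.

```python
from collections import defaultdict


def extra_losses(messages):
    """Messages an id-only scheme drops that id+date would have kept.

    `messages` is an iterable of (message_id, date) string pairs.
    Messages with an empty Message-ID are ignored.
    """
    dates_by_id = defaultdict(set)
    for message_id, date in messages:
        if message_id:
            dates_by_id[message_id].add(date)
    return sum(len(dates) - 1 for dates in dates_by_id.values())


sample = [
    ("<a@example.org>", "Thu, 26 Jul 2007 10:00:00 +0000"),
    ("<a@example.org>", "Thu, 26 Jul 2007 10:00:00 +0000"),  # true duplicate
    ("<b@example.org>", "Thu, 26 Jul 2007 11:00:00 +0000"),
    ("<b@example.org>", "Thu, 26 Jul 2007 12:00:00 +0000"),  # same id, new date
    ("", "Thu, 26 Jul 2007 13:00:00 +0000"),                 # empty id, skipped
]
print(extra_losses(sample))  # 1
```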


Re: [Mailman-Developers] Improving the archives

2007-07-26 Thread Jeff Breidenbach
 If you improve the script or find numbers that lead to different
 conclusions, now's the time to know!

Live and learn!

So I just looked at 2 million raw messages from 2007, spread over
a few thousand mailing lists (all data is from mail-archive.com). My
first question was - when comparing only with messages from the
same list - how many times do I see a repeated message-id? The
answer was ... drumroll please ... 260 thousand. What the hell?

Time for a closer look. In some cases, the archiver was getting two
copies of every message. For example, the MLM (mailman) was
sending out a message to subscriber A and subscriber B, and both
paths eventually lead to the archiver.

In another case, the MLM (YahooGroups) spammed 20 copies of the
same message to every subscriber, and modified the body of each one.
YahooGroups tends to create HTML mail and stick ads, possibly spyware,
and who knows what other crap in message footers.

There are probably other categories I haven't noticed yet; 260k messages
is a lot of checking. So you'd think the archives would be a complete
mess. But they aren't, and I had no idea anything was remotely amiss
under the hood. That's because mhonarc only archives one message
per message-id. So those 19 repeats from YahooGroups get thrown away.
This is actually a pretty robust strategy when you think about it; it keeps
lots of annoyances out of archives and everyone who gets smited
deserves it: accidental duplicates, malicious duplicates, broken mail
transfer agents. Reasonable people can disagree, but I like it.

So I'm amending my request. If mailman and pipermail++ want to
keep a verbatim record of everything passing through the MLM, fine.
But please make it also possible to interoperate with archivers that
use the looser mhonarc strategy, e.g. allow the interoperability URL
to collide when message-ids collide. Currently Stephen's proposal
allows this, Barry's does not.

Just to make things really concrete, here's an example from that
YahooGroups collision I was describing. The 20 messages spammed to
subscribers would all have an interoperability URL something like this
(but perhaps not quite so enormously long) embedded in the
message, in both headers and possibly a footer.

http://www.mail-archive.com/search?l=estika%40yahoogroups.com&q=3578.125.161.129.196.1175036508.CBNWebMail%40webmail1.cbn.net.id

Clicking on it, the user goes to the archive server. For this particular
archiver, an HTTP 302 redirect takes the user to another URL which
happens to be more human friendly. But the details of what alternate
URLs are available - if any - is really up to the archive server.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg01341.html

I think that's about it. I do kind of like Stephen's suggestion of
allowing the archiver to supply a formula for the interoperability URL;
if that's the case I'd say the RFC2369 headers could be fair game
for use in the calculation. That allows cross posted messages to
easily link to their correct archive - note how I used the contents of
List-Post when creating the interoperability URL above.

Jeff
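
As a rough illustration of the two ideas in this message, the Python sketch below builds an interoperability URL in the style of the example (list address in `l`, bare Message-ID in `q`) and keeps only the first message seen per Message-ID, as described for mhonarc. The function names are hypothetical and this is not mail-archive.com's or mhonarc's actual code.

```python
from urllib.parse import urlencode


def interoperability_url(list_post, message_id,
                         base="http://www.mail-archive.com/search"):
    """Build a lookup URL from the list address and the bare Message-ID.

    The `l` and `q` parameter names follow the example URL in the
    message; everything else is an assumption.
    """
    query = urlencode({"l": list_post, "q": message_id.strip("<>")})
    return base + "?" + query


def keep_first_copy(archive, message_id, body):
    """Store a message only if its Message-ID has not been seen before."""
    if message_id in archive:
        return False
    archive[message_id] = body
    return True


archive = {}
for copy in range(20):
    keep_first_copy(archive, "<dup@example.org>", "body variant %d" % copy)
print(len(archive))  # 1
print(interoperability_url("list@example.org", "<dup@example.org>"))
```

Since all 20 copies carry the same Message-ID, they all produce the same URL, and that URL resolves to the single copy the archive kept.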


Re: [Mailman-Developers] Improving the archives

2007-07-26 Thread Jeff Breidenbach
 If you are relying on the sender to do the right thing, then
 why not force them to create proper message-ids?

I think Barry's proposal is essentially a numbers game - e.g.
he's hoping for significantly better results using Date in
the calculation than not using it.

http://wiki.list.org/display/DEV/Stable+URLs

I'll try to tease out some more useful stats from some large
datasets this weekend. (I can't just run the python scripts as is
because I don't have python 2.5 in the same place as the data,
I don't keep raw message in mbox format, blah blah blah, but
we'll figure it out).

My hypothesis is Date doesn't really buy much, but that's
in part because I have a vested interest in that outcome.
We'll see how the data plays out. And I still think RFC2369
headers are needed in the calculation if cross posted
messages are to be handled correctly.

Jeff


Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Barry Warsaw
On Jul 24, 2007, at 1:11 PM, Terri Oda wrote:

 On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:
 So we just specify a header to put it in, and subscribers will be
 able to use it, per definition of a canonical URL.
 It is the archive server's job to decide what is the canonical URL
 for a message. There's a good chance these archival URLs will be
 served by an HTTP redirect. So let's not use the word canonical. :)

 Someone already pointed out that the message ID is a bit long for a
 URL, so I'm guessing we're going to want some sort of shorter
 sequence number for messages for linking purposes.

Yes, definitely.  What do you think of the base32 examples I have on  
the wiki page?

 Regardless of whether we *need* to generate our own unique ID, I'm
 leaning towards the thought that we're going to *want* to generate
 our own for usability reasons.  In a perfect world, I think we'd have
 a sequence number so I could visit
 http://example.com/mailman/archives/listname/204.html and know that
 205.html would be the next message to that list, but any short unique
 id would do if sequence numbers are too much of a pain.

 It seems silly to generate nice short links but then use message-id.
 If we can generate nice short links, we might as well use 'em
 throughout, unless you really think the default use of the archive
 will be to search it by messageid (which I sincerely doubt, from my
 user experiences).

We'd want sequence numbers in the urls if we think people will hand  
edit them, say in a browser location bar.  I'm not sure that's a  
common enough use case.

Pipermail currently uses sequence numbers but there are big problems
with that.  First, the mbox'ing algorithm wasn't always correct, so
while sequence numbers were accurate when generating the html
archives on the fly, they broke horribly when you tried to regenerate
them from an mbox file.  It's also why we have tools like cleanarch,
which tries to unbreak earlier mboxing bugs by crufty heuristics.
This /might/ be solved by ditching mboxes for maildir or some other  
canonical raw archiving format (not a bad idea in its own right), but  
manual surgery on the raw archives could still break it.  Sometimes  
site admins just /have/ to remove messages, disrupting the sequencing.
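
The renumbering problem can be seen in a toy Python example. It assumes file names are assigned purely by position when the archive is regenerated, and contrasts that with names derived from a base32-encoded SHA-1 of the Message-ID; the function names and addresses are illustrative and this is not Pipermail's code.

```python
import base64
import hashlib


def sequence_urls(message_ids):
    """Number messages 1..N in mbox order, as a rebuild from scratch would."""
    return {mid: "%d.html" % n for n, mid in enumerate(message_ids, start=1)}


def hashed_urls(message_ids):
    """Name each message after a hash of its own Message-ID."""
    urls = {}
    for mid in message_ids:
        digest = hashlib.sha1(mid.encode("utf-8")).digest()
        urls[mid] = base64.b32encode(digest).decode("ascii") + ".html"
    return urls


mbox = ["<a@example.org>", "<spam@example.org>", "<b@example.org>"]
cleaned = [mid for mid in mbox if mid != "<spam@example.org>"]

before, after = sequence_urls(mbox), sequence_urls(cleaned)
print(before["<b@example.org>"], after["<b@example.org>"])  # 3.html 2.html

same = hashed_urls(mbox)["<b@example.org>"] == hashed_urls(cleaned)["<b@example.org>"]
print(same)  # True
```

Removing one message shifts every later position-based name, while a name derived from the message itself is unaffected by what else is in the mbox.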

-Barry



Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Barry Warsaw
On Jul 24, 2007, at 2:03 PM, Jeff Breidenbach wrote:

 Regardless of whether we *need* to generate our own unique ID, I'm
 leaning towards the thought that we're going to *want* to generate
 our own for usability reasons.  In a perfect world, I think we'd have
 a sequence number so I could visit
 http://example.com/mailman/archives/listname/204.html and know that
 205.html would be the next message to that list, but any short unique
 id would do if sequence numbers are too much of a pain.

 I agree there's a lot of usability benefits from short URLs, but
 perhaps this is the job of the archive server, and not the list
 server. Mharc (an archive server) is a great example here. Mharc's
 canonical message format is pretty human friendly.

 http://ww.mhonarc.org/archive/html/mharc-users/2002-08/msg0.html

 Unfortunately, there's no trivial way for the list server to know
 that human friendly URL when the message is sent out. Fortunately,
 Mharc also happily handles messages by message-id, which the list
 server does know about.

 http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=mharc-users[EMAIL PROTECTED]

 Had I been the implementer, I'd probably have made mharc do an HTTP
 302 redirect from the longer URL to the shorter URL. But that's beside
 the point. The point is we have an existing, working, happy archival
 server, and it would be really nice if list servers (such as mailman)
 were compatible. And by compatible, I mean offering the capability of
 embedding an archival URL in the footers of messages.

I agree, I just don't think message-ids are user friendly enough to
be this canonical url.  Especially in this context, which is exactly
where urls are thrown in users' faces.  An archiving service is
exactly the right place for redirecting human readable urls to the
archiver's canonical url (by, I agree, 302).

-Barry



Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Barry Warsaw
On Jul 24, 2007, at 11:04 PM, Stephen J. Turnbull wrote:

 So we just specify a header to put it in, and subscribers will be
 able to use it, per definition of a canonical URL.

 It is the archive server's job to decide what is the canonical URL
 for a message. There's a good chance these archival URLs will be
 served by an HTTP redirect. So let's not use the word canonical. :)

 If it's not going to be canonical (I forget if there's a standard
 for that word :), what is the point in writing an RFC?

I completely agree.  Maybe "interoperable" is the right word to use.
Or "user-friendly interoperable archive url", which is really what
we're trying to define here (IMO).

 There needs to be a way to *enforce* uniqueness, and it *must* be
 specified by the RFC in order for archive implementations to be
 interoperable.  Note that word specify; I do not insist that this
 level of robustness be *required*.  But if we don't specify it now,
 people who want such robustness will have to do all this work again,
 and possibly will end up with something that some servers conforming
 to your RFC will not conform to.

Yep.

 It is possible that most archivers will simply use the message ID, and
 do something brutal in the rare case of a collision.  That's fine.
 But an archiver that wants to provide a canonical URL which is
 guaranteed to uniquely and losslessly identify a post in its archive
 should have a standard way to do that.

Yep.

 The main thing that bugs me is message-ids are long, which makes
 them awkward to embed in a URL in the footer of a message.

 The footer URL is of no concern in this discussion.  There is not
 going to be a requirement that footer URLs be canonical, not if I
 have any say in the matter.  The canonical URL will be in (or be
 constructed from) the message header.

Agreed in the sense that the RFC 2822 headers must contain all the  
information necessary to construct the canonical url (or must contain  
the canonical url).  A list server /can/ decorate the message with  
the url in other ways, but that certainly isn't necessary.

You might even imagine a mail reader extension that reads the
appropriate List-* headers and adds a "View In Archive" button which
sends the canonical url to your web browser.  Once that happens, the
archive service is free to redirect to its heart's content.  I submit
though that any good archive service (and certainly Pipermail++ if I
can help it) will ensure that those urls are stable forever,
otherwise people will stop relying on them.

-Barry



Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Barry Warsaw

On Jul 25, 2007, at 12:47 AM, Jeff Breidenbach wrote:

 What you gain from my proposal over a pure Message-ID approach
 is guaranteed uniqueness given the list copy

 Guarantee is a pretty strong word. A malicious person could post two
 messages with the same message-id, same date, but different bodies.

No question, if the archive service and the list server are not
intimately connected, the communication channel between the two can
be subverted.  There are ways that channel's trust could be enhanced
though, for example by the list server signing its headers in a
DKIM-like fashion.

But in situations where the two are co-located, you can trust these  
headers even without that enhancement.

 So that moves us to how many collisions are reduced in practice.
 I have a question about the numbers Barry mined from the python
 lists. Are the collisions really that high? One should not count
 messages without a message-id, because the MLM can and should
 create one in that case.

I've uploaded the script I used to here:

http://wiki.list.org/download/attachments/786633/scan.py?version=1

It's probably not perfect, and certainly the python.org mboxes may
not be representative enough of the real world.  Please grab the
script, tweak it and run it over your own raw archives; it should be
easily modified to handle any of the mailbox formats supported by
Python 2.5's mailbox module.

If you improve the script or find numbers that lead to different  
conclusions, now's the time to know!
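
For anyone who wants to repeat the measurement without the wiki attachment, here is a minimal sketch of this kind of scan. It is not the scan.py linked above; the function name `count_collisions` and the shape of the result are assumptions made for illustration. It only counts repeated keys and does not compare bodies, so it cannot tell a true duplicate from a collision.

```python
import mailbox
from collections import Counter


def count_collisions(messages):
    """Tally missing Message-IDs and repeated Message-ID / (Message-ID, Date) keys.

    `messages` is any iterable of objects with a dict-like .get(), for
    example mailbox.mbox("/path/to/list.mbox").
    """
    by_mid = Counter()
    by_mid_date = Counter()
    total = 0
    missing = 0
    for msg in messages:
        total += 1
        mid = msg.get("Message-ID")
        if mid is None:
            # The list manager can and should create one in this case,
            # so these are reported separately rather than as collisions.
            missing += 1
            continue
        mid = mid.strip()
        date = (msg.get("Date") or "").strip()
        by_mid[mid] += 1
        by_mid_date[(mid, date)] += 1

    def repeats(counts):
        # Every occurrence after the first one counts as a repeat.
        return sum(n - 1 for n in counts.values() if n > 1)

    return {
        "total": total,
        "missing_id": missing,
        "repeat_mid": repeats(by_mid),
        "repeat_mid_date": repeats(by_mid_date),
    }
```

To run it over a raw archive, pass `mailbox.mbox(path)` as `messages`; the other classes in the mailbox module (Maildir, MH and so on) should work the same way.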

 and human friendlier urls.

 That's a very compelling point.

 SHA1 can't be computed inside someone's head or simply cut-n-pasted
 together for old messages, but I think the usability benefits of short
 URLs (short enough that they can comfortably fit inside message bodies)
 outweigh this drawback. By the way, is SHA-1 still in favor? My
 impression was it was fading away after the Shandong University team
 partially cracked it.

We're not concerned with the cryptographic security claims of SHA1.   
I don't see any economically beneficial attack on the archives  
against SHA1 here.  I think SHA1 is reasonably universally available,  
and marginally better than MD5, so it's probably good enough for this  
application.

You're right that no one is going to do SHA1 in their heads, and if  
they could, they're probably working for some TLA in a secret gubmit  
basement lab somewhere.  The point of course is that a /program/  
could easily apply the algorithm to a very minimal existing message  
and come up with the same canonical url.  This enables all kinds of  
cool applications based on REST-y principles or whatever.  The fact  
that the algorithm leads to short(ish), largely unambiguous (to  
humans), readable urls is an important benefit -- probably /the/ most  
important benefit.

 Throw it away or hide [Date]?  The former would be a problem,
 but not the latter.

 Thrown away.

Really?  Wow.  I'd have thought every archiving service would want to  
keep a record of the raw message it received on the wire.  That would  
allow it to regenerate the html archive if necessary, provide useful  
forensics, and allow for exactly the kind of data mining we're doing  
here.  I can't see /any/ reason for not saving the raw messages in  
their entirety, especially for a public list.  Maybe for a private  
one, where your data retention policies require you delete things  
after a certain amount of time, but even there, I can't see why you'd  
want to trim raw messages rather than just chucking them entirely.

 My favorite archival service is based on mhonarc,
 and raw mail goes into offline cold storage.

What's the advantage of that?  Isn't disk space cheap as dirt?
Probably cheaper if you've bought any topsoil recently :).  Still,
the raw messages are available, right?  So if there was enough
value in calculating the canonical urls so that the archive service
could be seen as an interoperability good citizen, then it could be
done.

I'll just reiterate that I'm not married to including the Date header  
in the algorithm.  Until proven otherwise by more research, I think  
it's a good idea to use because 1) it's required by RFC 2822 and 2)  
it seems to reduce collisions.  I think the algorithm I propose would  
work just as well with Message-IDs alone, although there's more of a  
chance that the non-sequence numbered url will return multiple matches.

-Barry


Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Jason Fesler
 Guarantee is a pretty strong word. A malicious person could post two
 messages with the same message-id, same date, but different bodies.

This is my concern too.  Especially since this is known information; it is 
trivial to be malicious.  Whatever was done, I think would *have* to deal 
with 'dupes', in some form or another.


Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Gustav H Meyer
Hi,

I think this is the first time that I'm posting here but hopefully
not the last. Thanks to everyone involved for an incredible project.
I'm not much of a developer but I like practical solutions and will
do everything possible to help improve in this area even if it's
just to give some feedback.

I'm very excited about this project and can't wait for the next
version to come out with full integration between web forum and
mailing list. I like this idea very much and it seems that we're
going to see it real soon. :)

On 24/07/2007 18:43, Dale Newfield wrote:
 Jeff Breidenbach wrote:
 In addition, Barry was talking about concocting a unique
 identifier from the Date field and Message-ID. I'm not a big fan of
 this idea, because the date field comes from the mail user agent
 and is often wildly corrupt; e.g. coming from 100 years in the future.
 
 Oh--I was assuming the Date to which he was referring was the current 
 timestamp at which mailman was processing the message.  I was going to 
 say that this guarantees uniqueness, but I guess there are parallel 
 mailman implementations where more than one machine/processor are all 
 serving the same list, and then two different machines/processors might 
 wind up with identical timestamps while processing two different messages.

I also like the idea of seeing the date somewhere in the URL but
IMHO we also need to see a unique sequential number. How about the
following idea:

http://my.list.server/archivebase/mylist/200707240001/msg1/
http://my.list.server/archivebase/mylist/200707250001/msg2/
http://my.list.server/archivebase/mylist/200707250002/msg3/

and at the same time allow the following:
http://my.list.server/archivebase/mylist/msg1/
http://my.list.server/archivebase/mylist/msg2/
http://my.list.server/archivebase/mylist/msg3/

This way you can see exactly how many messages were sent on a day
and how many messages have been sent since the start.
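
A minimal sketch of how an archiver might hand out both URL forms, assuming a four-digit per-day counter as in the examples above. The class name `SequentialUrls` and its method are hypothetical, and the counters live in memory only; a real archive would have to persist them and serialize access.

```python
from collections import defaultdict
from datetime import date


class SequentialUrls:
    """Hand out a dated URL and a short URL for each archived message."""

    def __init__(self, base):
        self.base = base.rstrip("/")
        self.total = 0                    # messages since the start
        self.per_day = defaultdict(int)   # messages per calendar day

    def next_urls(self, day):
        self.total += 1
        self.per_day[day] += 1
        # YYYYMMDD followed by a zero-padded per-day sequence number.
        daily = f"{day:%Y%m%d}{self.per_day[day]:04d}"
        long_url = f"{self.base}/{daily}/msg{self.total}/"
        short_url = f"{self.base}/msg{self.total}/"
        return long_url, short_url


urls = SequentialUrls("http://my.list.server/archivebase/mylist")
first = urls.next_urls(date(2007, 7, 24))
second = urls.next_urls(date(2007, 7, 25))
third = urls.next_urls(date(2007, 7, 25))
```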

BTW the sequential number does in my view not have to be a decimal
value. Anything short and sweet will do as long as you can work it
out and at the same time allow for almost unlimited growth.

Just an idea.

Regards,
Gustav H Meyer


Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Stephen J. Turnbull
Barry Warsaw writes:

  I agree, I just don't think message-ids are user friendly enough to  
  be this canonical url.  Especially in this context, which is exactly  
  where urls are thrown in users' faces.  An archiving service is  
  exactly the right place for redirecting human readable urls to the  
  archiver's canonical url (by, I agree, 302).

I'm confused (to be precise, you're confusing me).  If human readable
URLs are exactly right for redirection to the canonical URL, why does
the canonical URL need to be user friendly?

A quick remark: the git SCM uses BASE16 SHA1s for object names, but
allows you to abbreviate them to the unique prefix.  A friendly
archive could do the same for your BASE32 ids.
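
A rough sketch of what prefix abbreviation could look like on the archive side. Both function names are hypothetical, the linear scan stands in for whatever index a real archive would use, and the second ID in the usage lines is made up for the example.

```python
def resolve_prefix(prefix, known_ids):
    """Return the single ID in known_ids that starts with prefix."""
    prefix = prefix.strip().upper()
    matches = [i for i in known_ids if i.startswith(prefix)]
    if not matches:
        raise KeyError(prefix)
    if len(matches) > 1:
        raise ValueError(f"ambiguous prefix {prefix!r}: {len(matches)} matches")
    return matches[0]


def shortest_unique_prefix(full_id, known_ids, minimum=4):
    """Return the shortest prefix of full_id that matches only one known ID."""
    for length in range(minimum, len(full_id) + 1):
        candidate = full_id[:length]
        if sum(1 for i in known_ids if i.startswith(candidate)) == 1:
            return candidate
    return full_id


ids = [
    "RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI",  # the example ID from the wiki proposal
    "RXTJ4AAAAAAAAAAAAAAAAAAAAAAAAAAA",  # made up, shares a 4-character prefix
]
short = shortest_unique_prefix(ids[0], ids)
found = resolve_prefix(short.lower(), ids)
```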

Without going much into implementation, here's how I would write the
conformance section for our RFC.  The point is that I don't see any
need to discuss user-friendliness or the implementation of UUIDs for
the RFC!  This means that getting those right from the start is
not that important.

0. Conformance

0.1 List managers

A conforming list manager MUST provide the List-Archive header
field if the post is being archived.

A conforming list manager MAY provide the List-Archive-UUID header
field.  If so, the value MUST be guaranteed unique, and it MUST be
present in the post as provided to the archiver.  The contents of
this header need not be distinct from the contents of the
Message-ID header, as long as the uniqueness guarantee is
maintained.

0.2 Archives

A conforming archive MUST reserve the namespaces message-id/ and
list-post-id/ relative to its base URL for the uses described
below.

A conforming archive MUST support retrieval by Message-ID, using
the namespace message-id/$(MESSAGE-ID) relative to its base URL.
The archive specified in the List-Archive header field MUST
support access using the value of that field as its base URL.

A conforming archive SHOULD support retrieval by UUID, using the
namespace list-post-id/$(LIST-ARCHIVE-UUID) relative to its base
URL.  If the scheme is http or https, a conforming archive
that does not support retrieval by UUID SHOULD return status 501
NOT IMPLEMENTED with an entity explaining that retrieval by UUID
is not implemented.

A conforming archive MAY support friendlyurls for use where
space is constrained (eg, in a post's footer).  A conforming
archive may support any other URIs it wants to, too. <wink>  A
third party SHOULD be able to regenerate a friendlyurl from the
original message contents.

0.3 Software

Conforming archive software SHOULD provide interfaces for
generating UUIDs and friendlyurls, if retrieval is supported.
Conforming list managers SHOULD use these interfaces.

Some comments:

The interfaces for generated URLs should be provided as command line
utilities as well as callable functions.

Although the conformance level for friendlyurl support is "may", I
expect that essentially all archives will support friendlyurls.

The namespace for UUIDs and friendlyurls should probably be more
restricted than any valid URI.

"List manager" denotes any source of archival content (eg, you could
imagine a user storing their outbox in an archive, so that the "list
manager" would actually be the user's MUA).  The namespaces suggested
above are good enough, I think, but there may be better ones.

Instead of 501 NOT IMPLEMENTED, I considered 410 GONE, but that
implies a request to delete the reference.  Since this is implemented
as a header in the post, the archive could be augmented to support it
later.

In the phrase "guaranteed unique", "guaranteed" means to the level
provided by uuidgen or standard Message-ID generators.

Generation of friendlyurls or unique ids based on message body content
is probably a bad idea.
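
As an illustration of the two reserved namespaces, here is a small sketch that builds retrieval URLs from header values. The function name `archive_url` is hypothetical. Two choices in it are assumptions that the draft text above does not settle: angle brackets are stripped from the identifier, and the identifier is percent-encoded so that it is safe as a single path segment.

```python
from urllib.parse import quote


def archive_url(list_archive, namespace, identifier):
    """Join a List-Archive base URL, a reserved namespace and an identifier."""
    # List-Archive values are normally wrapped in angle brackets.
    base = list_archive.strip().strip("<>").rstrip("/")
    ident = identifier.strip().strip("<>")
    # safe="" so that '/' and '@' inside the identifier are encoded too.
    return f"{base}/{namespace}/{quote(ident, safe='')}"


by_message_id = archive_url(
    "<http://example.com/archive/mylist/>", "message-id", "<abc@example.com>"
)
by_uuid = archive_url(
    "<http://example.com/archive/mylist/>", "list-post-id", "some-unique-id"
)
```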



Re: [Mailman-Developers] Improving the archives

2007-07-25 Thread Stephen J. Turnbull
Barry Warsaw writes:

  Yes, definitely.  What do you think of the base32 examples I have on  
  the wiki page?

They're somewhat better than Message-IDs for readability, but they're
not user-friendly.

  On Jul 24, 2007, at 1:11 PM, Terri Oda wrote:
  
   It seems silly to generate nice short links but then use message-id.

The use case for the message-id is not people.  It's software, which
doesn't much care about "nice short".  But the developers debugging
and maintaining the software will thank us for the ease of verifying
that the URL goes to the right place.



Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
 Notice that of 325146 total messages, 624 of them had no message-id
 header.  Even if you aggregate dup+col, you're still looking at a
 total duplicate rate of 0.29%.

Message IDs are supposed to be unique. This is discussed in
RFC 822: 4.6.1 and RFC 1036: 2.1.5, and probably other places.
If that's not the case, the mail transfer agent is broken. I think it's
better to go ahead and use the message-id, rather than concoct
yet another "this time we mean it!" unique identifier. This is a
cost/benefit thing; the cost is some real world collisions, the benefit
is a conceptually simpler system. Conceptually simpler things are
good especially when implemented all over the place.

Which brings me to suggestion #2, which is go ahead and write
an RFC on how list servers should embed archival links in messages.
This sounds like an internet wide interoperability issue as much as
something mailman specific. Why not come up with a scheme usable
by all list servers? And also describe a specification third party archival
services can comply to. Besides, I've always wanted to help write
an RFC. If we go that route, it would be good to get input from a range
of people - one person I'd suggest is Earl Hood, author of mhonarc.

Thoughts?

Jeff





 While I'm almost tempted to ignore a
 hit rate that low, if you think of an archive holding 1B messages,
 you still get a lot of duplicates.

 OTOH, the rate goes down even lower if you consider the message-id
 and date headers.  (Note, I did not consider messages missing a date
 header).  How likely is it that two messages with the same message-id
 and date are /not/ duplicates?  Heck, at that point, I'd feel
 justified in simply automatically rejecting the duplicate and
 chucking it from the archive.

 I spent a /little/ time looking at the physical messages that ended
 up as true collisions.  Though by no means did I look at them all,
 they all looked related.  For example, with strategy 2 some messages
 look like they'd been inadvertently sent before they were completed.
 I need to see if there's any similarities in MUA behind these, but
 again, I think we might be able to safely assume that collisions on
 message-id+date can be ignored.

 That leads me to the following proposal, which is just an elaboration
 on Stephen's. First, all messages live in the same namespace; they
 are not divided by target mailing list.  Each message has two
 addresses, one is the Message-ID and one is the base32 of the sha1
 hash of the Message-ID + Date.  As Stephen proposes, Mailman would
 add these headers if an incoming message is missing them, and tough
 luck for the non-list copy.  The nice thing is that RFC 2822 requires
 the Date header and states that Message-ID SHOULD be present.

 Why the second address?  First, it provides as close to a guaranteed
 unique identifier as we can expect, and second because it produces a
 nearly human readable format.  For example, Stephen's OP would have a
 second address of

 >>> import base64, hashlib
 >>> mid
 '[EMAIL PROTECTED]'
 >>> date
 'Wed, 04 Jul 2007 16:49:58 +0900'
 >>> # XXX perhaps strip off angle brackets
 >>> h = hashlib.sha1(mid)
 >>> h.update(date)
 >>> base64.b32encode(h.digest())
 'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'

 I like base32 instead of base64 because the more limited alphabet
 should produce less ambiguous strings in certain fonts and I don't
 think the short b64 strings are short enough to justify the
 punctuation characters that would result.  While RFC 3548 specifies
 the b32 alphabet as using uppercase characters, I think any service
 that accepts b32 ids should be case insensitive.  A really Postel-y
 service could even accept '1' for 'I' and '0' for 'O' just to make it
 more resilient to human communication errors.
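
 A sketch of the forgiving input handling described above, assuming the
 ID is the unpadded 32-character base32 form of a SHA1 digest. The
 function name `normalize_global_id` is hypothetical.

```python
import base64


def normalize_global_id(text):
    """Return the canonical uppercase base32 form of a user-typed ID.

    Accepts lowercase input and maps the digits 1 and 0, which are not
    in the base32 alphabet, to the letters I and O.
    """
    cleaned = text.strip().upper().replace("1", "I").replace("0", "O")
    # Validation only: b32decode raises binascii.Error (a ValueError
    # subclass) for characters outside the alphabet or a bad length.
    base64.b32decode(cleaned)
    return cleaned


canonical = normalize_global_id("rxtj357kfotjp3nfja6kmo65x7vqohj1")
```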

 I'd like to come up with a good name for this second address, which
 would suggest the name of the X- header we stash this value in.
 X-B32-Message-ID isn't very sexy.  Maybe X-Message-Global-ID, since I
 think there's a reasonable argument to make that for well-behaved
 messages, that's exactly what this is.

 So now, think of the interface to a message store that supports this
 addressing scheme.  Well it's something like:

 class MessageStore(Interface):
     def store_message(message):
         """Store the message.

         :raises ValueError: when the message is missing either the
             Message-ID header or a Date header.
         :raises DuplicateMessageError: when a message in the store
             already has a matching Message-ID and Date.  An archive is
             free to raise this exception for duplicate Message-IDs alone.
         """

     def get_message_by_global_id(key):
         """Locate and return the message from the store that matches `key`.

         :param key: The Global ID of the message to locate.  This is the
             base32 encoded SHA1 hash of the message's Message-ID and Date
             headers.
         :returns: The message object matching the Global ID, or None if
             there is no such match.
         """
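
The quoted interface breaks off here. As a rough illustration of the two methods that are visible, here is an in-memory sketch in present-day Python. `InMemoryMessageStore` and `global_id` are hypothetical names, the header values are hashed exactly as given (angle brackets included), and UTF-8 encoding of the header text is an assumption. Messages can be any objects with a dict-like `.get()`.

```python
import base64
import hashlib


class DuplicateMessageError(Exception):
    """A message with the same Message-ID and Date is already stored."""


def global_id(message_id, date):
    """Base32 of the SHA1 hash of Message-ID followed by Date."""
    h = hashlib.sha1(message_id.encode("utf-8"))
    h.update(date.encode("utf-8"))
    return base64.b32encode(h.digest()).decode("ascii")


class InMemoryMessageStore:
    def __init__(self):
        self._by_global_id = {}

    def store_message(self, message):
        mid = message.get("Message-ID")
        date = message.get("Date")
        if mid is None or date is None:
            raise ValueError("message needs both Message-ID and Date headers")
        key = global_id(mid, date)
        if key in self._by_global_id:
            raise DuplicateMessageError(key)
        self._by_global_id[key] = message
        return key

    def get_message_by_global_id(self, key):
        # Lookups are case insensitive, as suggested in the proposal.
        return self._by_global_id.get(key.strip().upper())
```

A SHA1 digest is 20 bytes, so the base32 form is always 32 characters with no padding.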
  

Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Stephen J. Turnbull
Jeff Breidenbach writes:

   Notice that of 325146 total messages, 624 of them had no message-id
   header.  Even if you aggregate dup+col, you're still looking at a
   total duplicate rate of 0.29%.
  
  Message ID's are supposed to be unique.

Fortunately, a rule more honored in the observance than the breach.
Nonetheless, it *is* breached.  The Postel Principle applies here, IMO.

  better to go ahead and use the message-id, rather than concoct
  yet another "this time we mean it!" unique identifier.

That's not the point.  We're not going to impose this on senders;
that's what Message-ID is for, as you say.  If a sender won't provide
a proper Message-ID, third parties who get a CC are just out of luck.

I simply think we should be prepared for applications where relying on
the sender to supply a UUID is not acceptable; we need to be able to
provide one ourselves.  Creating UUIDs is a solved problem, after all.
So we just specify a header to put it in, and subscribers will be able
to use it, per definition of a canonical URL.

Then we say that an archive SHOULD provide access to the resource via
Message-ID if available, and define how to construct that URL from the
List-Archive and Message-ID headers.

  Which brings me to suggestion #2, which is go ahead and write
  an RFC on how list servers should embed archival links in messages.

I think Barry already suggested that?  Anyway, +1.  But remember, a
standards-track RFC should have a working implementation to point to.



Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread John A. Martin
 st == Stephen J Turnbull
 Re: [Mailman-Developers] Improving the archives
  Tue, 24 Jul 2007 15:56:35 +0900

st Jeff Breidenbach writes:
  Notice that of 325146 total messages, 624 of them had no
  message-id header.  Even if you aggregate dup+col, you're
  still looking at a total duplicate rate of 0.29%.

 Message ID's are supposed to be unique.

st Fortunately, a rule more honored in the observance than the
st breach.  Nonetheless, it *is* breached.  The Postel Principle
st applies here, IMO.

Taking "be conservative in what you do" as being at least as important
as "be liberal in what you accept from others", the devil can quote
this scripture to support simplicity in this instance, IMHO.

 better to go ahead and use the message-id, rather than concoct
 yet another "this time we mean it!" unique identifier.

st That's not the point.  We're not going to impose this on
st senders;

I read the quote as meaning this time we mean it really is unique,
imposing nothing on senders.

st that's what Message-ID is for, as you say.  If a sender won't
st provide a proper Message-ID, third parties who get a CC are
st just out of luck.

Right.  Maybe that will encourage compliance.  The complexity of
catering to brokenness in this instance may be too high a price to
impose on all.

jam



Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Stephen J. Turnbull
John A. Martin writes:

   better to go ahead and use the message-id, rather than concoct
   yet another "this time we mean it!" unique identifier.
  
  st That's not the point.  We're not going to impose this on
  st senders;
  
  I read the quote as meaning this time we mean it really is unique,
  imposing nothing on senders.

Ah.  If so, my reply is "if you want something done right, do it
yourself."  *All robust databases assign a unique ID to each record.*
Why shouldn't a mailing list archive do so?

  Right.  Maybe that will encourage compliance.  The complexity of
  catering to brokenness in this instance may be too high a price to
  impose on the all.

What complexity?  Mailman just does

   msg['X-List-Archive-Received-ID'] = Email.msgid()

(or however the message ID generator is spelled).  After that, it's up
to the archiver whether to do anything with it or not.  I proposed a
way that it could be used; if that's considered too complex, fine.
But simply assigning one is not complex or otherwise very costly.
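
With the standard library of a current Python, the one-liner above might be spelled as follows. This is a sketch only: `stamp_received_id` is a hypothetical helper, and the header name is the one proposed in this message, not an established standard.

```python
from email.message import Message
from email.utils import make_msgid

HEADER = "X-List-Archive-Received-ID"


def stamp_received_id(msg, domain=None):
    """Assign a list-generated unique ID unless the message already has one."""
    if HEADER not in msg:
        # make_msgid() returns a fresh '<...@domain>' string on each call.
        msg[HEADER] = make_msgid(domain=domain)
    return msg[HEADER]


msg = Message()
msg["Message-ID"] = "<possibly-duplicated@example.com>"
received_id = stamp_received_id(msg, domain="lists.example.org")
```

The sender's Message-ID is left untouched; the new header sits alongside it.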


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
There are three different parties coming to the table. One is
the mail transfer agent of the sender, another is the list server,
and the third is the archive server. Ideally, all three will be happy
campers.

So we just specify a header to put it in, and subscribers will be able
to use it, per definition of a canonical URL.

It is the archive server's job to decide what is the canonical URL
for a message. There's a good chance these archival URLs will be
served by an HTTP redirect. So let's not use the word canonical. :)

What complexity?  Mailman just does

  msg['X-List-Archive-Received-ID'] = Email.msgid()

Easy to introduce, harder to deal with. The archival server would now
keep track of both the message-id and the x-list-archive-received-id.
That's two namespaces that almost do the same thing. It's easier
for the archive server to keep track of one name space than two,
and - most importantly - conceptually simpler.

From the perspective of the assorted list servers, it's easier to
do nothing than to do something. So if they can get by with
just message-id (which is already implemented) and not have to add
x-list-archive-received-id, that's a smoother implementation path.
If we base on message-id, archival servers will be able to
retroactively add support for all their stored messages, even those
that are ten years old. And users holding an old message will be
able to figure out that URL without doing any computational
gymnastics.

Put another way, there's the possibility to reduce the archive
servers' implementation to "search for this message-id", which is
something really useful to have anyway, and therefore likely to
get wider support.

In addition, Barry was talking about concocting a unique
identifier from the Date field and Message-ID. I'm not a big fan of
this idea, because the date field comes from the mail user agent
and is often wildly corrupt; e.g. coming from 100 years in the future.
Very painful if the archive is showing most recent message first.
Therefore an archival server is very likely to determine message date
from the most recent received header (generally from a trusted mail
transfer agent) rather than the date field. From the archive server's
perspective, the best thing to do with the date field is throw it away.
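
A sketch of what taking the date from the Received headers could look like. It assumes the topmost Received header was added by a mail transfer agent the archive trusts, which the archive has to establish for itself. The function name `received_time` is hypothetical.

```python
import email
from email.utils import parsedate_to_datetime


def received_time(msg):
    """Return the timestamp of the most recent parseable Received header."""
    # Each hop prepends its header, so the first one is the most recent.
    for value in msg.get_all("Received") or []:
        _, sep, stamp = value.rpartition(";")
        if not sep:
            continue
        try:
            return parsedate_to_datetime(stamp.strip())
        except (TypeError, ValueError):
            # Unparseable stamp: fall through to the next hop.
            continue
    return None


raw = (
    "Received: from a.example by mx.example.org; Tue, 24 Jul 2007 12:31:00 -0400\n"
    "Received: from pc by a.example; Fri, 24 Jul 2107 09:00:00 -0400\n"
    "Date: Fri, 24 Jul 2107 09:00:00 -0400\n"
    "Message-ID: <x@example.com>\n"
    "\n"
    "body\n"
)
when = received_time(email.message_from_string(raw))
```

In the example the Date header claims the year 2107, while the archive's own hop recorded 2007.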

So for these reasons, I'd rather stick with message-id and risk
some real world collisions, instead of introduce another identifier.
If the list server receives a message with no message-id, by all means
create one on the spot.  To me, this feels like the sweet spot in terms
of cost benefit. The main thing that bugs me is message-ids are long,
which makes them awkward to embed in a URL in the footer of a
message.

Jeff


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Dale Newfield
Jeff Breidenbach wrote:
 In addition, Barry was talking about concocting a unique
 identifier from the Date field and Message-ID. I'm not a big fan of
 this idea, because the date field comes from the mail user agent
 and is often wildly corrupt; e.g. coming from 100 years in the future.

Oh--I was assuming the Date to which he was referring was the current 
timestamp at which mailman was processing the message.  I was going to 
say that this guarantees uniqueness, but I guess there are parallel 
mailman implementations where more than one machine/processor are all 
serving the same list, and then two different machines/processors might 
wind up with identical timestamps while processing two different messages.

-Dale


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Terri Oda
On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:
 So we just specify a header to put it in, and subscribers will be able
 to use it, per definition of a canonical URL.
 It is the archive server's job to decide what is the canonical URL
 for a message. There's a good chance these archival URLs will be
 served by an HTTP redirect. So let's not use the word canonical. :)

Someone already pointed out that the message ID is a bit long for a  
URL, so I'm guessing we're going to want some sort of shorter  
sequence number for messages for linking purposes.

Regardless of whether we *need* to generate our own unique ID, I'm
leaning towards the thought that we're going to *want* to generate
our own for usability reasons.  In a perfect world, I think we'd have
a sequence number so I could visit
http://example.com/mailman/archives/listname/204.html and know that
205.html would be the next message to that list, but any short unique
id would do if sequence numbers are too much of a pain.

It seems silly to generate nice short links but then use message-id.   
If we can generate nice short links, we might as well use 'em  
throughout, unless you really think the default use of the archive  
will be to search it by messageid (which I sincerely doubt, from my  
user experiences).

  Terri



Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
 Regardless of whether we *need* to generate our own unique ID, I'm
 leaning towards the thought that we're going to *want* to generate
 our own for usability reasons.  In a perfect world, i think we'd have
 a sequence number so I could visit
 http://example.com/mailman/archives/listname/204.html and know that
 205.html would be the next message to that list, but any short unique
 id would do if sequence numbers are too much of a pain.

I agree there are a lot of usability benefits from short URLs, but perhaps
this is the job of the archive server, and not the list server. Mharc (an
archive server) is a great example here. Mharc's canonical message
format is pretty human friendly.

http://www.mhonarc.org/archive/html/mharc-users/2002-08/msg0.html

Unfortunately, there's no trivial way for the list server to know that human
friendly URL when the message is sent out. Fortunately, Mharc also
happily handles messages by message-id, which the list server does know
about.

http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=mharc-users[EMAIL PROTECTED]

Had I been the implementer, I'd probably have made mharc do an HTTP 302
redirect from the longer URL to the shorter URL. But that's beside the point.
The point is we have an existing, working, happy archival server, and it would
be really nice if list servers (such as mailman) were compatible. And by
compatible, I mean offering the capability of embedding an archival URL in the
footers of messages.
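
A minimal Python sketch of the redirect idea, under stated assumptions:
the URL layout, the footer_link() and resolve() helpers, and the
short_urls mapping are all hypothetical and are not Mharc's or
Mailman's actual interface. The list server only needs the message-id
to build the long link; the archive server owns the mapping to its
short URL.

```python
from urllib.parse import quote


def footer_link(archive_base: str, list_name: str, message_id: str) -> str:
    """Build the long, message-id based link a list server could embed."""
    # Hypothetical URL layout. The id is percent-encoded so that
    # characters such as '<', '>' and '@' survive in a query string.
    return (f"{archive_base}/mesg?list={quote(list_name)}"
            f"&id={quote(message_id, safe='')}")


def resolve(message_id: str, short_urls: dict) -> tuple:
    """Return (status, headers) for an archive lookup by message-id."""
    target = short_urls.get(message_id)
    if target is None:
        return 404, {}
    # 302 redirect from the long URL to the archive's friendly URL.
    return 302, {"Location": target}
```

The point of the split is that the footer link can be computed when the
message is sent, before the archive has assigned its own short name.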

-Jeff


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw

On Jul 22, 2007, at 12:33 PM, Terri Oda wrote:

 On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
 I've looked at a few lurker archivers and I wasn't blown away by its
 user interface.  That's apparently highly configurable though.

 I've been doing a lot of thinking about interface, and I'm coming to
 the conclusion that something more like a web bulletin board is
 probably the way to go, given that people use them all the time
 without much trouble and with a fairly minimal amount of whining. ;)

I like this for several reasons.  I've long wanted a bridge between  
the traditional mailing list and a forum because to me they're  
related along a spectrum of emotional investment.

What I mean is this.  For the subjects and projects I care deeply  
about, I join the mailing list.  I want to be intimately involved in  
the day-to-day collaboration that being subscribed gives me.  I care  
enough about that that I'm willing to put up with the pain that comes  
along with mailing lists, such as the overhead for subscribing,  
deleting topics I don't care about, the occasional spam, the overhead  
of going on vacation or leaving the list, etc.

But there are even more topics or projects that I have only a  
fleeting interest in.  Say I find a bug in some X program, or wake up  
and decide to learn how to use setuptools, or find that some recent  
update broke my Linux server.  In all those cases, I might want to  
start a thread of discussion or ask a question, and be very involved  
in that thread for a week or two.  Then, my interest wanes, or I get  
my question answered, or other projects pique my interest.  Mailing  
lists are pretty bad at managing those kinds of fleeting involvement,  
but forums are quite nice.  There's usually fairly low overhead (and  
probably even less if OpenID and such were in widespread adoption)  
for joining, and when I lose interest the forum doesn't fill up my  
inbox.  OTOH, forums seem good for short 'instant' messages, but not  
so good (IMO) for free ranging, detailed discussions.  So there's a  
spectrum.

 I'm trying to use interfaces to things like comment systems (which
 are often threaded -- picture the slashdot stuff, maybe?) and popular
 boards like phpbb (which isn't threaded beyond separate topics) as
 guides to how people usually deal with conversations on the web.

 It'd actually be fairly easy, at that point, to just put a posting
 interface into the archives (yes, you'd have to be logged in, and
 yes, this means your password becomes that bit more valuable because
 someone having it can pose as you to the list... but they could do
 that by spoofing your email address so I'm not too concerned). But
 then people who don't like email or just want to pop by and check the
 list quickly could actually use mailman like a web board, which is
 something I'm pretty sure would get used (I know my users have asked
 for it in the past).

Heck, /I'd/ use it, so what more justification do we need? :)

 I've been drafting simple prototype interfaces in my head, trying to
 keep potential architectures in mind.  I'm hoping I'll have time this
 week to code some up HTML and see how well they actually work when
 they're not just inside my head. :)

I'd love to see the prototypes once you've committed them to HTML.   
The one important thing is that the individual postings will need the  
equivalent of a stable archive URL (i.e. permalink) that could be
passed around, added to web pages, etc.

-Barry



Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw

On Jul 24, 2007, at 2:02 AM, Jeff Breidenbach wrote:

 Which brings me to suggestion #2, which is go ahead and write
 an RFC on how list servers should embed archival links in messages.
 This sounds like an internet wide interoperability issue as much as
 something mailman specific. Why not come up with a scheme usable
 by all list servers? And also describe a specification third party  
 archival
 services can comply to. Besides, I've always wanted to help write
 an RFC. If we go that route, it would be good to get input from a  
 range
 of people - one person I'd suggest is Earl Hood, author of mhonarc.

I've always thought that an RFC-like spec that describes how a  
generic mailing list manager would interoperate with a generic  
archiving service is the way to go.  I've written up a somewhat more
formal spec of what I've currently implemented in MM3 here:

http://wiki.list.org/display/DEV/Stable+URLs

If this looks good, I'd be happy to approach some of the related  
communities to try to get buy-in.

-Barry



Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw

On Jul 24, 2007, at 2:56 AM, Stephen J. Turnbull wrote:

 I simply think we should be prepared for applications where relying on
 the sender to supply a UUID is not acceptable; we need to be able to
 provide one ourselves.  Creating UUIDs is a solved problem, after all.
 So we just specify a header to put it in, and subscribers will be able
 to use it, per definition of a canonical URL.

 Then we say that an archive SHOULD provide access to the resource via
 Message-ID if available, and define how to construct that URL from the
 List-Archive and Message-ID headers.

I think there are two approaches we could argue for.  One is for the
mailing list manager to craft a UUID out of whole cloth and stick  
that in a header.  Then any downstream archiver would be obliged to  
use that header value as the canonical address of the message, with  
an alternative path to the message via the Message-ID (possibly  
returning a list of matching messages when there are collisions).

The second approach, and the one that I favor, is to use the
Message-ID (and the Date) header on the original message as the UUID,
properly handling corner cases like duplicate or missing headers.
This UUID serves as the basis for the address to the message resource
just like above.

I like the second approach better because in the case where you start  
with an off-list copy of the message, you have a decent enough chance  
of getting to the archived message, or at least to a resource  
containing a link to the message.  The first alternative would  
require access to the list copy.

Imagine if every archiver supported my proposal: knowing just the
Message-ID and Date header, you could get to that message from almost
anywhere, just by using the UUID as a relative URL rooted at, say,
http://www.mail-archive.com, http://groups.google.com,
http://mail.python.org/pipermail, or whatever.  That would be pretty neat.
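
A rough Python sketch of how such an ID might be derived and rooted at
an archiver's base URL. The thread does not pin down how the two
header values are combined, so stripping the angle brackets and
joining with a newline are assumptions, as are the names stable_id()
and archive_url().

```python
import base64
import hashlib


def stable_id(message_id: str, date: str) -> str:
    """Return the base32 text of SHA-1 over Message-ID plus Date."""
    # Assumption: drop the angle brackets and join the header values
    # with a newline. The thread leaves these details open.
    data = message_id.strip().strip("<>") + "\n" + date.strip()
    digest = hashlib.sha1(data.encode("utf-8")).digest()  # 20 bytes
    # 160 bits at 5 bits per character gives 32 characters, no padding.
    return base64.b32encode(digest).decode("ascii")


def archive_url(base_url: str, message_id: str, date: str) -> str:
    """Use the ID as a relative URL rooted at an archiver's base URL."""
    return base_url.rstrip("/") + "/" + stable_id(message_id, date)
```

Because the inputs are sender-provided headers, anyone holding an
off-list copy can compute the same ID for any archive that follows the
same recipe.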

-Barry



Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Barry Warsaw

On Jul 24, 2007, at 12:31 PM, Jeff Breidenbach wrote:

 What complexity?  Mailman just does

  msg['X-List-Archive-Received-ID'] = Email.msgid()

 Easy to introduce, harder to deal with. The archival server would now
 keep track of both the message-id and the x-list-archive-received-id.
 That's two namespaces that almost do the same thing. It's easier
 for the archive server to keep track of one name space than two,
 and - most importantly - conceptually simpler.

True, but an archiver already has to handle collisions on the
Message-ID so in a sense, you have to maintain multiple paths to the
same message, don't you?

So I like my proposal because it imposes nothing additional on the
MUA or MTA, a tiny bit more on the MLM, and some extra work (though I
think not much) on the archiving agent.  What you gain from my
proposal over a pure Message-ID approach is guaranteed uniqueness
given the list copy, and human-friendlier URLs.

 From the perspective of the assorted list servers, it's easier to
 do nothing than to do something. So if they can get by with
 just message-id (which is already implemented) not have to add
 x-list-archive-received-id, that's a smoother implementation path.
 If we base on message-id, archival servers will be able to
 retroactively add support for all their stored messages, even those
 that are ten years old. And users holding an old message will be
 able to figure out that URL without doing any computational
 gymnastics.

All these are still true with my proposal, except with the
observation, as Stephen points out, that given a URL based on
sender-provided headers, you must be prepared to deal with collisions, so
sometimes your resources will return lists.  The advantage of adding  
a bit of MLM-provided information is that given the list copy you can  
guarantee uniqueness, and given the off-list copy you can get to a  
resource that contains a link to the message you want.
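
One way an archiver might model this, as a hedged Python sketch: the
ArchiveIndex class and its "hash/seqno" path layout are illustrative
assumptions, not an existing archiver's API. A lookup by hash alone
returns a list of candidates, while a path that includes the
MLM-assigned sequence number identifies at most one message.

```python
from collections import defaultdict


class ArchiveIndex:
    """Toy index: hash of sender headers -> sequence numbers seen."""

    def __init__(self):
        self._by_hash = defaultdict(list)
        self._next_seqno = 1

    def add(self, hash_id: str) -> str:
        """Record a message and return its unique 'hash/seqno' path."""
        seqno = self._next_seqno
        self._next_seqno += 1
        self._by_hash[hash_id].append(seqno)
        return f"{hash_id}/{seqno}"

    def lookup(self, path: str) -> list:
        """Return matching paths: one for a full path, many for a hash."""
        hash_id, _, seqno = path.partition("/")
        candidates = self._by_hash.get(hash_id, [])
        if seqno:
            hit = seqno.isdigit() and int(seqno) in candidates
            return [path] if hit else []
        # Off-list copy: only the hash is known, so list every candidate.
        return [f"{hash_id}/{n}" for n in candidates]
```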

 Put another way, there's the possibility to reduce the archive
 servers' implementation to search for this message-id which is
 something really useful to have anyway, and therefore likely to
 get wider support.

 In addition, Barry was talking about concocting a unique
 identifier from the Date field and Message-ID. I'm not a big fan of
 this idea, because the date field comes from the mail user agent
 and is often wildly corrupt; e.g. coming from 100 years in the future.
 Very painful if the archive is showing most recent message first.
 Therefore an archival server is very likely to determine message date
 from the most recent received header (generally from a trusted mail
 transfer agent) rather than the date field. From the archive server's
 perspective, the best thing to do with the date field is throw it  
 away.

Throw it away or hide it?  The former would be a problem, but not the  
latter.  Does your archiver keep a canonical copy of the message as  
you received it?  If so, then you preserve the original Date header  
enough for the calculation to occur, even if you hide the Date  
header, or display a Received header date when you render it to  
HTML.  That doesn't matter of course.
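
To make the distinction concrete, here is a small Python sketch of an
archiver that displays the date from the most recent Received header
while leaving the original Date header in the stored message. The
function name display_date() is made up for illustration; the parsing
relies only on the standard email package.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def display_date(raw_message: str):
    """Prefer the newest Received timestamp; fall back to Date."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received", [])
    if received:
        # The topmost Received header is the most recent hop; its
        # timestamp follows the final semicolon.
        _, _, stamp = received[0].rpartition(";")
        try:
            return parsedate_to_datetime(stamp.strip())
        except (TypeError, ValueError):
            pass  # unparsable hop timestamp, fall through to Date
    date = msg["Date"]
    return parsedate_to_datetime(date) if date else None
```

Nothing here modifies the message, so a hash that includes the
original Date header can still be computed from the stored copy.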

But I should point out that I'm not married to including the Date  
header in the hash.  I like it because it appears to reduce  
collisions which I care about.  But I still like using the base32  
sha1 hash instead of the raw Message-ID because I think it's easier  
for humans to use, read, speak, and copy.  Of course this doesn't  
mean that you need to disable your search-by-Message-ID feature!

 So for these reasons, I'd rather stick with message-id and risk
 some real world collisions, instead of introduce another identifier.
 If the list server receives a message with no message-id, by all means
 create one on the spot.  To me, this feels like the sweet spot in  
 terms
 of cost benefit. The main thing that bugs me is message-ids are long,
 which makes them awkward to embed in a URL in the footer of a
 message.

Another advantage for the URL scheme I propose.  You know you're
going to end up with URLs of
len(host-prefix) + 32 + 1 + #digits-in-seqno

(32 == len(base32(sha1digest(data))))
(1 == / divider)
(#digits-in-seqno == e.g. len(str(seqno)))

You should be able to keep things in the 60-70 character range,  
including the host name.  That doesn't seem too bad.
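
The same arithmetic as a tiny Python helper, for anyone who wants to
plug in their own numbers. The function name and the example prefix
are hypothetical.

```python
HASH_CHARS = 32  # base32 text of a 20-byte SHA-1 digest


def url_length(host_prefix: str, seqno: int) -> int:
    """len(host-prefix) + 32 + 1 + #digits-in-seqno."""
    return len(host_prefix) + HASH_CHARS + 1 + len(str(seqno))


if __name__ == "__main__":
    # Hypothetical prefix and sequence number, for illustration only.
    print(url_length("http://example.org/a/", 1234))
```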

-Barry


Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Stephen J. Turnbull
Jeff Breidenbach writes:

  So we just specify a header to put it in, and subscribers will be able
  to use it, per definition of a canonical URL.
  
  It is the archive server's job to decide what is the canonical URL
  for a message. There's a good chance these archival URLs will be
  served by an HTTP redirect. So let's not use the word canonical. :)

If it's not going to be canonical (I forget if there's a standard
for that word :), what is the point in writing an RFC?

  What complexity?  Mailman just does
  
msg['X-List-Archive-Received-ID'] = Email.msgid()
  
  Easy to introduce, harder to deal with. The archival server would now
  keep track of both the message-id and the x-list-archive-received-id.
  That's two namespaces that almost do the same thing.

The implementations are similar, and there is nearly a one-to-one
correspondence.  But the semantics are very different.  Message-ID is
untrustworthy, the internal ID is trustworthy.

  So for these reasons, I'd rather stick with message-id and risk
  some real world collisions, instead of introduce another identifier.

Go ahead and stick with message-id if *you* like, but please don't
tell *me* what risks I have to accept.

There needs to be a way to *enforce* uniqueness, and it *must* be
specified by the RFC in order for archive implementations to be
interoperable.  Note the word "specify"; I do not insist that this
level of robustness be *required*.  But if we don't specify it now,
people who want such robustness will have to do all this work again,
and possibly will end up with something that some servers conforming
to your RFC will not conform to.

It is possible that most archivers will simply use the message ID, and
do something brutal in the rare case of a collision.  That's fine.
But an archiver that wants to provide a canonical URL which is
guaranteed to uniquely and losslessly identify a post in its archive
should have a standard way to do that.

  The main thing that bugs me is message-ids are long, which makes
  them awkward to embed in a URL in the footer of a message.

The footer URL is of no concern in this discussion.  There is not
going to be a requirement that footer URLs be canonical, not if I
have any say in the matter.  The canonical URL will be in (or be
constructed from) the message header.



Re: [Mailman-Developers] Improving the archives

2007-07-24 Thread Jeff Breidenbach
 What you gain from my proposal over a pure Message-ID approach
 is guaranteed uniqueness given the list copy

Guarantee is a pretty strong word. A malicious person could post two
messages with the same message-id, same date, but different bodies.
Sometimes the channel between the MLM and the archive server will
be SMTP, and spurious messages can be injected. Finally, from the archive
server's perspective, some of the MLMs might make mistakes - just like
from the MLM's perspective, some of the MTAs might make mistakes in
setting message-id. So I don't think the proposed SHA1(date, message-id)
scheme buys a hard guarantee of uniqueness. Every component has
to protect itself, but none can solve the world's problems.

So that moves us to how many collisions are reduced in practice.
I have a question about the numbers Barry mined from the python
lists. Are the collisions really that high? One should not count
messages without a message-id, because the MLM can and should
create one in that case.

One should also not count collisions of messages going to different
lists. Here's why. Let's say message M is cross posted to lists L1 and
L2. Even though it is the same message, there are now two different
contexts. (For example, people visiting M at archive L1 should get a
completely different experience when they hit "next message" than
people visiting M at archive L2.)

So I'd be curious what the collision numbers come to with these two
factors taken into account. The other takeaway is that the list name
really should be part of the URL to get proper context. The earlier example
from Mharc does this.
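
A short Python sketch of the counting rule proposed here, assuming the
archive can be flattened into (list name, message-id) pairs. The
function name and the input shape are assumptions for illustration.
Messages without a message-id are skipped, and the same id on two
different lists is not counted as a collision.

```python
from collections import Counter


def count_collisions(messages) -> int:
    """messages: iterable of (list_name, message_id or None) pairs."""
    seen = Counter(
        (list_name, message_id)
        for list_name, message_id in messages
        if message_id  # the MLM would create an id for these
    )
    # Each extra occurrence within one list is one collision.
    return sum(n - 1 for n in seen.values() if n > 1)
```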

 and human friendlier urls.

That's a very compelling point.

SHA1 can't be computed inside someone's head or simply cut-n-pasted
together for old messages, but I think the usability benefits of short
URLs (short enough that they can comfortably fit inside message bodies)
outweigh this drawback. By the way, is SHA-1 still in favor? My
impression was it was fading away after the Shandong University team
partially cracked it.

 Throw it away or hide [Date]?  The former would be a problem,
 but not the latter.

Thrown away. My favorite archival service is based on mhonarc,
and raw mail goes into offline cold storage. Of course this can be
changed for the future messages with some pain, but there's no
reasonable way for myself (or any other mhonarc users in the
same predicament) to retrofit against Date based URLs. For the
record, here's what mhonarc embeds in each HTML page it
produces because these were considered the important headers.
In this message sent from Australia, the date shows a timezone
of UTC -0700, because it was pulled from the received header.

<!-- MHonArc v2.6.15 -->
<!--X-Subject: [Gossip] Re: green&#45;travel resources {webliographies} -->
<!--X-From-R13: [nephf Z. Saqvpbgg zraqvpbgNlnubb.pbz -->
<!--X-Date: Wed, 26 Apr 2006 00:27:27 &#45;0700 -->
<!--X-Message-Id: [EMAIL PROTECTED] -->
<!--X-Content-Type: text/plain -->
<!--X-Reference: [EMAIL PROTECTED] -->
<!--X-Head-End-->

So my main request is to double check the numbers, see if using
Date really buys as much as one thinks. I'll keep digesting the
other aspects of the wiki page.


Re: [Mailman-Developers] Improving the archives

2007-07-22 Thread Terri Oda
On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
 I've looked at a few lurker archivers and I wasn't blown away by its
 user interface.  That's apparently highly configurable though.

I've been doing a lot of thinking about interface, and I'm coming to  
the conclusion that something more like a web bulletin board is  
probably the way to go, given that people use them all the time  
without much trouble and with a fairly minimal amount of whining. ;)  
I'm trying to use interfaces to things like comment systems (which  
are often threaded -- picture the slashdot stuff, maybe?) and popular  
boards like phpbb (which isn't threaded beyond separate topics) as  
guides to how people usually deal with conversations on the web.

It'd actually be fairly easy, at that point, to just put a posting  
interface into the archives (yes, you'd have to be logged in, and  
yes, this means your password becomes that bit more valuable because  
someone having it can pose as you to the list... but they could do  
that by spoofing your email address so I'm not too concerned). But  
then people who don't like email or just want to pop by and check the  
list quickly could actually use mailman like a web board, which is  
something I'm pretty sure would get used (I know my users have asked  
for it in the past).

I've been drafting simple prototype interfaces in my head, trying to  
keep potential architectures in mind.  I'm hoping I'll have time this
week to code some up in HTML and see how well they actually work when
they're not just inside my head. :)

  Terri



Re: [Mailman-Developers] Improving the archives

2007-07-22 Thread Dale Newfield
Terri Oda wrote:
 I've been doing a lot of thinking about interface, and I'm coming to  
 the conclusion that something more like a web bulletin board is  
 probably the way to go

For public lists, the answer may lie in external tools like nabble.com 
or mailinglistarchive.com

Of course, that doesn't help for lists wishing to keep their content 
private.

-Dale


Re: [Mailman-Developers] Improving the archives

2007-07-21 Thread A.M. Kuchling
On Fri, Jul 20, 2007 at 11:16:19AM -0400, Barry Warsaw wrote:
 Cool.  I wonder if lurker is compatible with Python 2.5's  
 mailbox.Maildir implementation and whether the two could share the  
 maildirs.  Thanks for the information!

It had better be -- Maildir has a published specification.  If there's
an incompatibility, that would be a bug in either mailbox.py or
lurker.
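
For reference, a minimal round trip through the standard library's
mailbox.Maildir, written as a Python sketch. It only shows that two
independent handles on the same directory see the same message; it
says nothing about lurker itself, and the paths and header values are
made up.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def write_and_read_back(root: str) -> list:
    """Add one message to a Maildir, then list it via a second handle."""
    path = os.path.join(root, "Maildir")
    box = mailbox.Maildir(path, create=True)
    msg = EmailMessage()
    msg["Message-ID"] = "<example-1@lists.example.org>"
    msg["Subject"] = "Maildir round trip"
    msg.set_content("hello\n")
    box.add(msg)
    box.flush()
    # A second handle stands in for another program sharing the maildir.
    reader = mailbox.Maildir(path, create=False)
    return [m["Message-ID"] for m in reader]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(write_and_read_back(tmp))
```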

--amk



Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw

On Jul 4, 2007, at 3:30 PM, Jeff Breidenbach wrote:

 Maybe a way to think about this is that the canonical url is based on
 the message-id, but then there's some way to distill even this down
 to a tinyurl or simple integer that would be stable in the face of
 full archive regenerations.

 I'd suggest the reverse. Keep the canonical archive URL short and
 sweet, and then use a URL redirection service to map message-id's
 to those URLs. It is the archiver's job to make it all work. For  
 example,
 the canonical  archive URL might stay exactly the way it is in  
 pipermail.
 But the archival link embedded in the message would instead go
 to a redirection service.

I agree.  My proposed global message id is exactly the canonical  
archive URL, although it's relative to the archiver's base url, as  
given in the List-Archive header.

 http://mail.codeit.com/pipermail/zcommerce/2002-February/000523.html
 http://mail.codeit.com/[EMAIL PROTECTED]

 The one other thing I'd like to revisit is integration with third party
 archival services. There are two obvious integration points; one is a
 button in the Mailman list admin user interface that says archive  
 with
 service X not unlike the setting in Firefox that basically says  
 search
 with service X.

I think we could define an interface that archive services would have  
to meet in order to be available to list admins.  The site admin  
would of course have to enable them site-wide first.  What kinds of
information would be required?

- List-Archive base URL
- Message injection procedure
- Additional subscription procedures

The nice thing is that if my global id idea works, the injection  
process can be completely asynchronous.
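
As a sketch only, the three items above could be expressed as a small
Python interface. Every name here (Archiver, InMemoryArchiver,
archive_with) is hypothetical and not part of Mailman; the in-memory
class exists just to show the shape of a conforming service.

```python
from dataclasses import dataclass, field
from email.message import Message
from typing import Protocol


class Archiver(Protocol):
    """What an archive service would have to provide (assumed shape)."""

    base_url: str  # value for the List-Archive header

    def inject(self, msg: Message) -> None: ...

    def subscribe(self, list_address: str) -> None: ...


@dataclass
class InMemoryArchiver:
    """Toy implementation used only to exercise the interface."""

    base_url: str
    stored: list = field(default_factory=list)
    subscribed: set = field(default_factory=set)

    def inject(self, msg: Message) -> None:
        self.stored.append(msg)

    def subscribe(self, list_address: str) -> None:
        self.subscribed.add(list_address)


def archive_with(archiver: Archiver, list_address: str, msg: Message) -> str:
    """Register the list, hand over the message, return the base URL."""
    archiver.subscribe(list_address)
    archiver.inject(msg)
    return archiver.base_url
```

Since the message's URL is derived from its own headers plus the base
URL, the list server does not need to wait for inject() to finish
before it can advertise the link.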

 The other integration point is the archival link
 discussed above. In which case it would be set to something like.

 http://third-party-service/[EMAIL PROTECTED]

All we'd need to know is the third party's List-Archive header  
value.  The last part of the path would always be the global message id.

-Barry



Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw

On Jul 4, 2007, at 1:16 PM, Dale Newfield wrote:

 Barry Warsaw wrote:
 Maybe a way to think about this is that the canonical url is based on
 the message-id, but then there's some way to distill even this down
 to a tinyurl or simple integer that would be stable in the face of
 full archive regenerations.

 The resistance to basing this on message-id has always been that  
 there's
 no guarantee of uniqueness...
 ...but I believe each list has some sort of counter for how many
 messages it's seen, so we could add another header with that  
 number, and
 use as a unique id the two concatenated together...
 (That way the archiver can know from the content of the header exactly
 how to generate the same unique id as mailman, which would allow  
 for the
 url-in-the-footer to happen w/o first hitting the archiver.)

I'm not crazy about this idea for a couple of reasons.  First, it  
means that someone who has a copy of the message that didn't come  
from the list (e.g. one of the two you will get of this message),  
cannot calculate this unique ID.  Second, things can happen to a list  
that might cause this sequence number to get corrupted.  Maybe a list  
will get deleted and then recreated.  Maybe it will get moved and the  
sequence number will get reset in the move.  Maybe the list will be  
upgraded to a new version of Mailman.

I think we can do just as well by using Message-ID + Date and get  
very low collision rates.

-Barry



Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw

On Jul 8, 2007, at 1:06 AM, Paul Wise wrote:

 My personal opinion is that pipermail should be removed and mailman
 should not contain a default archiver since there are plenty of good
 archivers already (lurker, mhonarc etc). Adding wrappers around them
 would be simpler than reimplementing them.

My hesitation to this has always been the turnkey question.   
Pipermail has its problems but it /does/ allow small sites to get
going very quickly with a full(-ish) solution.

It may be that most people get their Mailman installation from their  
distro or hosting service and this is no longer as important.  In  
that case, I still wouldn't chuck Pipermail, but I would try to see  
if we can adopt Jeff's goal of making the archive selection pluggable  
and easily selectable by list admins.

- -Barry

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCt4HEjvBPtnXfVAQJHQwP+P4KAQaA7uEeISQjFyb3zoMvOWwgoW3zH
taWsnVAhVmAF/hJBWDn7JtXwWiLw7ngCtGHp3MBKGBKzBjJP7ZizEMNfziaB+OoO
LOyF7sYB+KhKVi+Il7XnHYIjh6DSD8kullP+G/UNtuIsFnNs+aTntndfMKJG2Zct
E7M0F1Ok8FE=
=xXQJ
-END PGP SIGNATURE-


Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Jul 9, 2007, at 11:09 PM, Stephen J. Turnbull wrote:

 John A. Martin writes:

 In the absence of a Message-ID
 on an outgoing mail message many if not most MTAs will add one.  Why
 not let Mailman anticipate the need to add a Message-ID when  
 archiving
 the message rather than leaving it to the outgoing MTA?

 Quite.

 My reason for saying "last resort" is simply that this is not
 predictable to third parties.  Eg, I send you (a non-subscriber) a
 message with CC and no Message-ID.  You'd like to find the thread in
 the archives.  You may as well just do a linear search on that month's
 threads.

Yep, and I say tough.  Let John complain to Stephen to fix his MTA  
to add those Message-IDs so Mailman doesn't have to. ;)

 An URL based on an MD5 of the message body in theory would work, but
 in the presence of non-ASCII bodies, structured MIME, ML digests, and
 various MTA autoconversions, that seems fragile.

Agreed, and it would do no better, in fact worse, than
base32(sha1(message-id + date)).

- -Barry

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCuW3EjvBPtnXfVAQKx/AP9EUxDQmp1tiCEqJqVSFWeicq/9lThnMZN
58UUEPA47wPa1SJSk6z7+0vSfqTskwO1Frnn8OJ6X+MJAxCX4Hr86uBOnK9XW2AK
byCfeYHBdapGlrsxmPd0so+FFJODWWRu7+yyKTw6ApDwVevatEEIMPlZkMALMv5S
axC5ttHfR2E=
=c0pw
-END PGP SIGNATURE-


Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Jul 5, 2007, at 12:09 PM, John Dennis wrote:

 A little over a year ago I went on a search to find the best open  
 source
 archiver and at that time I came up with Lurker
 (http://lurker.sourceforge.net). Since then I believe Lurker has seen a
 major new revision. I also believe Lurker is the archiver used by
 Debian.

 So if you want to leverage existing open source archiving or at least
 look at an example of what would be necessary to allow easy
 external archiving integration with Mailman you might want to look at
 Lurker.

I've looked at a few lurker archivers and I wasn't blown away by its  
user interface.  That's apparently highly configurable though.

Lurker's GPL2 so that's fine.  I'd be quite hesitant about shipping  
Mailman with Lurker because it's something we don't control and it's  
not Python.  But I would be totally open to working with the Lurker  
developers on creating an easy bridge between the two systems.   
Perhaps this dovetails with Jeff's suggestion of easier integration  
with external archiving systems.

Does anybody have contacts with the Lurker community that could cross- 
post a new thread to get the discussion going?

(The same goes for any other archiver out there too.)

- -Barry

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCtBnEjvBPtnXfVAQLgJwP9HNu/r/5YYAGn0HcQAhD8b8plDSpm2tao
VcC7tROs0EyjRAQd1b3+hF102FMZzTXF/8LifgETN8K4MD9TXkxNhrTlKjmAUhLG
1tvHZT9oD73aLb81m2SuI3nbp8kQSMncPeMM4u1vGzpXfCYGK4chAPyIJ1Z5MNqj
6byAgVpwZEo=
=qjmf
-END PGP SIGNATURE-


Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Stephen J. Turnbull
Barry Warsaw writes:

  First, I want to avoid talking about file system layout.  To me,  
  that's an implementation detail we needn't worry about right now.   

Agreed.

  How likely is it that two messages with the same message-id and
  date are /not/ duplicates?

For message id generators that include a time-stamp in the generated
id, approximately the same as the probability that two messages with
the same message-id are not duplicates, no?

  Heck, at that point, I'd feel justified in simply automatically
  rejecting the duplicate and chucking it from the archive.

I'd rather not go there.  There may be applications for the archiver
that require that all mail received be filed.

Counterproposal: have a collisions namespace, and provide an
interface for the list owner to decide what to do with them.  They
could be thrown away, they could be given an alternative global ID
somehow and added (eg, the archive page could add a "See probable
duplicates too" link), or they could be put into a moderation-like
queue for list admins to decide about.
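
A minimal sketch of what a collisions namespace might look like follows. The class and method names are hypothetical, and the store is an in-memory dictionary purely for illustration; the list owner's actual choices (discard, re-ID, moderate) are left out.

```python
from collections import defaultdict


class MessageStore:
    """Toy store that parks colliding messages instead of dropping them."""

    def __init__(self):
        self.messages = {}                   # global id -> first message seen
        self.collisions = defaultdict(list)  # global id -> later arrivals

    def add(self, gid, message):
        """File the message; report whether it landed in the collisions area."""
        if gid not in self.messages:
            self.messages[gid] = message
            return "stored"
        self.collisions[gid].append(message)
        return "collision"

    def pending(self):
        """Return the colliding messages awaiting a list owner decision."""
        return {gid: list(msgs) for gid, msgs in self.collisions.items()}


store = MessageStore()
first = store.add("GID1", "original post")
second = store.add("GID1", "resend with the same headers")
```

Nothing is thrown away here, so an archiver that must file all received mail can still do so.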

  So now, think of the interface to a message store that supports this  
  addressing scheme.  Well it's something like:

I don't understand how the calling application is supposed to deal
with a DuplicateMessageError exception since it should not change
either the Message-ID or the Date if present.

I see this as a major problem with any proposal to use only author
headers in computing the global id.

  Or by using the global id, or by rejecting messages with duplicate  
  message ids.

Er, the MTA has already accepted it.  Do you plan to generate a list
manager bounce to the poster?  This has the unpleasant misfeature that
it could be used to bounce spam off the list manager, since the poster
needs to see content to determine whether this is a multiple send or
actually the intended version after a fat-finger send; we already
know the message-id isn't good enough.



Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Stephen J. Turnbull
Barry Warsaw writes:

  Second, things can happen to a list  
  that might cause this sequence number to get corrupted.

Add an X-Mailman-Sequence-Number header if not already present.

That doesn't deal with your other comments, but as I point out
elsewhere, if you don't use *any* Mailman-specific information in the
global ID, you have no sane way to handle collisions except throw them
away (or make the global ID refer to a collection resource, but that's
kinda unintuitive).
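
A sketch of the "add if not already present" step, using the standard email package. The helper name and the way the next number is supplied are assumptions; how the counter is persisted is exactly the part Barry is worried about and is not shown.

```python
from email.message import Message


def add_sequence_number(msg: Message, next_number: int) -> Message:
    """Stamp the message with a sequence number unless it already has one."""
    # Header lookup on Message objects is case-insensitive.
    if "X-Mailman-Sequence-Number" not in msg:
        msg["X-Mailman-Sequence-Number"] = str(next_number)
    return msg


fresh = Message()
add_sequence_number(fresh, 801)

# A message that already carries the header (say, re-injected from an old
# archive) keeps its original number.
stamped = Message()
stamped["X-Mailman-Sequence-Number"] = "17"
add_sequence_number(stamped, 802)
```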


Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Jul 20, 2007, at 9:21 AM, Stephen J. Turnbull wrote:

 How likely is it that two messages with the same message-id and
 date are /not/ duplicates?

 For message id generators that include a time-stamp in the generated
 id, approximately the same as the probability that two messages with
 the same message-id are not duplicates, no?

Good point, though clearly not all message-ids have timestamp  
information in them.  It does help explain why I see 600-odd more  
collisions when taking other data into account too.  I've modified my  
script to sort collisions and dupes into maildir folders, so I'll  
take a closer look when that finishes running (it takes a long time  
to slog through all 5 mboxes, even on a fairly zippy dual-G5).

 Heck, at that point, I'd feel justified in simply automatically
 rejecting the duplicate and chucking it from the archive.

 I'd rather not go there.  There may be applications for the archiver
 that require that all mail received be filed.

True.  It would ultimately be an archiver policy though.

 Counterproposal: have a collisions namespace, and provide an
 interface for the list owner to decide what to do with them.  They
 could be thrown away, they could be given an alternative global ID
 somehow and added (eg, the archive page could add a "See probable
 duplicates too" link), or they could be put into a moderation-like
 queue for list admins to decide about.

I like this.

 So now, think of the interface to a message store that supports this
 addressing scheme.  Well it's something like:

 I don't understand how the calling application is supposed to deal
 with a DuplicateMessageError exception since it should not change
 either the Message-ID or the Date if present.

 I see this as a major problem with any proposal to use only author
 headers in computing the global id.

Mailman would probably log and ignore DuplicateMessageErrors.  It  
wouldn't be Mailman's responsibility to ensure the message gets  
archived, although I concede that as currently defined, you could end  
up with list copies that had a global id header that wasn't unique.   
OTOH, if the archiver implements a collision resolution policy such  
as a 'collisions' namespace, it wouldn't ever raise  
DuplicateMessageError.
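
Roughly, the log-and-ignore behaviour could look like the sketch below. Only the DuplicateMessageError name comes from this thread; the store class, its add() signature and the logger name are invented for illustration.

```python
import logging

log = logging.getLogger("mailman.archiver")  # hypothetical logger name


class DuplicateMessageError(Exception):
    """Raised by a store that refuses a second message with the same gid."""


class StrictStore:
    """Toy store with no collision policy: duplicates are an error."""

    def __init__(self):
        self._by_gid = {}

    def add(self, gid, message):
        if gid in self._by_gid:
            raise DuplicateMessageError(gid)
        self._by_gid[gid] = message


def archive(store, gid, message):
    """Hand the message to the store; log and carry on if it is refused."""
    try:
        store.add(gid, message)
    except DuplicateMessageError:
        log.warning("archiver refused duplicate global id %s", gid)
        return False
    return True


strict = StrictStore()
accepted = archive(strict, "GID1", "original post")
refused = archive(strict, "GID1", "second post, same headers")
```

A store that files duplicates under a 'collisions' namespace would simply never take the except branch.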

 Or by using the global id, or by rejecting messages with duplicate
 message ids.

 Er, the MTA has already accepted it.  Do you plan to generate a list
 manager bounce to the poster?  This has the unpleasant misfeature that
 it could be used to bounce spam off the list manager, since the poster
 needs to see content to determine whether this is a multiple send or
 actually the intended version after a fat-finger send; we already
 know the message-id isn't good enough.

Yes, this wouldn't be an MTA bounce, it would be a Mailman bounce.   
But it would have to be subject to the same bounce rules as any other  
auto-response which could be used as a spam vector, e.g. limit the  
number of bounces per time period and don't include the entire  
original message in the bounce (as both can be, and are used as spam  
vectors).
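
As a sketch of the "limit the number of bounces per time period" rule, here is a small sliding-window limiter. The class name, the default limits and the per-address keying are all assumptions, not existing Mailman behaviour.

```python
import time
from collections import defaultdict, deque


class BounceLimiter:
    """Allow at most max_bounces auto-responses per address per period."""

    def __init__(self, max_bounces=3, period=3600.0):
        self.max_bounces = max_bounces
        self.period = period                # window length in seconds
        self._sent = defaultdict(deque)     # address -> send timestamps

    def allow(self, address, now=None):
        """Return True and record the bounce if the address is under its limit."""
        now = time.time() if now is None else now
        recent = self._sent[address]
        # Forget bounces that have aged out of the window.
        while recent and now - recent[0] >= self.period:
            recent.popleft()
        if len(recent) >= self.max_bounces:
            return False
        recent.append(now)
        return True


limiter = BounceLimiter(max_bounces=2, period=60.0)
results = [limiter.allow("poster@example.com", now=t) for t in (0, 1, 2, 61)]
```

The other half of the rule, not quoting the whole original message back, is a separate step and is not shown.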

- -Barry

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqC9fnEjvBPtnXfVAQLkEQQAhdu0BIvpRvTk92m9J/sbHVRSRxBGMqta
Cm57WyRJGBxPV3xTE4ghVzXdDyIEvUjKimRTEWbeX60WqROL6FPsmAnwmsYbW3mw
8hqNXj+SpHP+1GIYnYgY9txiM75fHDa5T0VsjpcXAwtjeepHouXAEWbegBUrIzHt
EBp5YCMqxv8=
=5tjc
-END PGP SIGNATURE-


Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Nigel Metheringham

On 20 Jul 2007, at 13:39, Barry Warsaw wrote:
 I've looked at a few lurker archivers and I wasn't blown away by its
 user interface.  That's apparently highly configurable though.

I'd be inclined to agree wrt user interface. Documentation regarding
this, and anything else to do with lurker, appears somewhat scarce -
speaking as someone who has just migrated the exim.org lists to using
lurker archiving. [previously we used mailman with the MHonArc/pipermail
hybrid]

I am considering starting a set of pages within our wiki about use of
lurker (we tend to cover almost everything else about mail so why not
that).

 Lurker's GPL2 so that's fine.  I'd be quite hesitant about shipping
 Mailman with Lurker because it's something we don't control and
 it's not Python.  But I would be totally open to working with the
 Lurker developers on creating an easy bridge between the two systems.
 Perhaps this dovetails with Jeff's suggestion of easier integration
 with external archiving systems.

Integration with externals feels like a good way to go.

 Does anybody have contacts with the Lurker community that could cross-
 post a new thread to get the discussion going?

The ML appears... lacking in vigor..

BTW lurker gives all messages an ID which is 3 parts separated by
periods. The first part is a date field - ie 20070720, the second part
is the receive time, UTC, as 6 digits, and the final part is some form
of hex id. The nice part is if you quote just the first (or first 2)
parts of message ID you get messages around that time...
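
To make the three-part layout concrete, here is a small sketch that splits such an ID and does a prefix lookup. It is based only on the description above, not on lurker's code; the sample IDs are invented and the plain prefix match is an assumption about how "messages around that time" could be found.

```python
from typing import NamedTuple


class LurkerId(NamedTuple):
    date: str    # YYYYMMDD
    time: str    # HHMMSS, UTC
    hex_id: str  # opaque hexadecimal suffix


def parse_lurker_id(value: str) -> LurkerId:
    """Split 'date.time.hex' into its parts, rejecting malformed input."""
    date, clock, hex_id = value.split(".")
    if len(date) != 8 or len(clock) != 6 or not (date + clock).isdigit():
        raise ValueError("bad date/time fields in %r" % value)
    int(hex_id, 16)  # raises ValueError if the last part is not hex
    return LurkerId(date, clock, hex_id)


def matching(ids, prefix):
    """Return the ids that start with the given date or date.time prefix."""
    return sorted(i for i in ids if i.startswith(prefix))


ids = ["20070720.131700.0a1b2c3d", "20070720.142600.deadbeef",
       "20070719.090000.00c0ffee"]
parsed = parse_lurker_id(ids[0])
same_day = matching(ids, "20070720")
```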

Nigel.

--
[ Nigel Metheringham   [EMAIL PROTECTED] ]
[ - Comments in this message are my own and not ITO opinion/policy - ]





Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Jul 20, 2007, at 9:17 AM, Nigel Metheringham wrote:


 On 20 Jul 2007, at 13:39, Barry Warsaw wrote:
 I've looked at a few lurker archivers and I wasn't blown away by its
 user interface.  That's apparently highly configurable though.

 I'd be inclined to agree wrt user interface. Documentation regarding
 this, and anything else to do with lurker, appears somewhat scarce -
 speaking as someone who has just migrated the exim.org lists to using
 lurker archiving. [previously we used mailman with the MHonArc/ 
 pipermail
 hybrid]

I noticed that!  There's no documentation link on the site.  I also  
saw your question regarding getting a message out of lurker given its  
message-id.  When I checked yesterday I didn't see a response.

 I am considering starting a set of pages within our wiki about use of
 lurker (we tend to cover almost everything else about mail so why not
 that).

That would be cool.  Feel free to add a link to your pages on the  
Mailman wiki, perhaps here:

http://wiki.list.org/display/DOC/Home

 Does anybody have contacts with the Lurker community that could  
 cross-
 post a new thread to get the discussion going?

 The ML appears... lacking in vigor..

 BTW lurker gives all messages an ID which is 3 parts separated by
 periods. The first part is a date field - ie 20070720, the second part
 is the receive time, UTC, as 6 digits, and the final part is some form
 of hex id. The nice part is if you quote just the first (or first 2)
 parts of message ID you get messages around that time...

Obviously Mailman can't know the second and third parts so it can't  
use them in its list copies.  I dislike using YYYYMMDD because of the
high number of collisions.

I should make clear that what I'm really proposing is not specific to  
Mailman or any particular archiver.  It's really an interface to a  
generic message store.  We succeed by convincing other mailing list  
software and archivers to adopt the same standard so that they can  
interoperate seamlessly.  We can perhaps have the first  
implementations of this de facto standard (any latent RFC shepherds
out there? :).  We get everyone else to adopt it when we take over  
the world.

- -Barry

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqDGNHEjvBPtnXfVAQIwVQQAlwcmmuoXz/vKlpdu27wCHnfpwhhrQMmn
DWMEayuJsG+qg3GvkwyHGkgTBalENdDWWAQpPE9Zf9nmY24FyqhqRpe/QhOCajBV
4+lvXR1FARur4y4E9Lzcjz1TzX3lkaxx3dVCqpOtJxNVVvv442eYsLf11E3Z+wxY
m+ootMkR5pE=
=y4za
-END PGP SIGNATURE-


Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Jul 20, 2007, at 9:31 AM, Stephen J. Turnbull wrote:

 Barry Warsaw writes:

 Second, things can happen to a list
 that might cause this sequence number to get corrupted.

 Add an X-Mailman-Sequence-Number header if not already present.

 That doesn't deal with your other comments, but as I point out
 elsewhere, if you don't use *any* Mailman-specific information in the
 global ID, you have no sane way to handle collisions except throw them
 away (or make the global ID refer to a collection resource, but that's
 kinda unintuitive).

I'd probably call it X-List-Sequence-Number and I'd have to ensure  
that archive copy had that header in it.  OTOH, if I'm going to go to  
the trouble of adding this sequence number, why not just calculate a  
(more likely) gid for the message myself?  If I did that, I could use  
a tinyurl scheme and get much shorter urls.  The archiver would then  
be obliged to use my X-List-GID header verbatim.

I've been pushing for calculating this using non-Mailman headers  
because I'd /like/ for a client receiving the non-list copy to be  
able to make the same calculation.  OTOH, maybe we can have it both  
ways.

So, we calculate the sequence number and generate the following headers:

X-List-Sequence-Number: 801
X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

The latter is composed of purely author generated data, the former is  
supplied by Mailman.

Assuming we also had this header:

List-Archive: http://archive.example.com/gid/

then the following urls would both point to the exact same resource:

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801

If however we subsequently got a collision, then these two urls would  
address different resources.  E.g.:

X-List-Sequence-Number: 2112
X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

Now the two messages would still be addressable by their respective  
urls:

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801
http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/2112

but

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

would be a disambiguation page.  For a web u/i it would be an HTML  
list containing relative links to '801' and '2112'.  A RESTful XML  
document would contain the set of links to the subordinate pages.  A  
client of the archive.example.com service would have to be prepared  
to handle disambiguation pages if it used only the author generated  
GID, but it would be guaranteed that the full url would lead directly  
to one and only one email message.
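
The lookup rules above can be sketched as a small resolver. The class name and the in-memory index are hypothetical, and paths are taken relative to the List-Archive base url.

```python
class GidIndex:
    """Toy index mapping gid -> {sequence number -> message}."""

    def __init__(self):
        self._by_gid = {}

    def add(self, gid, seq, message):
        self._by_gid.setdefault(gid, {})[seq] = message

    def resolve(self, path):
        """Resolve 'GID' or 'GID/SEQ' to a message or a disambiguation list."""
        parts = path.strip("/").split("/")
        bucket = self._by_gid[parts[0]]
        if len(parts) == 2:
            # Full url: always exactly one message.
            return ("message", bucket[int(parts[1])])
        if len(bucket) == 1:
            # Bare gid with no collision: same resource as the full url.
            return ("message", next(iter(bucket.values())))
        # Bare gid with collisions: list the sequence numbers to choose from.
        return ("disambiguation", sorted(bucket))


GID = "RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI"
index = GidIndex()
index.add(GID, 801, "first message")
before = index.resolve(GID)
index.add(GID, 2112, "colliding message")
after = index.resolve(GID)
direct = index.resolve(GID + "/801")
```

A web u/i would render the disambiguation result as relative links to '801' and '2112'.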

Archivers would have to recognize the X-List-Sequence-Number and honor
it whenever they regenerate their archives so that the urls remain
stable.

Thinking about this more (and I've been up since about 3:30am so I'm  
a little foggy right now ;), we may want to optimize for fewer dupes  
rather than fewer collisions, or maybe it doesn't matter.  It would  
be interesting to see how big the message-id buckets are when only  
using the Message-ID header.

- -Barry

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqDBtHEjvBPtnXfVAQLOggQAhIjxlU2jPDb5K8Lfe3NThjgwKiPblqtm
UurUj+AZCffS1ewGDlV6y3GGRnHEzdVSIVvAiATEGTRVG8Zzbbev3GXs0EKYiEyL
FZreNcPqDAPL0KSGw73RdAiwZuszfQcMTsSwOx98zS9Kz0NtbntYQTuqQZwo7wAW
3KeGe2PkpaI=
=yhaZ
-END PGP SIGNATURE-


Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Nigel Metheringham

On 20 Jul 2007, at 15:26, Barry Warsaw wrote:
 BTW lurker gives all messages an ID which is 3 parts separated by
 periods. The first part is a date field - ie 20070720, the second
 part is the receive time, UTC, as 6 digits, and the final part
 is some form of hex id. The nice part is if you quote just the
 first (or first 2) parts of message ID you get messages around that
 time...

 Obviously Mailman can't know the second and third parts so it can't
 use them in its list copies.  I dislike using YYYYMMDD because of the
 high number of collisions.

It's used as part of a UID, but has the nice feature of allowing easy
queries as to other messages at that time.

If the archiver is local you also have the information for part 2 of the
UID - lurker takes it from the From_ line.

Nigel.
--
[ Nigel Metheringham   [EMAIL PROTECTED] ]
[ - Comments in this message are my own and not ITO opinion/policy - ]





Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi Nigel,

On Jul 20, 2007, at 10:38 AM, Nigel Metheringham wrote:

 On 20 Jul 2007, at 15:26, Barry Warsaw wrote:
 BTW lurker gives all messages an ID which is 3 parts separated by
 periods. The first part is a date field - ie 20070720, the second
 part is the receive time, UTC, as 6 digits, and the final part
 is some form of hex id. The nice part is if you quote just the
 first (or first 2) parts of message ID you get messages around that
 time...

 Obviously Mailman can't know the second and third parts so it can't
 use them in its list copies.  I dislike using YYYYMMDD because of the
 high number of collisions.

 It's used as part of a UID, but has the nice feature of allowing easy
 queries as to other messages at that time.

That should definitely be a way to traverse to the message, but it's  
not the message's global id (a.k.a. canonical address relative to the  
base url of the message store).  An archiver could provide other ways  
to traverse to the message, such as:

/[EMAIL PROTECTED]/ to see all messages by me
/[EMAIL PROTECTED]/mailman-developers/20070720 to see all messages by  
me today to this mailing list
/Subject?Improving%20the%20archives&sort=thread to find all the
messages in this thread regardless of when they were posted

etc.

 If the archiver is local you also have the information for part 2  
 of the
 UID - lurker takes it from the From_ line.

Mailman gets the From_ line before passing off to the archiver.  But  
that's interesting, does lurker /require/ the From_ line?

- -Barry

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqDMInEjvBPtnXfVAQKJFAP/Y3FsBIXrSaRZ85eCl+pVTZxez2uRn0KB
2OMBV6vS/qC8K1R/myeGpBVr44yE/AfTa+kf+MLSlIlMpJdUlWDMWw2G90IPy1gv
t1VGrwbVPmOlLFxF8kIsi6NKIZpKoJrJVdQnSc+uPCqowIDU9FQ57+2hrH8HayTS
ISAZ0FTgAzk=
=sp+m
-END PGP SIGNATURE-


Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Nigel Metheringham

On 20 Jul 2007, at 15:52, Barry Warsaw wrote:
 Mailman gets the From_ line before passing off to the archiver.   
 But that's interesting, does lurker /require/ the From_ line?


Well lurker handles Maildir - no From_ but the same info is in the  
filename, and it can take messages on stdin without a From_ - at  
which point I guess it's either faking it (from the headers) or making
things up.

Nigel.

--
[ Nigel Metheringham   [EMAIL PROTECTED] ]
[ - Comments in this message are my own and not ITO opinion/policy - ]




Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Barry Warsaw
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Jul 20, 2007, at 10:59 AM, Nigel Metheringham wrote:

 On 20 Jul 2007, at 15:52, Barry Warsaw wrote:
 Mailman gets the From_ line before passing off to the archiver.
 But that's interesting, does lurker /require/ the From_ line?


 Well lurker handles Maildir - no From_ but the same info is in the
 filename, and it can take messages on stdin without a From_ - at
 which point I guess it's either faking it (from the headers) or making
 things up.

Cool.  I wonder if lurker is compatible with Python 2.5's  
mailbox.Maildir implementation and whether the two could share the  
maildirs.  Thanks for the information!
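
For reference, the Python side of that would look something like the sketch below, using the standard library's mailbox.Maildir (written against a current Python rather than 2.5). The directory name and message contents are made up, and whether lurker can read the same directory remains an open question here.

```python
import mailbox
import tempfile
from email.message import EmailMessage

with tempfile.TemporaryDirectory() as tmp:
    # create=True builds the tmp/, new/ and cur/ subdirectories if missing.
    box = mailbox.Maildir(tmp + "/archive", create=True)

    msg = EmailMessage()
    msg["Message-ID"] = "<example-id@example.com>"  # hypothetical value
    msg["Subject"] = "Improving the archives"
    msg.set_content("Hello from the list.")

    key = box.add(msg)                       # returns the new message's key
    subjects = [m["Subject"] for m in box]   # iterating yields the messages
    count = len(box)
```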

- -Barry

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqDRw3EjvBPtnXfVAQJHXwP/SiKhWiZ57thW84RBUWt9QVjf4KISEfRJ
H5lioRVPYYegiJp7rf/08TutkNsxGCHzRd/cdMEFXMkrCAdifLQ2QIdS4LRvEKyY
eRbVHcmxyAlwMbyUq36W+pcH2MutTM64HKNrbL9YRSTaLyMA11FnmaiGIK3RMnbM
AqtLGRSJ8Ec=
=D8oM
-END PGP SIGNATURE-


Re: [Mailman-Developers] Improving the archives

2007-07-20 Thread Stephen J. Turnbull
Barry Warsaw writes:

  But it would have to be subject to the same bounce rules as any other  
  auto-response which could be used as a spam vector, e.g. limit the  
  number of bounces per time period and don't include the entire  
  original message in the bounce

But that prevents detecting a prematurely sent message, which is
presumably a common use case for genuine collisions.

I just don't think bouncing back is going to be very useful; either
you don't give the user the information he needs to figure out what
happened, or you give the spammers a vector.



Re: [Mailman-Developers] Improving the archives

2007-07-09 Thread Stephen J. Turnbull
John A. Martin writes:

  In the absence of a Message-ID
  on an outgoing mail message many if not most MTAs will add one.  Why
  not let Mailman anticipate the need to add a Message-ID when archiving
  the message rather than leaving it to the outgoing MTA?

Quite.
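
For concreteness, adding a missing Message-ID at archive time could be as small as the sketch below. The helper name is hypothetical; make_msgid is the standard library generator, and the ID it produces is of course not predictable to anyone who only has the original copy.

```python
from email.message import Message
from email.utils import make_msgid


def ensure_message_id(msg: Message, domain=None) -> Message:
    """Add a generated Message-ID only when the message has none."""
    if msg["Message-ID"] is None:
        msg["Message-ID"] = make_msgid(domain=domain)
    return msg


bare = ensure_message_id(Message(), domain="lists.example.com")

has_id = Message()
has_id["Message-ID"] = "<already-set@example.com>"
ensure_message_id(has_id)
```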

My reason for saying "last resort" is simply that this is not
predictable to third parties.  Eg, I send you (a non-subscriber) a
message with CC and no Message-ID.  You'd like to find the thread in
the archives.  You may as well just do a linear search on that month's
threads.

An URL based on an MD5 of the message body in theory would work, but
in the presence of non-ASCII bodies, structured MIME, ML digests, and
various MTA autoconversions, that seems fragile.



Re: [Mailman-Developers] Improving the archives

2007-07-07 Thread Paul Wise
On 7/3/07, Terri Oda [EMAIL PROTECTED] wrote:

 I'm trying to remember all the things people have suggested for the
 archives in the past so I can figure out what needs to be done and
 what might be nice to have, and see if this is doable in the time I
 have in the foreseeable future.

At lists.indymedia.org, we use a patch that provides these:

* stable URLs based on a generated message id
* URLs to the archived message in the message headers
* message hiding

http://lists.indymedia.org/patches/imc-10-mmid_hide_posts.patch

It poses a bit of a migration issue since all the existing mboxes may
or may not have the mmid header in them. We worked around that by
having a special place for the old archives.

We've been meaning to move to lurker for years, but haven't had the
human resources and also there were some showstoppers:

* public/private lists - lurker couldn't do that properly when we looked
* lack of date-based index to the archives
* general navigation issues; stuff like linking between current thread
and nearby ones
* mailto links (has now been fixed)
* the migration nightmare

My personal opinion is that pipermail should be removed and mailman
should not contain a default archiver since there are plenty of good
archivers already (lurker, mhonarc etc). Adding wrappers around them
would be simpler than reimplementing them.

-- 
bye,
pabs

https://docs.indymedia.org/view/Main/PaulWise


Re: [Mailman-Developers] Improving the archives

2007-07-05 Thread John Dennis
On Tue, 2007-07-03 at 20:05 -0400, Barry Warsaw wrote:
 On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:
 
  Since I've largely finished up the coding contract that was eating up
  a lot of my time, I'm thinking that I'd like to do some coding for
  fun.  And nothing says fun like trying to fix the Mailman archives! ;)
 
 That would be awesome Terri!  It's an aspect of Mailman that sorely  
 needs attention, and you will gain (even more) fame and fortune by  
 working on it. :)  I totally support this effort.

A little over a year ago I went on a search to find the best open source
archiver and at that time I came up with Lurker
(http://lurker.sourceforge.net). Since then I believe Lurker has seen a
major new revision. I also believe Lurker is the archiver used by
Debian.

So if you want to leverage existing open source archiving or at least
look at an example of what would be necessary to allow easy
external archiving integration with Mailman you might want to look at
Lurker.
-- 
John Dennis [EMAIL PROTECTED]




Re: [Mailman-Developers] Improving the archives

2007-07-05 Thread Terri Oda
On 5-Jul-07, at 12:09 PM, John Dennis wrote:
 A little over a year ago I went on a search to find the best open  
 source
 archiver and at that time I came up with Lurker
 (http://lurker.sourceforge.net) Since then I believe Lurker has seen a
 major new revision. I also believe Lurker is the archiver used by
 Debian.

I was hoping someone would post that link!  Lurker was best of breed  
last time I was looking, and I'd definitely like to see what we can  
leverage there.

  Terri




Re: [Mailman-Developers] Improving the archives

2007-07-04 Thread Stephen J. Turnbull
Barry Warsaw writes:
   - archive links that won't break if the archive is rebuilt
  
  Yes, this is absolutely critical, in fact, I'd put it right at the  
  top of the list, even more so than a u/i overhaul.  Stable urls, with  
  backward compatible redirecting links if at all possible, would be  
  fantastic.

+1.  I've been wanting to do something about this, and have made
proposals (not backed with code, mea maxima culpa) for design.  I would
definitely be happy to help with this, but given time constraints, it
would be nice if somebody else could take the lead.

  Along with that, I would really like to come up with an algorithm for  
  calculating those urls without talking to the archiver.

Brad didn't like this when I suggested it before, but I didn't really
understand why not.  Anyway, FWIW:

I suggest adding an X-List-Received-ID header to all messages.  I
haven't really thought through whether the UUID in that field should
be at least partly human-readable or not, but that doesn't matter for
the basic idea.[1]  The on-disk directory format would be

/path-to-archive/private/my-list/Message-ID

for singletons (Message-ID is the author-supplied ID) and

/path-to-archive/private/my-list/Message-ID/List-Received-ID

for multiples.  These would be created on-the-fly when they occur.
They can be served as static pages.  For almost all messages, the bare
URL

http://archives.example.com/my-list/Message-ID

should Just Work (ie, return a no-such-object result or a single
message).  Where it does not, you get an index of all pages with that
message ID.
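
The layout described above can be sketched roughly as follows. This is only an illustration of the idea, not code from Mailman or any patch: the function names, the percent-encoding of the Message-ID, and the three result labels are all assumptions made for the example.

```python
from pathlib import Path
from urllib.parse import quote


def archive_path(root, list_name, message_id, received_id=None):
    """Return the on-disk location for a message under the proposed layout."""
    # Percent-encode the ID so '/', '@' and similar characters are safe
    # in both file names and URLs (an assumption, not part of the proposal).
    safe_mid = quote(message_id.strip("<>"), safe="")
    base = Path(root) / "private" / list_name / safe_mid
    if received_id is None:
        return base  # singleton: the Message-ID alone names the page
    # multiple: one entry per received copy, below a directory
    return base / quote(received_id, safe="")


def resolve(root, list_name, message_id):
    """Classify what the bare Message-ID URL would serve."""
    base = archive_path(root, list_name, message_id)
    if not base.exists():
        return ("not-found", [])
    if base.is_file():
        return ("message", [base])
    # A directory means several messages share this Message-ID.
    return ("index", sorted(base.iterdir()))
```

A static web server can serve this tree directly; resolve() only shows which of the three outcomes a given URL would produce.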

The main drawback to using Message IDs that I can see is that broken
MUAs may supply no Message-ID, or the same one repeatedly.  In the
former case, as a last resort Mailman can supply one, but that won't
help people who get a personal copy and want to find the thread.
However, I see no way to help them, anyway, beyond a generic archive
search engine.  In the latter, you get lots of messages matching the
Message-ID, and while most lists should have *zero* problems, a list
that has any instances of this problem would have many.  Again I can't
see a good way to deal with this other than a general search facility,
as computing a digest of headers or content is hard to do reliably.
Providing an index of matching posts seems like a reasonable approach,
which can be efficiently implemented (eg, as static pages).
Furthermore, the examples I've seen of both in the last few years have
all been either spam or (in the case of duplicate Message-IDs) actual
duplicates due to some mail system problem or itchy user fingers.

A minor drawback to my proposal is that if a message gets archived as
a singleton for that Message-ID, then a duplicate arrives, previously
created references in the archive will of course now return an index
rather than the desired message.  Ie, there is data corruption.  This
can be dealt with in several ways; the easiest would be to provide a
if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-for-me
link when creating the directory for multiple instances.

There's also a *very* minor benefit: repeat sends will be immediately
recognizable without checking Message-ID.

Footnotes: 
[1]  By partly human-readable I mean containing list-id and date
information.  The idea would be to have the date come first, so that
users would have a shot at identifying which of several messages is
most likely, and this would be searchable by eye with simply an
ordinary sorted index.



Re: [Mailman-Developers] Improving the archives

2007-07-04 Thread John A. Martin
Stephen J. Turnbull (st) wrote on Wed, 04 Jul 2007 16:49:58 +0900:

st The main drawback to using Message IDs that I can see is that
st broken MUAs may supply no Message-ID, or the same one
st repeatedly.  In the former case, as a last resort Mailman can
st supply one,

If the archive is considered to be a reflection of what Mailman _put_
on the wire, as distinct from what was received from the wire, then
adding a Message-ID in the absence of one already present is a reflection
of a SHOULD requirement of rfc(2)822.  In the absence of a Message-ID
on an outgoing mail message many if not most MTAs will add one.  Why
not let Mailman anticipate the need to add a Message-ID when archiving
the message rather than leaving it to the outgoing MTA?
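
A minimal sketch of that step, using only the Python standard library. The function name and the default domain are made up for the example; where in Mailman's pipeline this would run is left open.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def ensure_message_id(msg, domain="lists.example.com"):
    """Add a Message-ID header only when the message lacks one."""
    if not msg.get("Message-ID", "").strip():
        # Remove an empty header, if any, so the new one is the only copy.
        del msg["Message-ID"]
        msg["Message-ID"] = make_msgid(domain=domain)
    return str(msg["Message-ID"])
```

Calling it twice on the same message returns the same value, so the copy that goes to the archiver and the copy that goes on the wire would agree.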

jam



Re: [Mailman-Developers] Improving the archives

2007-07-04 Thread Dale Newfield
I'm all for someone taking ownership of this long-neglected component -- 
thank you for doing so!

Barry Warsaw wrote:
 Maybe a way to think about this is that the canonical url is based on  
 the message-id, but then there's some way to distill even this down  
 to a tinyurl or simple integer that would be stable in the face of  
 full archive regenerations.

The resistance to basing this on message-id has always been that there's 
no guarantee of uniqueness...
...but I believe each list has some sort of counter for how many 
messages it's seen, so we could add another header with that number, and 
use as a unique id the two concatenated together...
(That way the archiver can know from the content of the header exactly 
how to generate the same unique id as mailman, which would allow for the 
url-in-the-footer to happen w/o first hitting the archiver.)
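
Roughly, and only as a sketch: the function names, the separator, and the URL shape below are assumptions, and the counter is whatever per-list post number Mailman would record in the extra header.

```python
from urllib.parse import quote


def archive_id(message_id, post_count):
    """Combine the list's running post counter with the author's Message-ID."""
    # The counter makes the result unique even if the Message-ID repeats.
    return f"{post_count}-{quote(message_id.strip('<>'), safe='')}"


def footer_url(base_url, list_name, message_id, post_count):
    """Build the archive link without asking the archiver for anything."""
    return f"{base_url.rstrip('/')}/{list_name}/{archive_id(message_id, post_count)}"
```

Both Mailman and the archiver could run the same two functions on the same headers and arrive at the same URL.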

Just throwing out ideas,
-Dale


Re: [Mailman-Developers] Improving the archives

2007-07-04 Thread Jeff Breidenbach
Maybe a way to think about this is that the canonical url is based on
the message-id, but then there's some way to distill even this down
to a tinyurl or simple integer that would be stable in the face of
full archive regenerations.

I'd suggest the reverse. Keep the canonical archive URL short and
sweet, and then use a URL redirection service to map message-ids
to those URLs. It is the archiver's job to make it all work. For example,
the canonical archive URL might stay exactly the way it is in pipermail,
but the archival link embedded in the message would instead go
to a redirection service.

http://mail.codeit.com/pipermail/zcommerce/2002-February/000523.html
http://mail.codeit.com/[EMAIL PROTECTED]
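
The archiver's side of such a redirection service could be as small as a lookup table. The sketch below is illustrative only: the class and method names are invented, and a real service would keep the table on disk rather than in memory.

```python
class RedirectTable:
    """Map Message-IDs to the archiver's own canonical URLs."""

    def __init__(self):
        self._by_message_id = {}

    @staticmethod
    def _key(message_id):
        # Treat "<id>" and "id" as the same key.
        return message_id.strip().strip("<>")

    def register(self, message_id, canonical_url):
        """Record where the archiver stored a message."""
        self._by_message_id.setdefault(self._key(message_id), []).append(canonical_url)

    def lookup(self, message_id):
        """Return every canonical URL known for this Message-ID."""
        return list(self._by_message_id.get(self._key(message_id), []))
```

lookup() returns a list because a Message-ID is not guaranteed to be unique: one hit can be answered with a redirect, several with a short index page.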

The one other thing I'd like to revisit is integration with third party
archival services. There are two obvious integration points. One is a
button in the Mailman list admin user interface that says "archive with
service X", not unlike the setting in Firefox that basically says "search
with service X". The other integration point is the archival link
discussed above, in which case it would be set to something like:

http://third-party-service/[EMAIL PROTECTED]

Disclosure: I help run a third party archiving service, and this topic was
discussed quite a bit previously [1]. Nonetheless it seems like a good
time to revisit it given the current discussion about archive wishlists.

[1] http://www.mail-archive.com/mailman-developers@python.org/msg08772.html


Re: [Mailman-Developers] Improving the archives

2007-07-04 Thread Jeff Breidenbach
In which case [the message body link] would be set to something like.

http://third-party-service/[EMAIL PROTECTED]

Just for fun, I did a trial implementation. It works, but the URLs are
too long. For example, the URL below spends 59 characters on the
message-id and 27 characters on the listname. We're already over my
comfort level (of about 72 characters) and haven't even started to count
the hostname and other URL-lengthening overhead. Maybe this was a bad
idea after all.

http://www.mail-archive.com/search?l=mailman-developers%40python.org[EMAIL PROTECTED]

Jeff


Re: [Mailman-Developers] Improving the archives

2007-07-03 Thread Steve Huston
I'll admit to not having read previous discussions on this topic, but
I'll also add my 2 [insert lowest-denomination coin here]:

On 7/2/07 11:06 PM, Terri Oda wrote:
 - better address obfuscation (maybe by generating pages through cgi)

I run a few Wordpress sites, and there's a plugin I use called
PHPEnkoder which does a good job of this.  It basically wraps the
address in a little bit of Javascript; if you have Javascript turned
on in the browser, it's seamless, and if not you see "Javascript
required to view address" or something like that.  The theory is that
bots and such don't run JS, so it's safe from harvesting.  I'll leave
it to the list as to how true an assessment this is, but it Works For Me :)

  * Add a search option

I know there's been patches around forever that integrate ht://Dig with
Pipermail; maybe some way to do this, while still making it an option
that can be tuned?  If ht://Dig is there and you turn on the option, it
works, but if it's not then it's not required?  This would satisfy the
not adding a billion dependencies, but may be overkill as well.  I'll
also happily admit to not knowing much about the cost of search engines
to a system.

  * MUAs usually make URLs clickable. A new Archive could be used  
 when posts are distributed, in the footer, so that each message has a  
 link to the whole thread in the Archive.

This would be a Godsend.  A group at work here runs an old homebrewed
exploder, and a few years ago I tried to convert them to Mailman.  They
liked everything they saw, up until the point where they couldn't refer
to some kind of short and simple message number, and get right to that
message in the archive.  The current system generates a number based on
a simple incrementing index of the list, and many months after a mailing
people will refer to message #483, and know they can view it at
http://hostname/foo/listname/483.html - which is also posted in the
footer of the message sent out.  Of course, if the archives were based
on Message-ID headers, this may make such a number a bit unwieldy, but
if it were some kind of simple-ish system I might finally get rid of
those old lists :)

-- 
Steve Huston - W2SRH - Unix Sysadmin, Dept. of Astrophysical Sciences
  Princeton University  |ICBM Address: 40.346525   -74.651285
126 Peyton Hall |On my ship, the Rocinante, wheeling through
  Princeton, NJ   08544 | the galaxies; headed for the heart of Cygnus,
(609) 258-7375  | headlong into mystery.  -Rush, 'Cygnus X-1'


Re: [Mailman-Developers] Improving the archives

2007-07-03 Thread Barry Warsaw
On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:

 Since I've largely finished up the coding contract that was eating up
 a lot of my time, I'm thinking that I'd like to do some coding for
 fun.  And nothing says fun like trying to fix the Mailman archives! ;)

That would be awesome Terri!  It's an aspect of Mailman that sorely  
needs attention, and you will gain (even more) fame and fortune by  
working on it. :)  I totally support this effort.

 I'm trying to remember all the things people have suggested for the
 archives in the past so I can figure out what needs to be done and
 what might be nice to have, and see if this is doable in the time I
 have in the foreseeable future.

 The big things people wanted most, if I recall correctly, included:

 - modernized HTML/CSS/Themes (preferably to match a modernized web
 interface... is that all set up now?)

It's not, but Andrew Kuchling will be working on this.  I haven't yet  
revealed detailed plans, though I'm working on an email about this  
over the U.S. July 4th holiday.  But I suppose it's time for a quick  
summary: I'd like to get a Mailman 2.2 out with an updated u/i sooner  
rather than later, and if possible an updated archiver would be one  
of those few other new features that I think could go into a 2.2.   
OTOH, it would be fine if we pushed that off to Mailman 3 too, but it  
leveraged all the u/i work to be done in 2.2.

 - archive links that won't break if the archive is rebuilt

Yes, this is absolutely critical, in fact, I'd put it right at the  
top of the list, even more so than a u/i overhaul.  Stable urls, with  
backward compatible redirecting links if at all possible, would be  
fantastic.

Along with that, I would really like to come up with an algorithm for  
calculating those urls without talking to the archiver.  This would  
allow the list delivery queue to calculate the List-Archive: header  
value and any message header/footer substitutions before the message  
hits the archiver.
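
One way this could work, purely as a sketch: derive the URL from a digest of the Message-ID, so that the delivery queue and the archiver compute the same value independently. The choice of SHA-1 with base32 encoding, the function names, and the URL shape are all assumptions for illustration.

```python
import base64
import hashlib


def message_id_hash(message_id):
    """Return a fixed-length, URL-safe token derived from the Message-ID."""
    mid = message_id.strip().strip("<>")
    digest = hashlib.sha1(mid.encode("utf-8")).digest()
    # 20 digest bytes encode to exactly 32 base32 characters, no padding.
    return base64.b32encode(digest).decode("ascii")


def archive_url(base_url, list_name, message_id):
    """Compute the archive URL from headers alone."""
    return f"{base_url.rstrip('/')}/{list_name}/{message_id_hash(message_id)}"
```

Because nothing here depends on archiver state, the result could be substituted into headers and footers before the message is archived. It does not address duplicate Message-IDs.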

 - better address obfuscation (maybe by generating pages through cgi)

I'd still love to do this, and I think were it not for crawlers, we  
could get a lot of mileage out of creation on demand and caching.   
But how do you handle Google crawling your archive?

 - search

Another huge huge feature.

 - not adding a billion dependencies to Mailman

Definitely.  I'm also not opposed to changing the interface between  
Mailman and the archivers if necessary.

 Here's the list from the wiki's Mailman 2.2 page:
 http://wiki.list.org/display/DEV/Mailman+2.2

We should probably start a separate archiver wiki page.  I plan on
re-organizing the 2.2 page anyway, so I'll probably end up doing that if
you don't get around to it before me *wink*.

 (1) Is anyone working on this already?

Not that I know of.

 (2) What else is on people's wish lists for a pipermail replacement?

Other things high on my list are ditching the crufty storage  
currently being used (pickles begone!), an RSS feed, and a 'message  
storage' which could be used to vend archived messages through other  
delivery transports, such as imap or nntp.  But I'd be willing to put  
all that off for stable urls, an updated u/i, and searching.

Anything I can do to help, please let me know.

-Barry



Re: [Mailman-Developers] Improving the archives

2007-07-03 Thread Barry Warsaw
Steve makes me think of a couple of other wish list items.

On Jul 3, 2007, at 7:36 AM, Steve Huston wrote:

 On 7/2/07 11:06 PM, Terri Oda wrote:
 - better address obfuscation (maybe by generating pages through cgi)

 I run a few Wordpress sites, and there's a plugin I use called
 PHPEnkoder which does a good job of this.

I have this idea that you could gateway messages from an archive or  
mailing list to and from a bulletin board forum.  Maybe this doesn't  
fall within the scope of the archiver because I could see a 'forum  
queue' like we have an nntp queue, but in that case, being able to  
calculate an archive url without talking to the archiver becomes  
important again.  It would be nice in that case to put a link to the  
archive message in the forum post.

  * MUAs usually make URLs clickable. A new Archive could be used
 when posts are distributed, in the footer, so that each message has a
 link to the whole thread in the Archive.

 This would be a Godsend.  A group at work here runs an old homebrewed
 exploder, and a few years ago I tried to convert them to Mailman.   
 They
 liked everything they saw, up until the point where they couldn't  
 refer
 to some kind of short and simple message number, and get right to that
 message in the archive.

This reminds me, I would love to have a link in an archive message
that I could click to get the message sent to me, as it originally
appeared on the mailing list.  If I had that, I'd never need to
locally save another mailing list post.  I'd just search for the one
I wanted, go to the archive, click on the "send it to me" link, then
do a normal reply in my mail reader.

 The current system generates a number based on
 a simple incrementing index of the list, and many months after a  
 mailing
 people will refer to message #483, and know they can view it at
 http://hostname/foo/listname/483.html - which is also posted in the
 footer of the message sent out.  Of course, if the archives were based
 on Message-ID headers, this may make such a number a bit unwieldy,
 but
 if it were some kind of simple-ish system I might finally get rid of
 those old lists :)

This would be possible with today's system, but it leads to unstable  
urls, especially when you consider archive scrubbing (which, come to  
think of it, is another wish list item ;).  We'd like for an admin to  
be able to easily pull an archive message, but it's even worse than  
that.  Sometimes an admin has to scrub the actual backing message  
store (e.g. today's mbox file).  This will change the message counts  
and thus the incremental indexes.

Maybe a way to think about this is that the canonical url is based on  
the message-id, but then there's some way to distill even this down  
to a tinyurl or simple integer that would be stable in the face of  
full archive regenerations.
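
To illustrate the "simple integer" half of that idea: the number has to be assigned once and recorded separately from the message store, so that scrubbing or rebuilding the mbox never shifts it. The class below is a hypothetical in-memory sketch; a real version would persist the table next to the archive.

```python
class StableNumbering:
    """Assign each Message-ID a short integer once and never renumber."""

    def __init__(self):
        self._number_for = {}   # message_id -> assigned integer
        self._scrubbed = set()  # numbers whose messages were pulled
        self._next = 1

    def assign(self, message_id):
        """Return the existing number, or hand out the next unused one."""
        if message_id not in self._number_for:
            self._number_for[message_id] = self._next
            self._next += 1
        return self._number_for[message_id]

    def scrub(self, message_id):
        """Pull a message but keep its number reserved forever."""
        self._scrubbed.add(self._number_for[message_id])

    def is_available(self, number):
        """True if the number was assigned and its message was not pulled."""
        return 1 <= number < self._next and number not in self._scrubbed
```

Regenerating the archive then means calling assign() again for every message, which returns the numbers already on record instead of recounting positions in the mbox.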

-Barry
