Craig Loomis writes:
>Globally unique IDs, hashed IDs, etc., are very appealing from
> various CS-y and techie points of view, but are simply not memorable
> to humans or knowable by dumb external programs. I think as much, or
> more, effort should be put into delivering a straightfo
>but if you can trust yourself to generate them, consecutive
>integers provide minimal, order-preserving, perfect hashing, too!
Hmm, this sounds pretty sensible to me.
Jeff
___
Mailman-Developers mailing list
Mailman-Developers@python.org
http://mail
Or Re: [Mailman-Developers 10417] Improving the archives
I would like to interject and highlight some use cases for stable
and predictable IDs. For us, "message IDs" are directly used both by
people and ignorant programs. Our mailing lists serve as a permanent
and concise record of ou
--On 2 October 2007 22:47:35 -0400 Barry Warsaw <[EMAIL PROTECTED]> wrote:
> One question: should the angle brackets on the Message-ID be part of
> the hash or not? I think they should, or IOW, the entire value of
> the Message-ID header is taken as the hash, though they should be
> stripped o
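
For illustration, here is a minimal sketch of turning a Message-ID header value into a short archive key, with the angle-bracket question exposed as a switch. It assumes SHA-1 as the hash (an assumption; base32 as the encoding is mentioned elsewhere in this thread), and `message_id_hash` and `keep_brackets` are hypothetical names, not Mailman API.

```python
import base64
import hashlib


def message_id_hash(message_id, keep_brackets=True):
    """Return a base32 key derived from a Message-ID header value.

    Hypothetical helper: SHA-1 is an assumed hash choice, not a decision
    recorded in this thread.
    """
    value = message_id.strip()
    if not keep_brackets and value.startswith("<") and value.endswith(">"):
        # Hash only the part between the angle brackets.
        value = value[1:-1]
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    # A 20-byte SHA-1 digest encodes to exactly 32 base32 characters.
    return base64.b32encode(digest).decode("ascii")
```

Whichever choice is made, it has to be made once and applied everywhere, since the two settings give different keys for the same message.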
Question: what about crossposted messages?
Let's say a message gets sent to a list called mailman-developers
with a CC to a list called pet-bunnies. Hypothetically, of course.
Presumably, the person who got the message from pet-bunnies
should probably end up at the pet-bunnies archive, where the
m
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Aug 8, 2007, at 1:04 AM, Dale Newfield wrote:
> Jeff Breidenbach wrote:
>> 5.85 million messages
>
>> That's 0.03% if you count all the messages. It is 0.008% if you
>> discard the top three offenders, all of which I have contacted.
>
> I'd say tha
Jeff Breidenbach wrote:
> 5.85 million messages
> That's 0.03% if you count all the messages. It is 0.008% if you
> discard the top three offenders, all of which I have contacted.
I'd say that's a strong argument for just using the Message-ID and
simplifying this tremendously...
...Barry, do yo
> What we really want to know is how many (non-empty) Message-ID
> collisions are there that *don't* share a Date? This is the number of
> messages that only-messageid loses, and that the composite identifier
> method would not lose.
I took a look at a larger dataset, 5.85 million messages from s
> 704 messages fall into this category. Of these, 596 come from a
> single (malfunctioning and duplicate spewing) list server. I have
> not yet examined the remaining 208 messages, but I'll bet anything
> many also have duplicate message bodies. Or are spam. So for this
> data set, we have an upper
> What we really want to know is how many (non-empty) Message-ID
> collisions are there that *don't* share a Date? This is the number of
> messages that only-messageid loses, and that the composite identifier
> method would not lose.
It took longer than expected, but I now have numbers from
looki
> If you are relying on the sender to do the right thing, then
> why not force them to create proper message-ids?
I think Barry's proposal is essentially a numbers game - e.g.
he's hoping for significantly better results using "Date" in
the calculation than not using it.
http://wiki.list.org/disp
Jeff Breidenbach wrote:
> So I just looked at 2 million raw messages from 2007, spread over
> a few thousand mailing lists (all data is from mail-archive.com). My
> first question was - when comparing only with messages from the
> same list - how many times do I see a repeated message-id? The
> ans
> If you improve the script or find numbers that lead to different
> conclusions, now's the time to know!
Live and learn!
So I just looked at 2 million raw messages from 2007, spread over
a few thousand mailing lists (all data is from mail-archive.com). My
first question was - when comparing only
Barry Warsaw writes:
> Yes, definitely. What do you think of the base32 examples I have on
> the wiki page?
They're somewhat better than Message-IDs for readability, but they're
not user-friendly.
> On Jul 24, 2007, at 1:11 PM, Terri Oda wrote:
>
> > It seems silly to generate nice shor
Barry Warsaw writes:
> I agree, I just don't think message-ids are user friendly enough to
> be this canonical url. Especially in this context, which is exactly
> where urls are thrown in users' faces. An archiving service is
> exactly the right place for redirecting human readable urls
Hi,
I think this is the first time that I'm posting here but hopefully
not the last. Thanks to everyone involved for an incredible project.
I'm not much of a developer but I like practical solutions and will
do everything possible to help improve in this area even if it's
just to give some feedbac
> Guarantee is a pretty strong word. A malicious person could post two
> messages with the same message-id, same date, but different bodies.
This is my concern too. Especially since this is known information; it is
trivial to be malicious. Whatever was done, I think would *have* to deal
with '
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 25, 2007, at 12:47 AM, Jeff Breidenbach wrote:
>> What you gain from my proposal over a pure Message-ID approach
>> is guaranteed uniqueness given the list copy
>
> Guarantee is a pretty strong word. A malicious person could post two
> messages
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 24, 2007, at 11:04 PM, Stephen J. Turnbull wrote:
>>> So we just specify a header to put it in, and subscribers will be
>>> able
>>> to use it, per definition of a canonical URL.
>>
>> It is the archive server's job to decide what is the "can
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 24, 2007, at 2:03 PM, Jeff Breidenbach wrote:
>> Regardless of whether we *need* to generate our own unique ID, I'm
>> leaning towards the thought that we're going to *want* to generate
>> our own for usability reasons. In a perfect world, i t
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 24, 2007, at 1:11 PM, Terri Oda wrote:
> On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:
>>> So we just specify a header to put it in, and subscribers will be
>>> able
>>> to use it, per definition of a canonical URL.
>> It is the archive se
> What you gain from my proposal over a pure Message-ID approach
> is guaranteed uniqueness given the list copy
Guarantee is a pretty strong word. A malicious person could post two
messages with the same message-id, same date, but different bodies.
Sometimes the channel between the MLM and the arc
Jeff Breidenbach writes:
> >So we just specify a header to put it in, and subscribers will be able
> >to use it, per definition of a canonical URL.
>
> It is the archive server's job to decide what is the "canonical" URL
> for a message. There's a good chance these archival URLs will be
> s
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 24, 2007, at 12:31 PM, Jeff Breidenbach wrote:
>> What complexity? Mailman just does
>>
>> msg['X-List-Archive-Received-ID'] = Email.msgid()
>
> Easy to introduce, harder to deal with. The archival server would now
> keep track of both the me
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 24, 2007, at 2:56 AM, Stephen J. Turnbull wrote:
> I simply think we should be prepared for applications where relying on
> the sender to supply a UUID is not acceptable; we need to be able to
> provide one ourselves. Creating UUIDs is a solve
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 24, 2007, at 2:02 AM, Jeff Breidenbach wrote:
> Which brings me to suggestion #2, which is go ahead and write
> an RFC on how list servers should embed archival links in messages.
> This sounds like an internet wide interoperability issue as mu
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 22, 2007, at 12:33 PM, Terri Oda wrote:
> On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
>> I've looked at a few lurker archivers and I wasn't blown away by its
>> user interface. That's apparently highly configurable though.
>
> I've been doin
> Regardless of whether we *need* to generate our own unique ID, I'm
> leaning towards the thought that we're going to *want* to generate
> our own for usability reasons. In a perfect world, I think we'd have
> a sequence number so I could visit http://example.com/mailman/
> archives/listname/204.
On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:
>> So we just specify a header to put it in, and subscribers will be
>> able
>> to use it, per definition of a canonical URL.
> It is the archive server's job to decide what is the "canonical" URL
> for a message. There's a good chance these arch
Jeff Breidenbach wrote:
> In addition, Barry was talking about concocting a unique
> identifier from the Date field and Message-ID. I'm not a big fan of
> this idea, because the date field comes from the mail user agent
> and is often wildly corrupt; e.g. coming from 100 years in the future.
Oh--I
There are three different parties coming to the table. One is
the mail transfer agent of the sender, another is the list server,
and the third is the archive server. Ideally, all three will be happy
campers.
>So we just specify a header to put it in, and subscribers will be able
>to use it, per de
John A. Martin writes:
> >> better to go ahead and use the message-id, rather than concoct
> >> yet another "this time we mean it!" unique identifier.
>
> st> That's not the point. We're not going to impose this on
> st> senders;
>
> I read the quote as meaning "this time
>>>>> "st" == Stephen J Turnbull
>>>>> "Re: [Mailman-Developers] Improving the archives"
>>>>> Tue, 24 Jul 2007 15:56:35 +0900
st> Jeff Breidenbach writes:
>> > Notice that of 325146 total messages, 624 of t
Jeff Breidenbach writes:
> > Notice that of 325146 total messages, 624 of them had no message-id
> > header. Even if you aggregate dup+col, you're still looking at a
> > total duplicate rate of 0.29%.
>
> Message ID's are supposed to be unique.
Fortunately, a rule more honored in the obser
> Notice that of 325146 total messages, 624 of them had no message-id
> header. Even if you aggregate dup+col, you're still looking at a
> total duplicate rate of 0.29%.
Message ID's are supposed to be unique. This is discussed in
RFC 822: 4.6.1 and RFC 1036: 2.1.5, and probably other places.
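
A rough sketch of how counts like these could be gathered, using only the standard `email` package. It is not the script used for the numbers in this thread; `collision_stats` and its result keys are hypothetical names. It also counts the case asked about elsewhere in the thread: messages that share a Message-ID but not a Date.

```python
from collections import defaultdict
from email import message_from_string


def collision_stats(raw_messages):
    """Count missing and repeated Message-IDs in an iterable of raw messages."""
    dates_by_id = defaultdict(list)
    missing = 0
    for raw in raw_messages:
        msg = message_from_string(raw)
        message_id = (msg["Message-ID"] or "").strip()
        if not message_id:
            missing += 1
            continue
        dates_by_id[message_id].append((msg["Date"] or "").strip())
    # Extra messages whose Message-ID was already seen.
    repeated = sum(len(d) - 1 for d in dates_by_id.values())
    # Extra distinct Date values per Message-ID: messages that an ID-only
    # key would merge but an ID-plus-Date key would keep apart.
    differing_date = sum(len(set(d)) - 1 for d in dates_by_id.values())
    return {
        "missing": missing,
        "repeated": repeated,
        "differing_date": differing_date,
    }
```

Dates are compared as raw header strings here, so two spellings of the same instant would count as different; that is a simplification.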
Terri Oda wrote:
> I've been doing a lot of thinking about interface, and I'm coming to
> the conclusion that something more like a web bulletin board is
> probably the way to go
For public lists, the answer may lie in external tools like nabble.com
or mailinglistarchive.com
Of course, that
On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
> I've looked at a few lurker archivers and I wasn't blown away by its
> user interface. That's apparently highly configurable though.
I've been doing a lot of thinking about interface, and I'm coming to
the conclusion that something more like a web
On Fri, Jul 20, 2007 at 11:16:19AM -0400, Barry Warsaw wrote:
> Cool. I wonder if lurker is compatible with Python 2.5's
> mailbox.Maildir implementation and whether the two could share the
> maildirs. Thanks for the information!
It had better be -- Maildir has a published specification. If
Barry Warsaw writes:
> But it would have to be subject to the same bounce rules as any other
> auto-response which could be used as a spam vector, e.g. limit the
> number of bounces per time period and don't include the entire
> original message in the bounce
But that prevents detecting
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 20, 2007, at 10:59 AM, Nigel Metheringham wrote:
> On 20 Jul 2007, at 15:52, Barry Warsaw wrote:
>> Mailman gets the From_ line before passing off to the archiver.
>> But that's interesting, does lurker /require/ the From_ line?
>>
>
> Well lur
On 20 Jul 2007, at 15:52, Barry Warsaw wrote:
> Mailman gets the From_ line before passing off to the archiver.
> But that's interesting, does lurker /require/ the From_ line?
>
Well lurker handles Maildir - no From_ but the same info is in the
filename, and it can take messages on stdin wit
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
Hi Nigel,
On Jul 20, 2007, at 10:38 AM, Nigel Metheringham wrote:
> On 20 Jul 2007, at 15:26, Barry Warsaw wrote:
>>> BTW lurker gives all messages an ID which is 3 parts separated by
>>> periods. The first part is a date field - ie 20070720, the sec
On 20 Jul 2007, at 15:26, Barry Warsaw wrote:
>> BTW lurker gives all messages an ID which is 3 parts separated by
>> periods. The first part is a date field - ie 20070720, the second
>> part is the receive time, UTC, as 6 digits, and the final part
>> is some form of hex id. The nice part is if y
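
To make the three-part format concrete, here is a small sketch that splits such an ID back into its receive time and hex suffix. It follows only the description above; `parse_lurker_style_id` is a hypothetical helper, not lurker code.

```python
from datetime import datetime, timezone


def parse_lurker_style_id(message_id):
    """Split 'YYYYMMDD.HHMMSS.hex' into (UTC receive time, hex part)."""
    date_part, time_part, hex_part = message_id.split(".")
    received = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    received = received.replace(tzinfo=timezone.utc)
    int(hex_part, 16)  # raises ValueError if the last part is not hex
    return received, hex_part
```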
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 20, 2007, at 9:17 AM, Nigel Metheringham wrote:
>
> On 20 Jul 2007, at 13:39, Barry Warsaw wrote:
>> I've looked at a few lurker archivers and I wasn't blown away by its
>> user interface. That's apparently highly configurable though.
>
> I'd
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 20, 2007, at 9:31 AM, Stephen J. Turnbull wrote:
> Barry Warsaw writes:
>
>> Second, things can happen to a list
>> that might cause this sequence number to get corrupted.
>
> Add an X-Mailman-Sequence-Number header if not already present.
>
>
On 20 Jul 2007, at 13:39, Barry Warsaw wrote:
> I've looked at a few lurker archivers and I wasn't blown away by its
> user interface. That's apparently highly configurable though.
I'd be inclined to agree wrt user interface. Documentation regarding
this, and anything else to do with lurker, app
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 20, 2007, at 9:21 AM, Stephen J. Turnbull wrote:
>> How likely is it that two messages with the same message-id and
>> date are /not/ duplicates?
>
> For message id generators that include a time-stamp in the generated
> id, approximately the s
Barry Warsaw writes:
> Second, things can happen to a list
> that might cause this sequence number to get corrupted.
Add an X-Mailman-Sequence-Number header if not already present.
That doesn't deal with your other comments, but as I point out
elsewhere, if you don't use *any* Mailman-specif
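
As a sketch of the "add it if not already present" rule, using the standard `email.message.Message` class: `add_sequence_header` and the way the counter is passed in are assumptions for illustration, not existing Mailman code.

```python
from email.message import Message

SEQUENCE_HEADER = "X-Mailman-Sequence-Number"


def add_sequence_header(msg, next_number):
    """Stamp msg with next_number unless it already carries a sequence header.

    Returns the number the next unstamped message should receive.
    """
    if SEQUENCE_HEADER in msg:
        # Already stamped (for example on an archive rebuild): leave it alone.
        return next_number
    msg[SEQUENCE_HEADER] = str(next_number)
    return next_number + 1
```

Because an existing header is never overwritten, re-feeding old messages through the same step would keep their original numbers.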
Barry Warsaw writes:
> First, I want to avoid talking about file system layout. To me,
> that's an implementation detail we needn't worry about right now.
Agreed.
> How likely is it that two messages with the same message-id and
> date are /not/ duplicates?
For message id generators t
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 9, 2007, at 11:09 PM, Stephen J. Turnbull wrote:
> John A. Martin writes:
>
>> In the absence of a Message-ID
>> on an outgoing mail message many if not most MTAs will add one. Why
>> not let Mailman anticipate the need to add a Message-ID whe
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 8, 2007, at 1:06 AM, Paul Wise wrote:
> My personal opinion is that pipermail should be removed and mailman
> should not contain a default archiver since there are plenty of good
> archivers already (lurker, mhonarc etc). Adding wrappers around
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 5, 2007, at 12:09 PM, John Dennis wrote:
> A little over a year ago I went on a search to find the best open
> source
> archiver and at that time I came up with Lurker
> (http://lurker.sourceforge.net) Since then I believe Lurker has seen a
>
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 4, 2007, at 3:30 PM, Jeff Breidenbach wrote:
>> Maybe a way to think about this is that the canonical url is based on
>> the message-id, but then there's some way to distill even this down
>> to a tinyurl or simple integer that would be stable
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 4, 2007, at 1:16 PM, Dale Newfield wrote:
> Barry Warsaw wrote:
>> Maybe a way to think about this is that the canonical url is based on
>> the message-id, but then there's some way to distill even this down
>> to a tinyurl or simple integer th
John A. Martin writes:
> In the absence of a Message-ID
> on an outgoing mail message many if not most MTAs will add one. Why
> not let Mailman anticipate the need to add a Message-ID when archiving
> the message rather than leaving it to the outgoing MTA?
Quite.
My reason for saying "last
On 7/3/07, Terri Oda <[EMAIL PROTECTED]> wrote:
> I'm trying to remember all the things people have suggested for the
> archives in the past so I can figure out what needs to be done and
> what might be nice to have, and see if this is doable in the time I
> have in the foreseeable future.
At lis
On 5-Jul-07, at 12:09 PM, John Dennis wrote:
> A little over a year ago I went on a search to find the best open
> source
> archiver and at that time I came up with Lurker
> (http://lurker.sourceforge.net) Since then I believe Lurker has seen a
> major new revision. I also believe Lurker is the a
On Tue, 2007-07-03 at 20:05 -0400, Barry Warsaw wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:
>
> > Since I've largely finished up the coding contract that was eating up
> > a lot of my time, I'm thinking that I'd like to do some coding
>In which case [the message body link] would be set to something like.
>
>http://third-party-service/[EMAIL PROTECTED]
Just for fun, I did a trial implementation. It works, but the URLs are
too long.
For example, the URL below spends 59 characters on the message-id, and
27 characters on the listnam
>Maybe a way to think about this is that the canonical url is based on
>the message-id, but then there's some way to distill even this down
>to a tinyurl or simple integer that would be stable in the face of
>full archive regenerations.
I'd suggest the reverse. Keep the canonical archive URL shor
I'm all for someone taking ownership of this long-neglected component --
thank you for doing so!
Barry Warsaw wrote:
> Maybe a way to think about this is that the canonical url is based on
> the message-id, but then there's some way to distill even this down
> to a tinyurl or simple integer t
>>>>> "st" == Stephen J Turnbull
>>>>> "Re: [Mailman-Developers] Improving the archives"
>>>>> Wed, 04 Jul 2007 16:49:58 +0900
st> The main drawback to using Message IDs that I can see is that
st> broken MUAs may s
Barry Warsaw writes:
> > - archive links that won't break if the archive is rebuilt
>
> Yes, this is absolutely critical, in fact, I'd put it right at the
> top of the list, even more so than a u/i overhaul. Stable urls, with
> backward compatible redirecting links if at all possible, wo
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
Steve makes me think of a couple of other wish list items.
On Jul 3, 2007, at 7:36 AM, Steve Huston wrote:
> On 7/2/07 11:06 PM, Terri Oda wrote:
>> - better address obfuscation (maybe by generating pages through cgi)
>
> I run a few Wordpress sites,
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:
> Since I've largely finished up the coding contract that was eating up
> a lot of my time, I'm thinking that I'd like to do some coding for
> fun. And nothing says fun like trying to fix the Mailman arch
I'll admit to not having read previous discussions on this topic, but
I'll also add my 2 cents here:
On 7/2/07 11:06 PM, Terri Oda wrote:
> - better address obfuscation (maybe by generating pages through cgi)
I run a few Wordpress sites, and there's a plugin I use called
PHPEnkoder which does a good j