Re: Archiving by month

1999-09-09 Thread Jeff Breidenbach


Hi Earl,

I think the best solution is for HtDig configuration to become more
flexible, and that is part of their development plan. If there are
problems in the meantime, I'll probably go ahead and patch
MHonArc. That's probably better aesthetically than cluttering MHonArc's
feature space with the workaround.

Jeff



Re: Archiving by month

1999-09-09 Thread Earl Hood

On September 8, 1999 at 05:06, Jeff Breidenbach wrote:

> 4) Htdig will now index the attachments. I didn't really want this, but
>I also don't want the administrative headache of running a patched
>MHonArc or a patched htdig, which is required for this sort of 
>functionality. I'll revisit this either when a future version of
>htdig becomes available, or if it turns out to be a problem.

I can add back the .dir extension to the mhexternal.pl filter.  I can
add the option "nosubdirext" for it so users that do not want it
have a way to remove the extension w/o patching the code.  I know the
original reason, as mentioned by a user, for removing the .dir extension
was due to the stupidity of other software, but I figured there would
be no harm to remove the .dir.

I can create a minor update release of MHonArc, which will include
some other modifications beside the .dir thing, in the next couple
of days if you desire.

--ewh



Re: Archiving by month

1999-09-08 Thread Jeff Breidenbach


So what's going on?

1) Paul's latest date patches have been applied. (Thanks, Paul!)

2) A new rcfile for [EMAIL PROTECTED] has been added, and the
   misdated raw email has been erased. The list will be rebuilt,
   but not until I take care of a disk which is now at 99%
   capacity.

3) A bunch of names were added to the bottom of the FAQ for
   acknowledgements, but I'm sure I forgot several. Who did
   I forget to include?

4) Htdig will now index the attachments. I didn't really want this, but
   I also don't want the administrative headache of running a patched
   MHonArc or a patched htdig, which is required for this sort of 
   functionality. I'll revisit this either when a future version of
   htdig becomes available, or if it turns out to be a problem.

5) For those interested, the service is now getting around 30,000 page
   views today. While much of that was altavista and friends
   doing indexing, a reasonable chunk is real people finding 
   information that they needed. It is a very good feeling knowing
   that your work is being used.




Re: [htdig] Re: Archiving by month

1999-09-07 Thread Geoff Hutchison

At 10:18 PM -0500 9/6/99, Jeff Breidenbach wrote:

>If I could say "Ignore everything that does not end in .html" or
>"only index URLs with a certain regexp" that would do the trick.
>But with the current configuration options, I just don't see how to do
>this.

There's a patch in the patch archive for restricting based on *including*
only certain extensions. I can't remember the URL offhand...

The 3.2 codebase has both this, as well as full regexp for restricting
indexing and searching. However, I would prefer not to backport anything
since we're nearing a 3.1.3 release, as well as a 3.2.0b1 release (the
former in the next few days, the latter probably by the end of the month).


-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/




Re: Archiving by month

1999-09-06 Thread Jeff Breidenbach


Hi htdig folks,

I'm having a bit of a problem getting what I want from the htdig
configuration options. Lots of people, myself included, use htdig in
conjunction with MHonArc. In the current release version of MHonArc
(2.4.3, which I recently upgraded to) attachments may be stored in
subdirectories as follows:

The first URL is the message, while the second is the attachment.
No need to follow the links, just look at their structure.

http://mail-archive.com/sinister%40majordomo.net/1997-month-08/msg00174.html
http://mail-archive.com/sinister%40majordomo.net/1997-month-08/msg00174/The_state_i_am_in.txt

My question is, using the current stable version of htdig, how
can I configure it to ONLY index messages, and not index attachments?
If I could say "Ignore everything that does not end in .html" or
"only index URLs with a certain regexp" that would do the trick. 
But with the current configuration options, I just don't see how to do
this. 
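
To make the wanted rule concrete, here is a minimal shell sketch of the
selection, not htdig configuration. The function name is_message_url is
made up for illustration, and it assumes message pages are always named
msg<digits>.html as in the two URLs above.

```shell
#!/bin/sh
# Hypothetical helper (not part of htdig or MHonArc): succeed only for
# URLs whose last path component looks like a MHonArc message page.
is_message_url() {
    case "${1##*/}" in
        msg[0-9]*.html) return 0 ;;   # e.g. msg00174.html
        *)              return 1 ;;   # anything else, e.g. a .txt attachment
    esac
}

for url in \
    "http://mail-archive.com/sinister%40majordomo.net/1997-month-08/msg00174.html" \
    "http://mail-archive.com/sinister%40majordomo.net/1997-month-08/msg00174/The_state_i_am_in.txt"
do
    if is_message_url "$url"; then
        echo "index: $url"
    else
        echo "skip:  $url"
    fi
done
```

An attachment that itself happens to be called msg<digits>.html would
still slip through this test, so it only approximates "messages only".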

Thanks in advance for enlightenment.

Jeff




Re: Archiving by month

1999-09-06 Thread Earl Hood

On September 5, 1999 at 13:49, Jeff Breidenbach wrote:

> Paul, see how attachments end up in subdirectories, for example
> http:[EMAIL PROTECTED]/msg00459.html
> The default rcfile puts attachments in a subdirectory, with a .dir
> extension. You are probably overriding the MIMEArgs directive, or
> perhaps .html attachments are treated differently.

I think the problem is a change in mhexternal.pl of MHonArc in
the naming of attachment subdirectories.  It appears I did not
mention it in CHANGES, but here is the SCCS delta comment on it:

D 2.7 99/06/25 13:59:18+05:00 [EMAIL PROTECTED] 22 21  3/3/228
P /home/ehood/work/perl/MHonArc/lib/mhexternal.pl
C Removed addition of ".dir" to subdir.

According to the date, it was applicable for v2.4.0 or v2.4.1.  I
cannot remember the exact reason for the change, but some user had
problems with the ".dir" so I figured no harm (ha ha) would occur
if I removed the ".dir".

I do not know how htdig works, but can it index a specified list of
file types (eg: .html, .txt), or can you specify a regex/glob mask
(or match) to control indexing?

--ewh



Re: Archiving by month

1999-09-06 Thread PS Mitchell

Hi Jeff,

I'm sorry you had problems with my previous date code - hope it hasn't
caused you much grief.  I just didn't realise there were mail clients out
there which produced malformed date strings (well, ones that date(1) can't
read at least).

The last patch to mailme I sent was (again) incorrect as it didn't
take care of multiple Received: lines, and the junk on a Received line
before the semicolon.  I've attached a patch (again from your original
source) which deals with this (it uses the top Received: line which I think
is safest), and also skips the date code completely unless it's a monthly
list, just so that any problems don't affect non-monthly lists at all.  But
providing x-archive-with-list is formatted ok, or Received:  lines are
sensible (which come from your server so they are), malformed dates
hopefully should be a thing of the past.
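
As a standalone illustration of the Received: handling described above
(top header only, continuation lines folded in, junk before the
semicolon dropped), here is a minimal sketch. It is separate from the
attached patch; the function name first_received_date is an assumption
made for this example.

```shell
#!/bin/sh
# Print the date part of the topmost Received: header read from stdin.
# Continuation lines (leading blank or tab) are folded into the header,
# and everything up to the last semicolon is discarded.
# Fails (exit status 1) if there is no Received: header or no semicolon.
first_received_date() {
    awk '
        /^Received:/ && !seen { seen = 1; hdr = $0; next }
        seen && /^[ \t]/      { hdr = hdr " " $0; next }
        seen                  { exit }
        END {
            if (!seen || hdr !~ /;/) exit 1
            sub(/^.*;[ \t]*/, "", hdr)   # drop the routing text before the date
            gsub(/[ \t]+/, " ", hdr)     # collapse folding whitespace
            print hdr
        }
    '
}

# Example usage: cat messageheaders | first_received_date
```

The result can then be handed to date(1) for validation in the same way
the patch below validates JUSTDATE.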

With regard to attachments: I copied over your own MIMEargs (just added a
"target="), and I've gone back and regenerated my archives here with your
rcfile alone: and yes, I think it's with text/plain attachments that the
msg* subdirectory gets created.  I don't know why they don't tack
.dir on the end for these attachments, at least here, but I'm sure
it's not a worry.

Paul


--- mailme.jeff Mon Sep  6 21:17:09 1999
+++ mailme  Mon Sep  6 21:01:19 1999
@@ -103,6 +103,25 @@
 } '
 }
 
+# Special case of grab() for Received: lines
+# This is because we want to take first occurrence
+# Example usage: cat messageheaders | receiveme
+receiveme() {
+$NAWK '
+BEGIN {
+m = "^Received:"
+}
+{
+if (match($0, m)) {
+print $0
+getline
+while ( $0 ~ /^[ \t]+/) {
+print $0; getline
+}
+}
+} '
+}
+
 # Get all email addresses, and precede them with a carat.
 # Example usage: cat RMAIL | waterfall
 waterfall () {
@@ -318,10 +337,6 @@
 X13=`echo "$T" | grab "mailing-list"` #use later
 X14=`echo "$T" | grab "list-post"`
 
-# If indexing by month, we care about the date
-DATE=`echo "$T" | grab "date"`
-JUSTDATE=`echo $DATE | sed 's/^date: //i'`
-
 # Extract email addresses
 CHANCE=$(echo $TO $CC $X1 $X2 $X3 $X4 $X5 $X6 $X7 $X8 $X9 $X10 $X11 $X14 |\
 waterfall)
@@ -453,6 +468,29 @@
 MONFLAG=$HOME/vault/$ESCAPED_NAME/monthly
 if [ -f $MONFLAG ]
 then
+
+# If indexing by month, we care about the date
+# If importing, see if x-archive-with-date is set first.
+# Use "Date:" as last resort because of mis-set clocks.
+   XDATE=`echo "$T" | grab "x-archive-with-date"`
+   [ ! "$XDATE" ] && XDATE=`echo "$T" | receiveme`
+   [ ! "$XDATE" ] && XDATE=`echo "$T" | grab "date"`
+   if [ ! "$XDATE" ]
+   then
+   emergency_divert NODATE "Unable to find any date field."
+   exit -1
+   fi
+   JUSTDATE=`echo "$XDATE" | sed -e 's/^date: //i' \
+  -e 's/^x-archive-with-date: //i' \
+  -e 's/^received:.*; //i'`
+   date -d "$JUSTDATE" > /dev/null 2>&1
+   ex=$?
+   if [ "$ex" != "0" ]
+   then
+   emergency_divert NODATE "Unable to find valid date field."
+   exit -1
+   fi
+
 echok info "Now switching to monthly indexing"
 MM=`date -d "$JUSTDATE" +"%Y-month-%m"`
 MONBEG=`date -d "$JUSTDATE" +"%m/01/%Y:00:00:00.00"`



Re: Archiving by month

1999-09-05 Thread Jeff Breidenbach


First, sorry to all the folks who are planning to unsubscribe from
gossip due to the increase of traffic. My guess is things will die
down again within a week or three.



Paul, see how attachments end up in subdirectories, for example
http:[EMAIL PROTECTED]/msg00459.html
The default rcfile puts attachments in a subdirectory, with a .dir
extension. You are probably overriding the MIMEArgs directive, or
perhaps .html attachments are treated differently.


text/plain; maxwidth=87
m2h_external::filter; usename useicon subdir
iconurl="../attachment.gif"


So -- sorry for the earlier email typo regarding digger. The solution
is going to be the following, and we can patch up your rcfile
to use subdirectories (if needed) a bit later.

   echo "limit_urls_to:$TARGET/$MAILLIST/"  >> $CFG
   echo "exclude_urls: .mhonarc.db .htaccess .dir"  >> $CFG

>Attached is b1), then.  I've added some sanity checking for the date
>field so it should be pretty robust.

Good, because utterly corrupt date fields are starting to litter
my error logs with complaints from 'date'. And it's not obvious that
mailme will do the right thing when 'date' fails. (Guess that's
why it's considered an experimental feature!)

>I won't hack bounce.pl for fear of doing wrong, but yes I'm pretty
>sure I agree now that it should set x-archive-with-date from the
>original Received: fields if possible.

It would also have to be intelligent enough to make sure its output
is reasonable and machine readable, even if the input is less than
clear. Additionally, it would need to work in such a way that if it
was run twice over the same message, it would not mess up. (i.e.
just preserve the x-archive-with-date if it is already there)
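
The "run twice without messing up" requirement can be sketched as a
small filter. bounce.pl itself is Perl and is not shown in this thread,
so this is only a shell illustration of the idea; the function name
add_archive_date is hypothetical.

```shell
#!/bin/sh
# Copy a message from stdin to stdout, adding an X-Archive-With-Date
# header at the end of the header block only if none is present yet.
# $1 is the date string to record.
add_archive_date() {
    awk -v stamp="$1" '
        !body && tolower($0) ~ /^x-archive-with-date:/ { have = 1 }
        !body && /^$/ {
            # First blank line: the headers are over.
            if (!have) print "X-Archive-With-Date: " stamp
            body = 1
        }
        { print }
        END {
            # Message with headers only and no blank line.
            if (!body && !have) print "X-Archive-With-Date: " stamp
        }
    '
}

# Example usage: add_archive_date "Mon, 6 Sep 1999 21:17:09 +0100" < message
```

Running the filter a second time leaves the message unchanged, because
the existing header is detected before the blank line is reached.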

Jeff



Re: Archiving by month

1999-09-05 Thread PS Mitchell

Jeff Breidenbach said:

> Solution sets that I see are:
> 
>   b1) ask mailme to do monthly sorts off of x-archive-with-date
>   headers primarily, received headers secondarily
>   b2) modify bounce.pl to generate x-archive-with-date based on
>   received headers.

Attached is b1), then.  I've added some sanity checking for the date
field so it should be pretty robust.  With this implemented, it means
that mailme expects bounce.pl (or equivalent) to set the
x-archive-with-date field intelligently, and doesn't do any second
guessing, which is how I think it should be.  If it doesn't find
x-archive-with-date, it looks for Received:, and if that's not set it
falls over to Date:, which is the way Jeff's rcfile deals with dates
too, so is consistent.  So now mailme and rcfile agree that, if
you're importing, you HAVE to set an intelligent x-archive-with-date
field; else, they both use Received:, which goes by the clock on
Jeff's server.
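
The precedence just described can be shown in isolation. This is a
minimal sketch rather than the patch itself: it only reports which
header would win, and the name pick_date_source is an assumption made
for the example.

```shell
#!/bin/sh
# Read message headers on stdin and print the name of the header that
# would supply the archive date: x-archive-with-date first, then
# received, then date. Fails if none of the three is present.
pick_date_source() {
    awk '
        /^$/ { exit }                       # blank line ends the headers
        { h = tolower($0) }
        h ~ /^x-archive-with-date:/ { x = 1 }
        h ~ /^received:/            { r = 1 }
        h ~ /^date:/                { d = 1 }
        END {
            if      (x) print "x-archive-with-date"
            else if (r) print "received"
            else if (d) print "date"
            else        exit 1
        }
    '
}

# Example usage: cat messageheaders | pick_date_source
```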

I won't hack bounce.pl for fear of doing wrong, but yes I'm pretty
sure I agree now that it should set x-archive-with-date from the
original Received: fields if possible.

Paul


--- mailme.jeff Sun Sep  5 14:08:19 1999
+++ mailme  Sun Sep  5 15:25:07 1999
@@ -319,8 +319,26 @@
 X14=`echo "$T" | grab "list-post"`
 
 # If indexing by month, we care about the date
-DATE=`echo "$T" | grab "date"`
-JUSTDATE=`echo $DATE | sed 's/^date: //i'`
+# If importing, see if x-archive-with-date is set first.
+# Use "Date:" as last resort because of mis-set clocks.
+XDATE=`echo "$T" | grab "x-archive-with-date"`
+[ ! "$XDATE" ] && XDATE=`echo "$T" | grab "received"`
+[ ! "$XDATE" ] && XDATE=`echo "$T" | grab "date"`
+if [ ! "$XDATE" ]
+then
+   emergency_divert NODATE "Unable to find any date field."
+   exit -1
+fi
+JUSTDATE=`echo "$XDATE" | sed -e 's/^date: //i' \
+  -e 's/^x-archive-with-date: //i' \
+  -e 's/^received: //i'`
+date -d "$JUSTDATE" > /dev/null 2>&1
+ex=$?
+if [ "$ex" != "0" ]
+then
+   emergency_divert NODATE "Unable to find valid date field."
+   exit -1
+fi
 
 # Extract email addresses
 CHANCE=$(echo $TO $CC $X1 $X2 $X3 $X4 $X5 $X6 $X7 $X8 $X9 $X10 $X11 $X14 |\



Re: Archiving by month

1999-09-05 Thread PS Mitchell

Jeff said:

> I took a closer look. Attachments are never indexed because of
> 
> echo "exclude_urls: .dir">> $CFG

I'm confused here: attachments don't appear in anything called *.dir,
but in a subdirectory called msg? for the relevant message - see

http://www.mail-archive.com/sinister%40majordomo.net/1997-month-08/msg00174.html

Certainly on my home setup they get indexed.  If I run:

htdig -a -c /etc/htdig/sinister_majordomo_net.conf -vvv

it shows them being indexed here.  And yes, we could restrict it to just
html files, but again, because limit_urls_to is so hopeless, it would
mean throwing away the directory check I think, because all you could
put was ".html".

> So I now feel safe doing your original
> 
> -echo "limit_urls_to:$TARGET/$MAILLIST/"  >> $CFG
> +echo "limit_urls_to:$TARGET/$MAILLIST/msg"   >> $CFG

- but the other way round, surely?  My change no. 1 yesterday was just to
take the "msg" away - a bit of a dirty kludge, but it preserved path info,
thus my patch no. 2 which just completely reproduced your limits (including
"msg") but for all named subdirectories too.  So patch #2 I sent more
exactly traces what you did for non-monthlies, with the cost of a little
more complexity.  I'm sure either would do, as long as
"$TARGET/$MAILLIST/msg" itself isn't hardcoded in, as that will break
monthlies because it misses all the monthly subdirectories.

> By the way, index pages are already protected
> from htdig in exactly the correct fashion (don't index the page,
> but do follow the links) by the META tag at the top of the page.

Ah yes, I missed that - you'll maybe want to add that to the
monthme-generated page too then?

Re. dates:

> Solution sets that I see are:
> 
>   a1) ask mailme to do monthly sorts off of received headers
>   a2) find another way to do imports
> 
>   b1) ask mailme to do monthly sorts off of x-archive-with-date
>   headers primarily, received headers secondarily
>   b2) modify bounce.pl to generate x-archive-with-date based on
>   received headers.
> 
> What I don't like is the extra complexity that all this will entail,
> and also the fact that it would require duplicating some of the
> functionality already found in MHonArc. Again, I'm going to ignore
> this all for a while and worry about installing bigger disks.

All I'd say is I don't think it's mailme's job to sort this out: the
problem lies in the data source, and mailme can't really and shouldn't
second-guess what the data *really* meant.  Previous discussions on gossip
have tended strongly to sorting order as:

x-archive-with-date:received:date

which archives new data "correctly" (relies on your local server
clock) but allows imported data to be dealt with specially if the
list manager's prepared to make an effort.  So I'd strongly favour
b), and yes, why not make bounce.pl use the original Received:
headers.  Good plan, I wish I had :)  If anyone wants a lesson in why
relying on Date: isn't a good idea on a reasonably sized (1000
members) list, visit 

http://www.mail-archive.com/sinister%40majordomo.net/

I'll sort it out when my list's up to date Jeff, I promise.

Paul



Re: Archiving by month

1999-09-04 Thread Jeff Breidenbach


I took a closer look. Attachments are never indexed because of

echo "exclude_urls: .dir">> $CFG

so I now feel safe doing your original

-echo "limit_urls_to:$TARGET/$MAILLIST/"  >> $CFG
+echo "limit_urls_to:$TARGET/$MAILLIST/msg"   >> $CFG

and I've now done so. By the way, index pages are already protected
from htdig in exactly the correct fashion (don't index the page,
but do follow the links) by the META tag at the top of the page.

Jeff



Re: Archiving by month

1999-09-04 Thread Jeff Breidenbach


The problem has the following facets:

 Computers with misset clocks screw up 'Date:'
 MH's 'pick -before/-after' command probably uses 'Date:'
 Imports emphasize the 'Date:' header
 Importing by bounce adds an unhelpful 'Received:' header.

Solution sets that I see are:

  a1) ask mailme to do monthly sorts off of received headers
  a2) find another way to do imports

  b1) ask mailme to do monthly sorts off of x-archive-with-date
  headers primarily, received headers secondarily
  b2) modify bounce.pl to generate x-archive-with-date based on
  received headers.

What I don't like is the extra complexity that all this will entail,
and also the fact that it would require duplicating some of the
functionality already found in MHonArc. Again, I'm going to ignore
this all for a while and worry about installing bigger disks.

>Yes, htDig going somewhere it shouldn't might be a problem, I can see
>that.  But Jeff, hasn't it been indexing attachments anyway already?
>limit_urls_to was set to foo/msg which is where the attachments are stored
>too, isn't it?

Nope. Attachments are placed in subdirectories; this is specified in
the default rcfile. As far as fancy htdig configuration tricks, I've
found the htdig users mailing list extremely helpful ([EMAIL PROTECTED])
so they may be able to help. Maybe we can restrict htdig to messages
only by telling htdig to only read .html files?

>That's ok as long as you don't mind me throwing these patches at you
>from time to time.  I think we're nearly there, honestly.

Patches are always appreciated.

Cheers,
Jeff



Re: Archiving by month

1999-09-04 Thread PS Mitchell

Hi Jeff,

First off, I made a silly mistake in monthme - producing links to
html in subdirectories of course doesn't work, as none of the
relative links work.  This is quickly solved by linking to the
directory instead which makes a lot more sense anyhow.  The patch
I've attached (see below) fixes this.

> I applied your patch to the monthly index page generator -- and it
> looks like quite an improvement. However, relying on the 'Date:'
> header is definitely hurting, and something is inconsistent.  This may
> take some work, possibly including perhaps bounce.pl modifications.
> See:
> 
> http://www.mail-archive.com/sinister@majordomo.net/1904-month-06/maillist.html
> http://www.mail-archive.com/sinister@majordomo.net/1904-month-06/msg0.html

I just saw these and nearly died - actually the code's doing what it
should: these dates were actually *in* the mails - misconfigured clients.
I'd skimmed through the discussion in gossip on "received date" and
stupidly assumed I hadn't a problem with my list.  Those who don't learn
from history are doomed to repeat it.  So I don't think it's a code
problem, and goes back instead to the question of how to manage imports.
Maybe bounce.pl could do a sanity check on "Date:" vs. "Received:"
within some bounds (Michael?) but that's all I can suggest.  Right now I
can't see anywhere the mailme/monthme code is flaky in its date handling,
although I could be missing something.

Jeff - sorry for these and I think it's best to let my list accumulate now
- I'll try and deal with it later and maybe ask you just to zap those, if I
can make it easy for you to do: I'll send a simple script or something if
that's ok.

> The patches to digger/rcfile are good, but I am concerned about htdig
> trying to index attachments, which I don't want it to do.  I'm afraid
> of htdig getting bogged down with some weird mime type.  Thus, I
> haven't applied the patches until this issue gets
> resolved. Suggestions?

Yes, htDig going somewhere it shouldn't might be a problem, I can see
that.  But Jeff, hasn't it been indexing attachments anyway already?
limit_urls_to was set to foo/msg which is where the attachments are stored
too, isn't it?

But looking back I think you're right that my solution (removing "msg"!)
isn't very robust.  In case anyone's following this, the problem is how to
ask htdig to index the following directories under a root starting point of
maillist.html:

msg*.html
-month-??/msg*.html

You can add multiple "patterns" to limit_urls_to, but you can't use
wildcards ("patterns"??) and multiple strings are or'd, not and'd.  And you
don't know how many -month-??'s there are.  And if you just set it to
"msg", you run the risk of a list with "msg" in its name.  You also can't
do anything creative with a combination of "limit_urls_to" and
"limit_normalized" (I tried).  Oh and you can't use "exclude_urls" because
you don't know index file names for customised lists and whatever else
might end up in there.
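
The limitation is easier to see with a toy model. The sketch below
assumes, as described above, that limit_urls_to accepts a URL when any
one of the listed strings occurs somewhere inside it, with no wildcards.
url_allowed is a made-up shell function, not an htdig command.

```shell
#!/bin/sh
# url_allowed URL STRING...
# Succeed if any STRING occurs literally inside URL (strings are OR-ed).
url_allowed() {
    url=$1
    shift
    for pat in "$@"; do
        case "$url" in
            *"$pat"*) return 0 ;;
        esac
    done
    return 1
}

base=http://localhost/list

# The original limit finds flat-archive messages ...
url_allowed "$base/msg00001.html" "$base/msg" && echo "flat message: indexed"
# ... but misses everything under a monthly subdirectory.
url_allowed "$base/1999-month-09/msg00001.html" "$base/msg" || echo "monthly message: missed"
# Dropping "msg" brings the monthlies back, attachments included.
url_allowed "$base/1999-month-09/msg00001/attach.txt" "$base/" && echo "attachment: indexed"
# A bare "msg" also matches a list that has "msg" in its name.
url_allowed "http://localhost/msg-talk/maillist.html" "msg" && echo "index page: indexed"
```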

I'm sending you two more diff's with my proposed solution - again diff's
from your original source.  I propose monthme writes out a list of files
for htdig to index, and htdig in digger uses its `...` option to include
them, along with your original pattern so monthlies don't break.  My
version of htdig doesn't seem to mind that the file doesn't exist for
non-monthly lists.
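
For readers without the diffs to hand, the list generation could look
roughly like the sketch below. The attached patch is the reference; this
is only an illustration, the function name monthly_prefixes is
hypothetical, and it assumes monthly subdirectories are named
YYYY-month-MM.

```shell
#!/bin/sh
# Print one URL prefix per monthly subdirectory of an archive directory,
# suitable for writing to a file that the htdig config then includes.
# $1 = archive directory, $2 = URL prefix for that list.
monthly_prefixes() {
    for d in "$1"/*-month-[0-9][0-9]; do
        [ -d "$d" ] || continue          # also skips an unexpanded glob
        printf '%s/%s/msg\n' "$2" "${d##*/}"
    done
}

# Example usage (the output file name is an assumption):
#   monthly_prefixes "$HOME/archive/$MAILLIST" "$TARGET/$MAILLIST" \
#       > "$HOME/vault/$ESCAPED_NAME/htdig-urls"
```

For a non-monthly list the loop finds nothing and the output is empty.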

> Finally, I won't be able to work on these things myself as my current
> priority is installing additional disk space.

That's ok as long as you don't mind me throwing these patches at you
from time to time.  I think we're nearly there, honestly.

Paul


--- ../bin/monthme.jeff Sat Sep  4 17:44:09 1999
+++ monthme Sun Sep  5 00:30:03 1999
@@ -29,6 +29,8 @@
 MAILLIST=$1
 NICKNAME=$2
 
+TARGET=http://localhost
+
 #
 ###   Action  ###
 #
@@ -41,9 +43,18 @@
 CTRAIL=$CONFDIR/trailer-monthly.html
 [ -f $CHEAD ] && cat $CHEAD >> $MONTHINDEX
 
+MAILLIST=$(echo $MAILLIST | awk '{ print tolower($1) }')
+ESCAPED_NAME=$(echo $MAILLIST | tr '@.' '__')
+
 # Start off the page
 cat >> $MONTHINDEX <$NICKNAME mailing list:
+Monthly index for the $NICKNAME mailing list
+
+Latest Messages by Date
+
+Latest Messages by Thread
+
+By month:
 
 EOF
 
@@ -59,13 +70,43 @@
 # End month list and start search section
 cat >> $MONTHINDEX <
+Search $NICKNAME
+
 
 
 
-
-
-Restrict matched files
-
+
+Search options:
+
+
+Match: 
+All
+Any
+Boolean
+
+
+Format: 
+Long
+Short
+
+
+Sort by: 
+Score
+Date
+Name
+
+
+
+Results per page: 
+10
+20
+50
+100
+
+
+Restrict search to months:
+
+
 EOF
 
 # Optional button for searching within each month
@@ -75,11 +116,9 @@
 
 # Finish off search section
 cat >> $MONTHINDEX <
 
-
-
-
+
+
 
 
 
@@ -90,4 +129,16 @@
 ln -s $HOME/archive/$MAILLIST/maillist.html \
 $HOME/archive/$MAILLIST/index.html
 
-
+# Compile list of subdirs for htdig 

Re: Archiving by month

1999-09-04 Thread Jeff Breidenbach


Paul,

I applied your patch to the monthly index page generator -- and it
looks like quite an improvement. However, relying on the 'Date:'
header is definitely hurting, and something is inconsistent.  This may
take some work, possibly including perhaps bounce.pl modifications.
See:

http://www.mail-archive.com/sinister@majordomo.net/1904-month-06/maillist.html
http://www.mail-archive.com/sinister@majordomo.net/1904-month-06/msg0.html

The patches to digger/rcfile are good, but I am concerned about htdig
trying to index attachments, which I don't want it to do.  I'm afraid
of htdig getting bogged down with some weird mime type.  Thus, I
haven't applied the patches until this issue gets
resolved. Suggestions?

Finally, I won't be able to work on these things myself as my current
priority is installing additional disk space.

Jeff



Re: Archiving by month

1999-09-04 Thread PS Mitchell

Hi Jeff,

Some more mods to take care of monthly searching (which is currently
broken), and a couple of other suggested fixes.  I've attached patches for
the following (I've diff'd these with the latest source you sent me - let
me know if you want the full versions):

monthme:

I think it lost the ESCAPED_NAME variable somewhere when it was
externalised, which is why the searching wasn't working and it wasn't
incorporating your html for the results page.  I've also added all the
search options I can that htDig allows, so that complex searching can be
done.  Oh, and my "restrict" HTML code was broken, now fixed, so searching
by month should work.  It's still not very pretty, but it works, feel free
to make it look nicer.  Finally, I've added some rather inelegant code to
create links in the master directory to the "latest" indexes - I think
these will be necessary to give list managers a canonical URL which they
can reference which will go to the latest messages at any time - many users
will want to bookmark this I've found in the past.  I'm sure it can be done
in a neater way but I can't think how just now.
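
One possibly neater way to find the latest month, sketched here as an
alternative to the ls/grep/sort pipeline in the patch. It assumes the
subdirectories are named YYYY-month-MM, so that the shell's sorted glob
expansion is also chronological; latest_month_dir is a made-up name.

```shell
#!/bin/sh
# Print the path of the most recent YYYY-month-MM subdirectory of $1.
# Fails if there is no monthly subdirectory at all.
latest_month_dir() {
    latest=
    for d in "$1"/*-month-[0-9][0-9]; do
        [ -d "$d" ] && latest=$d      # glob is sorted, so the last one wins
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Example usage:
#   LATESTM=$(latest_month_dir "$HOME/archive/$MAILLIST") &&
#       ln -sf "$LATESTM/maillist.html" \
#           "$HOME/archive/$MAILLIST/latest-maillist.html"
```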

digger:

Very minor modification to cope with htDig indexing monthly lists.  I
didn't know htDig before I started this and I have to say I'm not impressed
with the configuration options: you can't include wildcards or "and"
clauses where you might want to ("limit_urls_to") as far as I can see, and
can't include string lists (only strings) where they would be really useful
("noindex_start") - the latter would mean you could use
automatically-generated MHonarc HTML tags to exclude parts of the generated
pages, but it won't let you.  To make sure htDig doesn't index irrelevant
information from index pages, I've also included a modified:

rcfile:

which extends your own use of the  tag to the index
pages too - otherwise no other changes in there.

I've had to add the above to my own rcfiles, so again I'll send you them in
a separate mail.  Sorry about the changes, these are to reflect the above
in my own files, and also some other fixes.

Finally Michael said:

> Have you considered also allowing a way to change archives
> when an archive accumulates a certain number of messages?  

I'm sure Jeff's thought about this and I did too a little: the main problem
seemed to me that a simple decision based on something like this might not
actually be appropriate for lots of lists, e.g. imagine a list that's been
going for 5 years and crosses the threshold but actually only has a handful
of messages per month -> loads of nearly empty indexes and a list manager
asking "why??".  I suspect a human would choose to go to monthly indexes
when (say) more than half the months had more than 20 (or 30 or 50)
messages, which does increase the complexity (and therefore breakability)
of Jeff's rather neat and simple code.  So I left anything like that out
for now.
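
Purely to illustrate the kind of rule a human might apply, here is a
minimal sketch of the "more than half the months are busy" test. It is
not something mailme does; the function name should_go_monthly and the
input format ("YYYY-MM COUNT" per line) are assumptions for the example.

```shell
#!/bin/sh
# Read lines of the form "YYYY-MM COUNT" on stdin. Succeed when more
# than half of the listed months have more than $1 messages (default 20).
should_go_monthly() {
    awk -v min="${1:-20}" '
        NF >= 2 { months++; if ($2 + 0 > min + 0) busy++ }
        END {
            if (months > 0 && busy * 2 > months) exit 0
            exit 1
        }
    '
}

# Example usage:
#   if should_go_monthly 30 < counts.txt; then echo "switch to monthly"; fi
```

A five-year list with a handful of messages per month fails the test
even though its total message count is large.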

Paul


--- ../bin/monthme.jeff Sat Sep  4 17:44:09 1999
+++ monthme Sat Sep  4 20:01:51 1999
@@ -41,9 +41,18 @@
 CTRAIL=$CONFDIR/trailer-monthly.html
 [ -f $CHEAD ] && cat $CHEAD >> $MONTHINDEX
 
+MAILLIST=$(echo $MAILLIST | awk '{ print tolower($1) }')
+ESCAPED_NAME=$(echo $MAILLIST | tr '@.' '__')
+
 # Start off the page
 cat >> $MONTHINDEX <$NICKNAME mailing list:
+Monthly index for the $NICKNAME mailing list
+
+Latest Messages by Date
+
+Latest Messages by Thread
+
+By month:
 
 EOF
 
@@ -59,13 +68,43 @@
 # End month list and start search section
 cat >> $MONTHINDEX <
+Search $NICKNAME
+
 
 
 
-
-
-Restrict matched files
-
+
+Search options:
+
+
+Match: 
+All
+Any
+Boolean
+
+
+Format: 
+Long
+Short
+
+
+Sort by: 
+Score
+Date
+Name
+
+
+
+Results per page: 
+10
+20
+50
+100
+
+
+Restrict search to months:
+
+
 EOF
 
 # Optional button for searching within each month
@@ -75,11 +114,9 @@
 
 # Finish off search section
 cat >> $MONTHINDEX <
 
-
-
-
+
+
 
 
 
@@ -90,4 +127,11 @@
 ln -s $HOME/archive/$MAILLIST/maillist.html \
 $HOME/archive/$MAILLIST/index.html
 
-
+# Create link to latest indexes
+# Note: not "this month" - might be no messages yet.
+LATESTM=`/bin/ls -d1p $HOME/archive/$MAILLIST/* | grep '\/$' \
+ | sort -nr | head -1`
+rm -f $HOME/archive/$MAILLIST/latest-maillist.html
+rm -f $HOME/archive/$MAILLIST/latest-index.html
+ln -s $LATESTM/maillist.html $HOME/archive/$MAILLIST/latest-maillist.html
+ln -s $LATESTM/index.html $HOME/archive/$MAILLIST/latest-index.html


--- ../bin/digger.jeff  Sun Aug 29 19:26:14 1999
+++ digger  Sat Sep  4 19:05:45 1999
@@ -87,7 +87,10 @@
 echo "nothing_found_file: $CONF/nomatch.html">> $CFG
 echo "search_results_wrapper: $CONF/wrapper.html">> $CFG
 
-echo "limit_urls_to:$TARGET/$MAILLIST/msg"   >> $CFG
+#PSM start
+#echo "limit_urls_to:$TARGET/$MAILLIST/msg"   >> $CFG
+echo "limit_urls_to:$TARGET/$MAILLIST"   >> $CFG
+#PSM en

Re: Archiving by month

1999-09-03 Thread Jeff Breidenbach


>Your plan to allow monthly archives on mail-archive sounds really
>good.

The sorting engine already knows how to separate different lists from
each other; Paul's refinement was to narrow the definition of a list to
limit it to a one month timespan -- so to the sorting engine each
month is handled independently. The code changes were small, clever,
and integrate quite smoothly with the existing code. Best of all, the
sort engine doesn't have to save any extra state information between
runs.
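
The "each month is its own list" idea comes down to deriving a bucket
name from the message date. The mailme patches do this with GNU
date -d; the sketch below swaps in a plain awk version so it runs
without GNU date. It ignores time zones and only understands an English
month abbreviation followed by a four-digit year. The function name
month_bucket is an assumption.

```shell
#!/bin/sh
# Print the YYYY-month-MM bucket for a date string such as
# "Mon, 6 Sep 1999 21:17:09 +0100". Fails if no month/year pair is found.
month_bucket() {
    printf '%s\n' "$1" | awk '
        BEGIN {
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", n, " ")
            for (i = 1; i <= 12; i++) num[n[i]] = i
        }
        {
            for (i = 1; i < NF; i++)
                if (($i in num) && $(i + 1) ~ /^[0-9][0-9][0-9][0-9]$/) {
                    printf "%s-month-%02d\n", $(i + 1), num[$i]
                    found = 1
                    exit
                }
        }
        END { exit !found }
    '
}

# Example usage: MM=$(month_bucket "$JUSTDATE")
```

Because the bucket is a pure function of the message date, nothing has
to be remembered between runs.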

>Have you considered also allowing a way to change archives
>when an archive accumulates a certain number of messages?  

It does not appear to be an obvious extension (to me).

Jeff



Re: Archiving by month

1999-09-01 Thread PS Mitchell

Jeff,

A couple of belated comments back - you'll probably know the answers
to these already by now.

> I don't quite follow. Let's say there are a bunch of messages for
> 'sinister' in the inbox. So if it's currently August 1999, we use pick
> to select just those messages from August 1999.  But what happens if there
> are messages from July 1999 in the inbox? Do they stay there forever?
> Or do we have to loop this process over each month from the dawn of
> time? What am I missing here?

I had the impression that's how your own code worked, by just picking
out "similar" mails from the mailbox, based on destination address -
seemed sensible to me and just extended this to only picking out
mails for the same month.  Unless I'm missing something :)

> maillist.html --> most recent date index
> index.html--> most recent thread index
> ???.html  --> monthly index (either 'monthly.html' 'meta.html' or
>   some other new name.)
> 
> This allows consistency for folks from the outside who link to lists
> at mail-archive.  Even if htdig starts at maillist.html, it should
> have no problem finding the monthly index, as long as there are links
> to follow. Symlinks would probably be useful here.

My thinking was that no files would reside at the default level where they
normally do for non-monthly lists, except an index.html and
maillist.html - so these files wouldn't be replacing anything, as all the
MHonarc-generated .html's are shuffled into month-specific sub-directories,
and where a non-monthly list would find in $list/maillist.html the latest
messages index, a monthly list would just use this as a pointer to the
directories below.

You probably know this by now anyway: I think you're right, actually,
that symlinks should be used: there should be a canonical URL for
each list which points to the latest indexes: 

maillist.html -> -month-MM/maillist.html

This should be easy to do.  As you say, a cascaded monthly-specific
rc file for MHonarc with links to a separate index page (my
maillist.html, but named something else) would mean htdig indexes it
all just fine.  I forgot to think about this, as I'd done so much
work on my own rc's. :)

Oh and of course, your comments about ownership of code are fine:
you're welcome to do as you wish with my code, such as it is.  Look
forward to seeing your comments, and I'll volunteer my list to act as
guinea-pig if you like for monthly indexes: I'm very keen to get it
working as soon as I can.

Paul



Re: Archiving by month

1999-08-29 Thread Jeff Breidenbach


>1. Firstly, I've dodged your question "Do we need to automatically detect
>very low traffic lists? How?" !  The generation of monthly indexes is
>simply triggered in my scheme by the presence of a file "monthly" in vault
>directory for a specific list.

That sounds extremely reasonable. Good idea.

>2. If this file is found, pick is called in mailme with -after and -before
>to only run as far as mail messages in the same month, and they're all
>dumped in a subdirectory of archive/$MAILLIST with the format
>"-month-MM" instead of at the top level.  MHonarc then works in that
>directory for each mailme run.  Note: this relies on "pick" not knowing
>some months have fewer than 31 days!  Thank goodness - just saves some
>code.

I don't quite follow. Let's say there are a bunch of messages for
'sinister' in the inbox. So if it's currently August 1999, we use pick
to select just those messages from August 1999.  But what happens if there
are messages from July 1999 in the inbox? Do they stay there forever?
Or do we have to loop this process over each month from the dawn of
time? What am I missing here?
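
To make the question concrete, here is a minimal sketch of one way to
avoid looping over every month: read each message's Date header once and
bucket it by (year, month), so a stray July message lands in July's
bucket even when the run happens in August. This is an illustration
only, not what mailme or pick does; the function name bucket_by_month
and the sample headers are made up for the example.

```python
from collections import defaultdict
from email.utils import parsedate_to_datetime


def bucket_by_month(date_headers):
    """Group raw Date header strings by (year, month)."""
    buckets = defaultdict(list)
    for header in date_headers:
        when = parsedate_to_datetime(header)
        buckets[(when.year, when.month)].append(header)
    return dict(buckets)


# Hypothetical inbox: one late arrival from July, two messages from August.
inbox = [
    "Thu, 29 Jul 1999 10:15:00 +0000",
    "Sun, 29 Aug 1999 18:02:00 +0000",
    "Mon, 30 Aug 1999 09:30:00 +0000",
]
buckets = bucket_by_month(inbox)
for (year, month), items in sorted(buckets.items()):
    print(year, month, len(items))
```

Only months that actually have mail in the inbox get a bucket, so
nothing is left behind and nothing needs to iterate from the dawn of
time.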

>3. mailme also creates a page at the original level, archive/$MAILLIST
>called "maillist.html", duplicating the MHonarc-generated file previously.
>This serves two purposes: (a) acts as the index page to each month for
>every list that's chosen to be monthly; and (b) acts as the starting point
>for htdig, because your setup expects a file "maillist.html" to be there
>for indexing.  This is the neatest way I could think to do it, and I'm
>hoping everything else falls in neatly with a minor change or two to the
>htdig conf file - I haven't looked at it properly yet.

I'd prefer something that keeps the following properties:

maillist.html --> most recent date index
index.html--> most recent thread index
???.html  --> monthly index (either 'monthly.html' 'meta.html' or
  some other new name.)

This allows consistency for folks from the outside who link to lists
at mail-archive.  Even if htdig starts at maillist.html, it should
have no problem finding the monthly index, as long as there are links
to follow. Symlinks would probably be useful here.

>Also note the maillist.html must only reference
>-month-MM/maillist.html's and not index.html's or the htdig
>indexing will duplicate (I think).

>Htdig should just index all "msg" files as before, but include those in
>subdirectories, obviously.  I can look at this further if you'd rather I
>did, I'll just need to do a little htdig reading.

htdig is reasonably smart. Even if you have 10 links to a given URL
for htdig to follow, htdig will only index that URL once. It will
happily follow links. I don't anticipate any problems, or even
configuration changes.
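
The "index once" behaviour is what any link-following crawler gets from
keeping a set of pages it has already seen. A minimal sketch of that
idea follows; it is not htdig's actual code, and the page names and the
crawl function are hypothetical.

```python
from collections import deque


def crawl(start, links):
    """Return pages in the order they are first visited.

    `links` maps a page name to the list of pages it links to.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            # A page already queued or visited is skipped, however
            # many other pages link to it.
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical layout: the top-level index and a month index both
# link to the same message page.
links = {
    "maillist.html": ["month-a/maillist.html", "month-a/msg00001.html"],
    "month-a/maillist.html": ["month-a/msg00001.html", "month-a/msg00002.html"],
}
visited = crawl("maillist.html", links)
```

Here msg00001.html is linked from two indexes but appears in the result
only once.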

>mailme does a "ln -s maillist.html index.html" for monthly lists at
>this level in mailme

Maybe not? Currently, 'index.html' is the name of the threaded index,
and 'maillist.html' is the name of the date index. The command above
will probably attempt (and fail because of the lack of the -f) to
clobber the thread index. This is probably not what you are aiming
for.
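
The failure mode is easy to reproduce outside the shell. The sketch
below uses Python's os.symlink, which, like "ln -s" without -f, refuses
to replace an existing file. The helper name link_index and the
temporary directory are assumptions for the demonstration, not part of
mailme.

```python
import os
import tempfile


def link_index(directory, target="maillist.html", name="index.html"):
    """Create `name` as a symlink to `target` without clobbering.

    Returns True if the link was created, False if `name` already existed.
    """
    path = os.path.join(directory, name)
    try:
        os.symlink(target, path)
    except FileExistsError:
        return False
    return True


with tempfile.TemporaryDirectory() as d:
    # Stand-in for an existing thread index.
    with open(os.path.join(d, "index.html"), "w") as f:
        f.write("thread index\n")
    created = link_index(d)
    with open(os.path.join(d, "index.html")) as f:
        survived = f.read()
```

With an index.html already present, the link is not created and the
thread index is left untouched.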

>You'll want to create templates, conf/heading-monthly.html and
>trailer-monthly.html which will get added to this monthly index file: I
>just haven't bothered as you'll want to create your own house style - it'll
>run as is without and show you a very basic file anyhow.

Sounds good...

>Finally mailme adds a search box to this maillist.html with the ability to
>search each month via htsearch's option "restrict" - I've sketched it out
>with some poor html for now (which I haven't checked) but you get the
>idea.

Ok, assuming you're talking about the monthly index. I also have not
looked into the htdig 'restrict' option but it sounds quite reasonable.

>1. Minor (hopefully) changes to digger so that the conf file it produces
>causes htdig to follow all links from archive/$MAILLIST/maillist.html and
>indexes msg* files in subdirectories too.

As stated previously, I think we get this for free.

>2. (Optional) decide on a trigger for "touch monthly".

Manual will be fine for the time being.

>3. (Optional) write a conf/header-monthly.html and conf/trailer-monthly.html
>and include in mailme.

This is cosmetic - we can have some very simple placeholders until the
underlying machinery is proven.

Overall, these suggestions look great, and I look forward to reading
your code. By the way, in the future you may consider using 'diff
-uNr' between original files and modified files; it makes it very easy
to read the changes. It's also ok submitting changes the way you have,
because the source is so small. You (and others) are welcome to send diffs
over the list or through personal email.

Also, just in case you are interested in software license issues,
patches have to be submitted under the BSD license or equivalent,
which basically means you get a big thank you in the FAQ, and I can
use the patch without restriction. As an aside, I may 

Archiving by month

1999-08-29 Thread PS Mitchell

Hi Jeff,

Re. mods to allow mail-archive to deal with some lists on a per-month basis.
I've had a good look and will send in a separate mail to you direct a
patched mailme.model for monthly indexes.  Take a look and see what you
think and I'll have a think meantime about the rest of your system,
although I do think other changes are trivial (hopefully!).  I'll also send
on an rcfile for the list "sinister", which I'd be obliged if you'd nuke so
I can try out.

All my changes are surrounded by "#PSM start" and "#PSM end" in the source
- sorry it's a little messy just now.  I hope you'll find the changes very
simple, I've tried to keep them so.  Please note also I haven't ever used
MH or htdig before, but I'm pretty sure what I've done is portable.

Here's the philosophy:

1. Firstly, I've dodged your question "Do we need to automatically detect
very low traffic lists? How?" !  The generation of monthly indexes is
simply triggered in my scheme by the presence of a file "monthly" in vault
directory for a specific list.  It seems to me that any old trigger could
be built into mailme to touch this file, and the implementation is dead
easy: just touch the file, nuke and rebuild.  So I haven't decided what the
trigger should be for now.  You may just want to touch monthly for lists
who ask for it for the moment (mine!), nuke and rebuild?

2. If this file is found, pick is called in mailme with -after and -before
to only run as far as mail messages in the same month, and they're all
dumped in a subdirectory of archive/$MAILLIST with the format
"-month-MM" instead of at the top level.  MHonarc then works in that
directory for each mailme run.  Note: this relies on "pick" not knowing
some months have fewer than 31 days!  Thank goodness - just saves some
code.

3. mailme also creates a page at the original level, archive/$MAILLIST
called "maillist.html", duplicating the MHonarc-generated file previously.
This serves two purposes: (a) acts as the index page to each month for
every list that's chosen to be monthly; and (b) acts as the starting point
for htdig, because your setup expects a file "maillist.html" to be there
for indexing.  This is the neatest way I could think to do it, and I'm
hoping everything else falls in neatly with a minor change or two to the
htdig conf file - I haven't looked at it properly yet.
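
Steps 1 and 2 can be sketched roughly as follows. This is an
illustration, not the patched mailme: the function names are made up,
and the sketch computes the real last day of each month instead of
relying on "pick" accepting day 31 for every month. How the resulting
dates would be passed to pick's -after and -before is left out.

```python
import calendar
import datetime
import os


def is_monthly(vault_dir):
    """True if the list's vault directory contains the "monthly" trigger file."""
    return os.path.exists(os.path.join(vault_dir, "monthly"))


def month_bounds(year, month):
    """Return (first_day, last_day) of the given month as dates."""
    # monthrange returns (weekday of the 1st, number of days in the month).
    last = calendar.monthrange(year, month)[1]
    return datetime.date(year, month, 1), datetime.date(year, month, last)


first, last = month_bounds(1999, 8)
print(first.isoformat(), last.isoformat())
```

Computing the bounds explicitly costs two lines and removes the
dependency on how a particular date parser treats "31 February".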

You'll want to create templates, conf/heading-monthly.html and
trailer-monthly.html which will get added to this monthly index file: I
just haven't bothered as you'll want to create your own house style - it'll
run as is without and show you a very basic file anyhow.  Also note the
maillist.html must only reference -month-MM/maillist.html's and not
index.html's or the htdig indexing will duplicate (I think).  mailme does a
"ln -s maillist.html index.html" for monthly lists at this level in mailme,
but I suppose index.html could be created with links to all indexes if
someone wanted it, because it won't get indexed by htdig.

Htdig should just index all "msg" files as before, but include those in
subdirectories, obviously.  I can look at this further if you'd rather I
did, I'll just need to do a little htdig reading.

Finally mailme adds a search box to this maillist.html with the ability to
search each month via htsearch's option "restrict" - I've sketched it out
with some poor html for now (which I haven't checked) but you get the
idea.


- To Do -

1. Minor (hopefully) changes to digger so that the conf file it produces
causes htdig to follow all links from archive/$MAILLIST/maillist.html and
indexes msg* files in subdirectories too.

2. (Optional) decide on a trigger for "touch monthly".

3. (Optional) write a conf/header-monthly.html and conf/trailer-monthly.html
and include in mailme.

I can do any or all of 1-3 if you like - let me know when you've had a look at
the work so far.


Finally, someone previously asked on the gossip list why bounce.pl suffered
from a broken pipe - I think that's when it's not run as root.

Let me know what you think,
Paul