Re: Archiving by month
Hi Earl, I think the best solution is for HtDig configuration to become more flexible, and that is part of their development plan. If there are problems in the meantime, I'll probably go ahead and patch MHonArc. That's probably better aesthetically than cluttering MHonArc's feature space with the workaround. Jeff
Re: Archiving by month
On September 8, 1999 at 05:06, Jeff Breidenbach wrote:

> 4) Htdig will now index the attachments. I didn't really want this, but
> I also don't want the administrative headache of running a patched
> MHonArc or a patched htdig, which is required for this sort of
> functionality. I'll revisit this either when a future version of
> htdig becomes available, or if it turns out to be a problem.

I can add back the .dir extension to the mhexternal.pl filter. I can add the option "nosubdirext" for it so users that do not want it have a way to remove the extension w/o patching the code. I know the original reason, as mentioned by a user, for removing the .dir extension was due to the stupidity of other software, but I figured there would be no harm in removing the .dir.

I can create a minor update release of MHonArc, which will include some other modifications besides the .dir thing, in the next couple of days if you desire.

--ewh
Re: Archiving by month
So what's going on?

1) Paul's latest date patches have been applied. (Thanks, Paul!)

2) A new rcfile for [EMAIL PROTECTED] has been added, and the misdated raw email has been erased. The list will be rebuilt, but not until I take care of a disk which is now at 99% capacity.

3) A bunch of names were added to the bottom of the FAQ for acknowledgements, but I'm sure I forgot several. Who did I forget to include?

4) Htdig will now index the attachments. I didn't really want this, but I also don't want the administrative headache of running a patched MHonArc or a patched htdig, which is required for this sort of functionality. I'll revisit this either when a future version of htdig becomes available, or if it turns out to be a problem.

5) For those interested, the service is now getting around 30,000 page views a day. While much of that was altavista and friends doing indexing, a reasonable chunk is real people finding information that they needed. It is a very good feeling knowing that your work is being used.
Re: [htdig] Re: Archiving by month
At 10:18 PM -0500 9/6/99, Jeff Breidenbach wrote:

>If I could say "Ignore everything that does not end in .html" or
>"only index URLs with a certain regexp" that would do the trick.
>But with the current configuration options, I just don't see how to do
>this.

There's a patch in the patch archive for restricting based on *including* only certain extensions. I can't remember the URL offhand... The 3.2 codebase has both this and full regexp support for restricting indexing and searching. However, I would prefer not to backport anything since we're nearing a 3.1.3 release, as well as a 3.2.0b1 release (the former in the next few days, the latter probably by the end of the month).

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
Re: Archiving by month
Hi htdig folks,

I'm having a bit of a problem getting what I want from the htdig configuration options. Lots of people, myself included, use htdig in conjunction with MHonArc. In the current release version of MHonArc (2.4.3, which I recently upgraded to) attachments may be stored in subdirectories as follows. The first URL is the message, while the second is the attachment. No need to follow the links, just look at their structure.

http://mail-archive.com/sinister%40majordomo.net/1997-month-08/msg00174.html
http://mail-archive.com/sinister%40majordomo.net/1997-month-08/msg00174/The_state_i_am_in.txt

My question is, using the current stable version of htdig, how can I configure it to ONLY index messages, and not index attachments? If I could say "Ignore everything that does not end in .html" or "only index URLs with a certain regexp" that would do the trick. But with the current configuration options, I just don't see how to do this.

Thanks in advance for enlightenment.

Jeff
Re: Archiving by month
On September 5, 1999 at 13:49, Jeff Breidenbach wrote:

> Paul, see how attachments end up in subdirectories, for example
> http:[EMAIL PROTECTED]/msg00459.html
> The default rcfile puts attachments in a subdirectory, with a .dir
> extension. You are probably overriding the MIMEArgs directive, or
> perhaps .html attachments are treated differently.

I think the problem is a change in mhexternal.pl of MHonArc in the naming of attachment subdirectories. It appears I did not mention it in CHANGES, but here is the SCCS delta comment on it:

D 2.7 99/06/25 13:59:18+05:00 [EMAIL PROTECTED] 22 21 3/3/228
P /home/ehood/work/perl/MHonArc/lib/mhexternal.pl
C Removed addition of ".dir" to subdir.

According to the date, it was applicable for v2.4.0 or v2.4.1. I cannot remember the exact reason for the change, but some user had problems with the ".dir" so I figured no harm (ha ha) would occur if I removed the ".dir".

I do not know how htdig works, but can it index a specified list of file types (eg: .html, .txt), or can you specify a regex/glob mask (or match) to control indexing?

--ewh
Re: Archiving by month
Hi Jeff,

I'm sorry you had problems with my previous date code - hope it hasn't caused you much grief. I just didn't realise there were mail clients out there which produced malformed date strings (well, ones that date(1) can't read at least).

The last patch to mailme I sent was (again) incorrect as it didn't take care of multiple Received: lines, and the junk on a Received line before the semicolon. I've attached a patch (again from your original source) which deals with this (it uses the top Received: line which I think is safest), and also skips the date code completely unless it's a monthly list, just so that any problems don't affect non-monthly lists at all. But providing x-archive-with-list is formatted ok, or Received: lines are sensible (which come from your server so they are), malformed dates hopefully should be a thing of the past.

With regard to attachments: I copied over your own MIMEargs (just added a "target="), and I've gone back and regenerated my archives here with your rcfile alone: and yes, I think it's with text/plain attachments that the msg* subdirectory gets created. I don't know why they don't tack .dir on the end for these attachments, at least here, but I'm sure it's not a worry.

Paul

--- mailme.jeff Mon Sep 6 21:17:09 1999
+++ mailme Mon Sep 6 21:01:19 1999
@@ -103,6 +103,25 @@
 } '
 }
 
+# Special case of grab() for Received: lines
+# This is because we want to take first occurrence
+# Example usage: cat messageheaders | receiveme
+receiveme() {
+$NAWK '
+BEGIN {
+m = "^Received:"
+}
+{
+if (match($0, m)) {
+print $0
+getline
+while ( $0 ~ /^[ \t]+/) {
+print $0; getline
+}
+}
+} '
+}
+
 # Get all email addresses, and precede them with a carat.
 # Example usage: cat RMAIL | waterfall
 waterfall () {
@@ -318,10 +337,6 @@
 X13=`echo "$T" | grab "mailing-list"` #use later
 X14=`echo "$T" | grab "list-post"`
 
-# If indexing by month, we care about the date
-DATE=`echo "$T" | grab "date"`
-JUSTDATE=`echo $DATE | sed 's/^date: //i'`
-
 # Extract email addresses
 CHANCE=$(echo $TO $CC $X1 $X2 $X3 $X4 $X5 $X6 $X7 $X8 $X9 $X10 $X11 $X14 |\
 waterfall)
@@ -453,6 +468,29 @@
 MONFLAG=$HOME/vault/$ESCAPED_NAME/monthly
 if [ -f $MONFLAG ]
 then
+
+# If indexing by month, we care about the date
+# If importing, see if x-archive-with-date is set first.
+# Use "Date:" as last resort because of mis-set clocks.
+ XDATE=`echo "$T" | grab "x-archive-with-date"`
+ [ ! "$XDATE" ] && XDATE=`echo "$T" | receiveme`
+ [ ! "$XDATE" ] && XDATE=`echo "$T" | grab "date"`
+ if [ ! "$XDATE" ]
+ then
+ emergency_divert NODATE "Unable to find any date field."
+ exit -1
+ fi
+ JUSTDATE=`echo "$XDATE" | sed -e 's/^date: //i' \
+ -e 's/^x-archive-with-date: //i' \
+ -e 's/^received:.*; //i'`
+ date -d "$JUSTDATE" > /dev/null 2>&1
+ ex=$?
+ if [ "$ex" != "0" ]
+ then
+ emergency_divert NODATE "Unable to find valid date field."
+ exit -1
+ fi
+
 echok info "Now switching to monthly indexing"
 MM=`date -d "$JUSTDATE" +"%Y-month-%m"`
 MONBEG=`date -d "$JUSTDATE" +"%m/01/%Y:00:00:00.00"`
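For anyone following along, the "use the top Received: line" idea from the patch above could also be written as a standalone function. This is only a sketch and not the code from the patch: the name first_received_date is made up for illustration, and it assumes the date stamp is whatever follows the last semicolon of the header, which holds for typical Received: lines but is not guaranteed.

```shell
# Hypothetical helper (not part of mailme): read message headers on stdin
# and print the date stamp of the topmost Received: header.
first_received_date() {
    awk '
        # A blank line ends the header block.
        /^$/ { exit }
        # Folded continuation lines of the header being collected.
        found && /^[ \t]/ { hdr = hdr " " $0; next }
        # Any other line after the first Received: header ends the collection.
        found { exit }
        /^[Rr]eceived:/ { found = 1; hdr = $0 }
        END {
            if (!found) exit 1
            # Assumption: the date is the text after the last semicolon.
            if (!sub(/^.*;[ \t]*/, "", hdr)) exit 1
            gsub(/[ \t]+/, " ", hdr)   # collapse folding whitespace
            print hdr
        }
    '
}
```

Usage would be along the lines of `echo "$T" | first_received_date`. Unlike a loop that keeps reading, this stops at the first header it finds, so later Received: lines never reach the output.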
Re: Archiving by month
First, sorry to all the folks who are planning to unsubscribe from gossip due to the increase in traffic. My guess is things will die down again within a week or three.

Paul, see how attachments end up in subdirectories, for example

http:[EMAIL PROTECTED]/msg00459.html

The default rcfile puts attachments in a subdirectory, with a .dir extension. You are probably overriding the MIMEArgs directive, or perhaps .html attachments are treated differently.

text/plain; maxwidth=87
m2h_external::filter; usename useicon subdir iconurl="../attachment.gif"

So -- sorry for the earlier email typo regarding digger. The solution is going to be the following, and we can patch up your rcfile to use subdirectories (if needed) a bit later.

echo "limit_urls_to:$TARGET/$MAILLIST/" >> $CFG
echo "exclude_urls: .mhonarc.db .htaccess .dir" >> $CFG

>Attached is b1), then. I've added some sanity checking for the date
>field so it should be pretty robust.

Good, because utterly corrupt date fields are starting to litter my error logs with complaints from 'date'. And it's not obvious that mailme will do the right thing when 'date' fails. (Guess that's why it's considered an experimental feature!)

>I won't hack bounce.pl for fear of doing wrong, but yes I'm pretty
>sure I agree now that it should set x-archive-with-date from the
>original Received: fields if possible.

It would also have to be intelligent enough to make sure its output is reasonable and machine readable, even if the input is less than clear. Additionally, it would need to work in such a way that if it was run twice over the same message, it would not mess up (i.e. just preserve the x-archive-with-date if it is already there).

Jeff
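The "run twice without messing up" requirement can be sketched in shell. bounce.pl itself is Perl and this is not its code; the function name stamp_archive_date is hypothetical, and the date to record is assumed to be worked out elsewhere and passed in as an argument.

```shell
# Hypothetical filter: add an x-archive-with-date header unless the message
# already carries one. $1 = date string; message on stdin, result on stdout.
stamp_archive_date() {
    awk -v stamp="$1" '
        BEGIN { inhdr = 1 }
        # Remember whether the header is already present.
        inhdr && tolower($0) ~ /^x-archive-with-date:/ { seen = 1 }
        # First blank line: end of headers, add ours only if missing.
        inhdr && /^$/ {
            if (!seen) print "x-archive-with-date: " stamp
            inhdr = 0
        }
        { print }
        # Message with headers only and no body separator.
        END { if (inhdr && !seen) print "x-archive-with-date: " stamp }
    '
}
```

Because an existing header is preserved as-is, piping a message through the filter a second time with a different date leaves it unchanged.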
Re: Archiving by month
Jeff Breidenbach said:

> Solution sets that I see are:
>
> b1) ask mailme to do monthly sorts off of x-archive-with-date
> headers primarily, received headers secondarily
> b2) modify bounce.pl to generate x-archive-with-date based on
> received headers.

Attached is b1), then. I've added some sanity checking for the date field so it should be pretty robust.

With this implemented, it means that mailme expects bounce.pl (or equivalent) to set the x-archive-with-date field intelligently, and doesn't do any second guessing, which is how I think it should be. If it doesn't find x-archive-with-date, it looks for Received:, and if that's not set it falls over to Date:, which is the way Jeff's rcfile deals with dates too, so is consistent. So now mailme and rcfile agree that, if you're importing, you HAVE to set an intelligent x-archive-with-date field; else, they both use Received:, which goes by the clock on Jeff's server.

I won't hack bounce.pl for fear of doing wrong, but yes I'm pretty sure I agree now that it should set x-archive-with-date from the original Received: fields if possible.

Paul

--- mailme.jeff Sun Sep 5 14:08:19 1999
+++ mailme Sun Sep 5 15:25:07 1999
@@ -319,8 +319,26 @@
 X14=`echo "$T" | grab "list-post"`
 
 # If indexing by month, we care about the date
-DATE=`echo "$T" | grab "date"`
-JUSTDATE=`echo $DATE | sed 's/^date: //i'`
+# If importing, see if x-archive-with-date is set first.
+# Use "Date:" as last resort because of mis-set clocks.
+XDATE=`echo "$T" | grab "x-archive-with-date"`
+[ ! "$XDATE" ] && XDATE=`echo "$T" | grab "received"`
+[ ! "$XDATE" ] && XDATE=`echo "$T" | grab "date"`
+if [ ! "$XDATE" ]
+then
+ emergency_divert NODATE "Unable to find any date field."
+ exit -1
+fi
+JUSTDATE=`echo "$XDATE" | sed -e 's/^date: //i' \
+ -e 's/^x-archive-with-date: //i' \
+ -e 's/^received: //i'`
+date -d "$JUSTDATE" > /dev/null 2>&1
+ex=$?
+if [ "$ex" != "0" ]
+then
+ emergency_divert NODATE "Unable to find valid date field."
+ exit -1
+fi
 
 # Extract email addresses
 CHANCE=$(echo $TO $CC $X1 $X2 $X3 $X4 $X5 $X6 $X7 $X8 $X9 $X10 $X11 $X14 |\
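The fallback order discussed above (x-archive-with-date, then Received:, then Date:) combined with the `date -d` sanity check can be sketched as one small function. This is an illustration, not the patch: month_dir_for is a made-up name, it relies on GNU date's -d option just as the patch does, and it adds -u (an assumption of this sketch) so the month boundary does not depend on the server's time zone.

```shell
# Hypothetical helper: given candidate date strings in order of preference,
# print the monthly subdirectory name (YYYY-month-MM) for the first one that
# GNU date can parse. Returns non-zero if none of them is usable.
month_dir_for() {
    for candidate in "$@"; do
        # Skip headers that were not present at all.
        [ -n "$candidate" ] || continue
        if dir=$(date -u -d "$candidate" +"%Y-month-%m" 2>/dev/null); then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}
```

A caller would pass the three extracted values in order, for example `month_dir_for "$XARCHIVE" "$RECEIVED" "$DATE"` (variable names hypothetical), and divert the message when the function fails. One difference from the patch: a malformed first choice falls through to the next candidate instead of diverting immediately.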
Re: Archiving by month
Jeff said:

> I took a closer look. Attachments are never indexed because of
>
> echo "exclude_urls: .dir">> $CFG

I'm confused here: attachments don't appear in anything called *.dir, but in a subdirectory called msg? for the relevant message - see

http://www.mail-archive.com/sinister%40majordomo.net/1997-month-08/msg00174.html

Certainly on my home setup they get indexed. If I run:

htdig -a -c /etc/htdig/sinister_majordomo_net.conf -vvv

it shows them being indexed here. And yes, we could restrict it to just html files, but again, because limit_urls_to is so hopeless, it would mean throwing away the directory check I think, because all you could put was ".html".

> So I now feel safe doing your original
>
> -echo "limit_urls_to:$TARGET/$MAILLIST/" >> $CFG
> +echo "limit_urls_to:$TARGET/$MAILLIST/msg" >> $CFG

- but the other way round, surely? My change no. 1 yesterday was just to take the "msg" away - a bit of a dirty kludge, but it preserved path info, thus my patch no. 2 which just completely reproduced your limits (including "msg") but for all named subdirectories too. So patch #2 I sent more exactly traces what you did for non-monthlies, with the cost of a little more complexity. I'm sure either would do, as long as "$TARGET/$MAILLIST/msg" itself isn't hardcoded in, as that will break monthlies because it misses all the monthly subdirectories.

> By the way, index pages are already protected
> from htdig in exactly the correct fashion (don't index the page,
> but do follow the links) by the META tag at the top of the page.

Ah yes, I missed that - you'll maybe want to add that to the monthme-generated page too then?

Re. dates:

> Solution sets that I see are:
>
> a1) ask mailme to do monthly sorts off of received headers
> a2) find another way to do imports
>
> b1) ask mailme to do monthly sorts off of x-archive-with-date
> headers primarily, received headers secondarily
> b2) modify bounce.pl to generate x-archive-with-date based on
> received headers.
>
> What I don't like is the extra complexity that all this will entail,
> and also the fact that it would require duplicating some of the
> functionality already found in MHonArc. Again, I'm going to ignore
> this all for a while and worry about installing bigger disks.

All I'd say is I don't think it's mailme's job to sort this out: the problem lies in the data source, and mailme can't really and shouldn't second-guess what the data *really* meant. Previous discussions on gossip have tended strongly to sorting order as:

x-archive-with-date:received:date

which archives new data "correctly" (relies on your local server clock) but allows imported data to be dealt with specially if the list manager's prepared to make an effort. So I'd strongly favour b), and yes, why not make bounce.pl use the original Received: headers. Good plan, I wish I had :)

If anyone wants a lesson in why relying on Date: isn't a good idea on a reasonably sized (1000 members) list, visit

http://www.mail-archive.com/sinister%40majordomo.net/

I'll sort it out when my list's up to date Jeff, I promise.

Paul
Re: Archiving by month
I took a closer look. Attachments are never indexed because of echo "exclude_urls: .dir">> $CFG so I now feel safe doing your original -echo "limit_urls_to:$TARGET/$MAILLIST/" >> $CFG +echo "limit_urls_to:$TARGET/$MAILLIST/msg" >> $CFG and I've now done so. By the way, index pages are already protected from htdig in exactly the correct fashion (don't index the page, but do follow the links) by the META tag at the top of the page. Jeff
Re: Archiving by month
The problem has the following facets:

 Computers with mis-set clocks screw up 'Date:'
 MH's 'pick -before/-after' command probably uses 'Date:'
 Imports emphasize the 'Date:' header
 Importing by bounce adds an unhelpful 'Received:' header.

Solution sets that I see are:

a1) ask mailme to do monthly sorts off of received headers
a2) find another way to do imports

b1) ask mailme to do monthly sorts off of x-archive-with-date headers primarily, received headers secondarily
b2) modify bounce.pl to generate x-archive-with-date based on received headers.

What I don't like is the extra complexity that all this will entail, and also the fact that it would require duplicating some of the functionality already found in MHonArc. Again, I'm going to ignore this all for a while and worry about installing bigger disks.

>Yes, htDig going somewhere it shouldn't might be a problem, I can see
>that. But Jeff, hasn't it been indexing attachments anyway already?
>limit_urls_to was set to foo/msg which is where the attachments are stored
>too, isn't it?

Nope. Attachments are placed in subdirectories; this is specified in the default rcfile. As far as fancy htdig configuration tricks go, I've found the htdig users mailing list extremely helpful ([EMAIL PROTECTED]) so they may be able to help. Maybe we can restrict htdig to messages only by telling htdig to only read .html files?

>That's ok as long as you don't mind me throwing these patches at you
>from time to time. I think we're nearly there, honestly.

Patches are always appreciated.

Cheers,
Jeff
Re: Archiving by month
Hi Jeff,

First off, I made a silly mistake in monthme - producing links to html in subdirectories of course doesn't work, as none of the relative links work. This is quickly solved by linking to the directory instead which makes a lot more sense anyhow. The patch I've attached (see below) fixes this.

> I applied your patch to the monthly index page generator -- and it
> looks like quite an improvement. However, relying on the 'Date:'
> header is definitely hurting, and something is inconsistent. This may
> take some work, possibly including perhaps bounce.pl modifications.
> See:
>
> http://www.mail-archive.com/sinister@majordomo.net/1904-month-06/maillist.html
> http://www.mail-archive.com/sinister@majordomo.net/1904-month-06/msg0.html

I just saw these and nearly died - actually the code's doing what it should: these dates were actually *in* the mails - misconfigured clients. I'd skimmed through the discussion in gossip on "received date" and stupidly assumed I hadn't a problem with my list. Those who don't learn from history are doomed to repeat it.

So I don't think it's a code problem, and goes back instead to the question of how to manage imports. Maybe bounce.pl could do a sanity check on "Date:" vs. "Received:" within some bounds (Michael?) but that's all I can suggest. Right now I can't see anywhere the mailme/monthme code is flaky in its date handling, although I could be missing something.

Jeff - sorry for these and I think it's best to let my list accumulate now - I'll try and deal with it later and maybe ask you just to zap those, if I can make it easy for you to do: I'll send a simple script or something if that's ok.

> The patches to digger/rcfile are good, but I am concerned about htdig
> trying to index attachments, which I don't want it to do. I'm afraid
> of htdig getting bogged down with some weird mime type. Thus, I
> haven't applied the patches until this issue gets
> resolved. Suggestions?

Yes, htDig going somewhere it shouldn't might be a problem, I can see that. But Jeff, hasn't it been indexing attachments anyway already? limit_urls_to was set to foo/msg which is where the attachments are stored too, isn't it?

But looking back I think you're right that my solution (removing "msg"!) isn't very robust. In case anyone's following this, the problem is how to ask htdig to index the following directories under a root starting point of maillist.html:

msg*.html
-month-??/msg*.html

You can add multiple "patterns" to limit_urls_to, but you can't use wildcards ("patterns"??) and multiple strings are or'd, not and'd. And you don't know how many -month-??'s there are. And if you just set it to "msg", you run the risk of a list with "msg" in its name. You also can't do anything creative with a combination of "limit_urls_to" and "limit_normalized" (I tried). Oh and you can't use "exclude_urls" because you don't know index file names for customised lists and whatever else might end up in there.

I'm sending you two more diffs with my proposed solution - again diffs from your original source. I propose monthme writes out a list of files for htdig to index, and htdig in digger uses its `...` option to include them, along with your original pattern so monthlies don't break. My version of htdig doesn't seem to mind that the file doesn't exist for non-monthly lists.

> Finally, I won't be able to work on these things myself as my current
> priority is installing additional disk space.

That's ok as long as you don't mind me throwing these patches at you from time to time. I think we're nearly there, honestly.
Paul --- ../bin/monthme.jeff Sat Sep 4 17:44:09 1999 +++ monthme Sun Sep 5 00:30:03 1999 @@ -29,6 +29,8 @@ MAILLIST=$1 NICKNAME=$2 +TARGET=http://localhost + # ### Action ### # @@ -41,9 +43,18 @@ CTRAIL=$CONFDIR/trailer-monthly.html [ -f $CHEAD ] && cat $CHEAD >> $MONTHINDEX +MAILLIST=$(echo $MAILLIST | awk '{ print tolower($1) }') +ESCAPED_NAME=$(echo $MAILLIST | tr '@.' '__') + # Start off the page cat >> $MONTHINDEX <$NICKNAME mailing list: +Monthly index for the $NICKNAME mailing list + +Latest Messages by Date + +Latest Messages by Thread + +By month: EOF @@ -59,13 +70,43 @@ # End month list and start search section cat >> $MONTHINDEX < +Search $NICKNAME + - - -Restrict matched files - + +Search options: + + +Match: +All +Any +Boolean + + +Format: +Long +Short + + +Sort by: +Score +Date +Name + + + +Results per page: +10 +20 +50 +100 + + +Restrict search to months: + + EOF # Optional button for searching within each month @@ -75,11 +116,9 @@ # Finish off search section cat >> $MONTHINDEX < - - - + + @@ -90,4 +129,16 @@ ln -s $HOME/archive/$MAILLIST/maillist.html \ $HOME/archive/$MAILLIST/index.html - +# Compile list of subdirs for htdig
Re: Archiving by month
Paul, I applied your patch to the monthly index page generator -- and it looks like quite an improvement. However, relying on the 'Date:' header is definitely hurting, and something is inconsistent. This may take some work, possibly including perhaps bounce.pl modifications. See: http://www.mail-archive.com/sinister@majordomo.net/1904-month-06/maillist.html http://www.mail-archive.com/sinister@majordomo.net/1904-month-06/msg0.html The patches to digger/rcfile are good, but I am concerned about htdig trying to index attachments, which I don't want it to do. I'm afraid of htdig getting bogged down with some weird mime type. Thus, I haven't applied the patches until this issue gets resolved. Suggestions? Finally, I won't be able to work on these things myself as my current priority is installing additional disk space. Jeff
Re: Archiving by month
Hi Jeff, Some more mods to take care of monthly searching (which is currently broken), and a couple of other suggested fixes. I've attached patches for the following (I've diff'd these with the latest source you sent me - let me know if you want the full versions): monthme: I think it lost the ESCAPED_NAME variable somewhere when it was externalised, which is why the searching wasn't working and it wasn't incorporating your html for the results page. I've also added all the search options I can that htDig allows, so that complex searching can be done. Oh, and my "restrict" HTML code was broken, now fixed, so searching by month should work. It's still not very pretty, but it works, feel free to make it look nicer. Finally, I've added some rather inelegant code to create links in the master directory to the "latest" indexes - I think these will be necessary to give list managers a canonical URL which they can reference which will go to the latest messages at any time - many users will want to bookmark this I've found in the past. I'm sure it can be done in a neater way but I can't think how just now. digger: Very minor modification to cope with htDig indexing monthly lists. I didn't know htDig before I started this and I have to say I'm not impressed with the configuration options: you can't include wildcards or "and" clauses where you might want to ("limit_urls_to") as far as I can see, and can't include string lists (only strings) where they would be really useful ("noindex_start") - the latter would mean you could use automatically-generated MHonarc HTML tags to exclude parts of the generated pages, but it won't let you. To make sure htDig doesn't index irrelevant information from index pages, I've also included a modified: rcfile: which extends your own use of the tag to the index pages too - otherwise no other changes in there. I've had to add the above to my own rcfiles, so again I'll send you them in a separate mail. 
Sorry about the changes, these are to reflect the above in my own files, and also some other fixes. Finally Michael said: > Have you considered also allowing a way to change archives > when an archive accumulates a certain number of messages? I'm sure Jeff's thought about this and I did too a little: the main problem seemed to me that a simple decision based on something like this might not actually be appropriate for lots of lists, e.g. imagine a list that's been going for 5 years and crosses the threshold but actually only has a handful of messages per month -> loads of nearly empty indexes and a list manager asking "why??". I suspect a human would choose to go to monthly indexes when (say) more than half the months had more than 20 (or 30 or 50) messages, which does increase the complexity (and therefore breakability) of Jeff's rather neat and simple code. So I left anything like that out for now. Paul --- ../bin/monthme.jeff Sat Sep 4 17:44:09 1999 +++ monthme Sat Sep 4 20:01:51 1999 @@ -41,9 +41,18 @@ CTRAIL=$CONFDIR/trailer-monthly.html [ -f $CHEAD ] && cat $CHEAD >> $MONTHINDEX +MAILLIST=$(echo $MAILLIST | awk '{ print tolower($1) }') +ESCAPED_NAME=$(echo $MAILLIST | tr '@.' 
'__') + # Start off the page cat >> $MONTHINDEX <$NICKNAME mailing list: +Monthly index for the $NICKNAME mailing list + +Latest Messages by Date + +Latest Messages by Thread + +By month: EOF @@ -59,13 +68,43 @@ # End month list and start search section cat >> $MONTHINDEX < +Search $NICKNAME + - - -Restrict matched files - + +Search options: + + +Match: +All +Any +Boolean + + +Format: +Long +Short + + +Sort by: +Score +Date +Name + + + +Results per page: +10 +20 +50 +100 + + +Restrict search to months: + + EOF # Optional button for searching within each month @@ -75,11 +114,9 @@ # Finish off search section cat >> $MONTHINDEX < - - - + + @@ -90,4 +127,11 @@ ln -s $HOME/archive/$MAILLIST/maillist.html \ $HOME/archive/$MAILLIST/index.html - +# Create link to latest indexes +# Note: not "this month" - might be no messages yet. +LATESTM=`/bin/ls -d1p $HOME/archive/$MAILLIST/* | grep '\/$' \ + | sort -nr | head -1` +rm -f $HOME/archive/$MAILLIST/latest-maillist.html +rm -f $HOME/archive/$MAILLIST/latest-index.html +ln -s $LATESTM/maillist.html $HOME/archive/$MAILLIST/latest-maillist.html +ln -s $LATESTM/index.html $HOME/archive/$MAILLIST/latest-index.html --- ../bin/digger.jeff Sun Aug 29 19:26:14 1999 +++ digger Sat Sep 4 19:05:45 1999 @@ -87,7 +87,10 @@ echo "nothing_found_file: $CONF/nomatch.html">> $CFG echo "search_results_wrapper: $CONF/wrapper.html">> $CFG -echo "limit_urls_to:$TARGET/$MAILLIST/msg" >> $CFG +#PSM start +#echo "limit_urls_to:$TARGET/$MAILLIST/msg" >> $CFG +echo "limit_urls_to:$TARGET/$MAILLIST" >> $CFG +#PSM en
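The "latest" links from the monthme patch above could be built without parsing ls output. What follows is a sketch under stated assumptions rather than the patch itself: link_latest is a hypothetical name, month directories are assumed to be named YYYY-month-MM with four-digit years (so the shell's sorted glob puts the newest last), and the links are made relative so the archive tree can be moved.

```shell
# Hypothetical helper: point latest-maillist.html and latest-index.html at
# the newest YYYY-month-MM subdirectory of the archive directory given as $1.
link_latest() {
    archive=$1
    latest=
    # Glob results come back sorted, so the last directory is the newest month.
    for d in "$archive"/*-month-*/; do
        [ -d "$d" ] || continue
        latest=${d%/}
    done
    # No monthly directories yet: nothing to link to.
    [ -n "$latest" ] || return 1
    month=${latest##*/}
    # -f replaces links left over from an earlier run.
    ln -sf "$month/maillist.html" "$archive/latest-maillist.html"
    ln -sf "$month/index.html" "$archive/latest-index.html"
}
```

As in the patch, this picks the newest directory that exists rather than "this month", since the current month may have no messages yet.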
Re: Archiving by month
>Your plan to allow monthly archives on mail-archive sounds really >good. The sorting engine already knows how to separate different lists from each other; Paul's refinement was to narrow the definition of a list to limit it to a one month timespan -- so to the sorting engine each month is handled independently. The code changes were small, clever, and integrate quite smoothly with the existing code. Best of all, the sort engine doesn't have to save any extra state information between runs. >Have you considered also allowing a way to change archives >when an archive accumulates a certain number of messages? It does not appear to be an obvious extension (to me). Jeff
Re: Archiving by month
Jeff,

A couple of belated comments back - you'll probably know the answers to these already by now.

> I don't quite follow. Let's say there are a bunch of messages for
> 'sinister' in the inbox. So if it's currently August 1999, we use pick
> to select just those messages from August 1999. But what happens if there
> are messages from July 1999 in the inbox? Do they stay there forever?
> Or do we have to loop this process over each month from the dawn of
> time? What am I missing here?

I had the impression that's how your own code worked, by just picking out "similar" mails from the mailbox, based on destination address - seemed sensible to me and just extended this to only picking out mails for the same month. Unless I'm missing something :)

> maillist.html --> most recent date index
> index.html--> most recent thread index
> ???.html --> monthly index (either 'monthly.html' 'meta.html' or
> some other new name.)
>
> This allows consistency for folks from the outside who link to lists
> at mail-archive. Even if htdig starts at maillist.html, it should
> have no problem finding the monthly index, as long as there are links
> to follow. Symlinks would probably be useful here.

My thinking was that no files would reside at the default level where they normally do for non-monthly lists, excepting an index.html and maillist.html - so these files wouldn't be replacing anything, as all the MHonarc-generated .html's are shuffled into month-specific sub-directories, and where a non-monthly list would find in $list/maillist.html the latest messages index, a monthly list would just use this as a pointer to the directories below.

You probably know this by now anyway: I think you're right, actually, that symlinks should be used: there should be a canonical URL for each list which points to the latest indexes:

maillist.html -> -month-MM/maillist.html

This should be easy to do.
As you say, a cascaded monthly-specific rc file for MHonarc with links to a separate index page (my maillist.html, but named something else) would mean htdig indexes it all just fine. I forgot to think about this, as I'd done so much work on my own rc's. :) Oh and of course, your comments about ownership of code are fine: you're welcome to do as you wish with my code, such as it is. Look forward to seeing your comments, and I'll volunteer my list to act as guinea-pig if you like for monthly indexes: I'm very keen to get it working as soon as I can. Paul
Re: Archiving by month
>1. Firstly, I've dodged your question "Do we need to automatically detect >very low traffic lists? How?" ! The generation of monthly indexes is >simply triggered in my scheme by the presence of a file "monthly" in vault >directory for a specific list. That sounds extremely reasonable. Good idea. >2. If this file is found, pick is called in mailme with -after and -before >to only run as far as mail messages in the same month, and they're all >dumped in a subdirectory of archive/$MAILLIST with the format >"-month-MM" instead of at the top level. MHonarc then works in that >directory for each mailme run. Note: this relies in "pick" not knowing >some months have less than 31 days! Thank goodness - just saves some >code. I don't quite follow. Let's say there are a bunch of messages for 'sinister' in the inbox. So if it's current August 1999, we use pick to select just those messages from August 1999. But what happens if there are messages from July 1999 in the inbox? Do they stay there forever? Or do we have to loop this process over each month from the dawn of time? What am I missing here? >3. mailme also creates a page at the original level, archive/$MAILLIST >called "maillist.html", duplicating the MHonarc-generated file previously. >This serves two purposes: (a) acts as the index page to each month for >every list that's chosen to be monthly; and (b) acts as the starting point >for htdig, because your setup expects a file "maillist.html" to be there >for indexing. This is the neatest way I could think to do it, and I'm >hoping everything else falls in neatly with a minor change or two to the >htdig conf file - I haven't looked at it properly yet. I'd prefer something that keeps the following properties: maillist.html --> most recent date index index.html--> most recent thread index ???.html --> monthly index (either 'monthly.html' 'meta.html' or some other new name.) This allows consistancy for folks from the outside who link to lists at mail-archive. 
Even if htdig starts at maillist.html, it should have no problem finding the monthly index, as long as there are links to follow. Symlinks would probably be useful here.

>Also note the maillist.html must only reference
>-month-MM/maillist.html's and not index.html's or the htdig
>indexing will duplicate (I think).
>Htdig should just index all "msg" files as before, but include those in
>subdirectories, obviously. I can look at this further if you'd rather I
>did, I'll just need to do a little htdig reading.

htdig is reasonably smart. Even if you have 10 links to a given URL for htdig to follow, htdig will only index that URL once. It will happily follow links. I don't anticipate any problems, or even configuration changes.

>mailme does a "ln -s maillist.html index.html" for monthly lists at
>this level in mailme

Maybe not? Currently, 'index.html' is the name of the threaded index, and 'maillist.html' is the name of the date index. The command above will probably attempt to clobber the thread index, and fail because it lacks the -f. This is probably not what you are aiming for.

>You'll want to create templates, conf/heading-monthly.html and
>trailer-monthly.html which will get added to this monthly index file: I
>just haven't bothered as you'll want to create your own house style - it'll
>run as is without and show you a very basic file anyhow.

Sounds good...

>Finally mailme adds a search box to this maillist.html with the ability to
>search each month via htsearch's option "restrict" - I've sketched it out
>with some poor html for now (which I haven't checked) but you get the
>idea.

OK, assuming you're talking about the monthly index. I also have not looked into the htdig 'restrict' option, but it sounds quite reasonable.

>1. Minor (hopefully) changes to digger so that the conf file it produces
>causes htdig to follow all links from archive/$MAILLIST/maillist.html and
>indexes msg* files in subdirectories too.

As stated previously, I think we get this for free.
>2. (Optional) decide on a trigger for "touch monthly".

Manual will be fine for the time being.

>3. (Optional) write a conf/header-monthly.html and conf/trailer-monthly.html
>and include in mailme.

This is cosmetic - we can have some very simple placeholders until the underlying machinery is proven.

Overall, these suggestions look great, and I look forward to reading your code. By the way, in the future you may consider using 'diff -uNr' between original files and modified files; it makes it very easy to read the changes. It's also OK to submit changes the way you have, because the source is so small. You (and others) are welcome to send diffs over the list or through personal email.

Also, just in case you are interested in software license issues, patches have to be submitted under the BSD license or equivalent, which basically means you get a big thank you in the FAQ, and I can use the patch without restriction.

As an aside, I may
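To illustrate the "ln -s maillist.html index.html" concern raised above, here is a minimal sketch in Python rather than in mailme's own shell. The helper name link_index and the directory layout are hypothetical; the only behaviour relied on is that creating a symlink over an existing file fails, so the existing thread index is left alone.

```python
import os
import tempfile
from pathlib import Path


def link_index(list_dir: Path) -> bool:
    """Point index.html at maillist.html unless index.html already exists.

    Returns True if the symlink was created, False if an existing
    index.html (for example the thread index) was left untouched.
    """
    target = list_dir / "index.html"
    try:
        # Like "ln -s" without -f: refuses to replace an existing file.
        os.symlink("maillist.html", target)
    except FileExistsError:
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        list_dir = Path(tmp)
        (list_dir / "maillist.html").write_text("date index")
        (list_dir / "index.html").write_text("thread index")
        # index.html already exists here, so nothing is linked.
        print(link_index(list_dir))
```

Whether a monthly list should keep a thread index at the top level at all is a separate decision; the sketch only shows that the link step cannot silently replace one.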
Archiving by month
Hi Jeff,

Re. mods to allow mail-archive to deal with some lists on a per-month basis. I've had a good look and will send you directly, in a separate mail, a patched mailme.model for monthly indexes. Take a look and see what you think, and I'll have a think meantime about the rest of your system, although I do think other changes are trivial (hopefully!). I'll also send on an rcfile for the list "sinister", which I'd be obliged if you'd nuke so I can try out.

All my changes are surrounded by "#PSM start" and "#PSM end" in the source - sorry it's a little messy just now. I hope you'll find the changes very simple; I've tried to keep them so. Please note also I haven't ever used MH or htdig before, but I'm pretty sure what I've done is portable.

Here's the philosophy:

1. Firstly, I've dodged your question "Do we need to automatically detect very low traffic lists? How?" ! The generation of monthly indexes is simply triggered in my scheme by the presence of a file "monthly" in the vault directory for a specific list. It seems to me that any old trigger could be built into mailme to touch this file, and the implementation is dead easy: just touch the file, nuke and rebuild. So I haven't decided what the trigger should be for now. You may just want to touch monthly for lists that ask for it for the moment (mine!), nuke and rebuild?

2. If this file is found, pick is called in mailme with -after and -before to only run as far as mail messages in the same month, and they're all dumped in a subdirectory of archive/$MAILLIST with the format "-month-MM" instead of at the top level. MHonArc then works in that directory for each mailme run. Note: this relies on "pick" not knowing some months have fewer than 31 days! Thank goodness - just saves some code.

3. mailme also creates a page at the original level, archive/$MAILLIST called "maillist.html", duplicating the previously MHonArc-generated file.
This serves two purposes: (a) acts as the index page to each month for every list that's chosen to be monthly; and (b) acts as the starting point for htdig, because your setup expects a file "maillist.html" to be there for indexing. This is the neatest way I could think to do it, and I'm hoping everything else falls in neatly with a minor change or two to the htdig conf file - I haven't looked at it properly yet.

You'll want to create templates, conf/heading-monthly.html and trailer-monthly.html, which will get added to this monthly index file: I just haven't bothered as you'll want to create your own house style - it'll run as is without them and show you a very basic file anyhow.

Also note the maillist.html must only reference -month-MM/maillist.html's and not index.html's or the htdig indexing will duplicate (I think). mailme does a "ln -s maillist.html index.html" for monthly lists at this level, but I suppose index.html could be created with links to all indexes if someone wanted it, because it won't get indexed by htdig.

Htdig should just index all "msg" files as before, but include those in subdirectories, obviously. I can look at this further if you'd rather I did; I'll just need to do a little htdig reading.

Finally mailme adds a search box to this maillist.html with the ability to search each month via htsearch's option "restrict" - I've sketched it out with some poor html for now (which I haven't checked) but you get the idea.

- To Do -

1. Minor (hopefully) changes to digger so that the conf file it produces causes htdig to follow all links from archive/$MAILLIST/maillist.html and indexes msg* files in subdirectories too.

2. (Optional) decide on a trigger for "touch monthly".

3. (Optional) write a conf/header-monthly.html and conf/trailer-monthly.html and include in mailme.

I can do any or all of 1-3 if you like - let me know when you've had a look at the work so far.
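As a rough illustration of points 1 and 2 of the philosophy above (the "monthly" trigger file and the per-month date window), here is a sketch in Python rather than in mailme's own shell. The function names are hypothetical, and computing the true last day of the month is an assumption made for illustration: the patch as described instead relies on pick not checking month lengths. How these dates would be handed to pick's -after and -before options is deliberately left out.

```python
import calendar
from datetime import date
from pathlib import Path
from typing import Tuple


def is_monthly(vault_dir: Path) -> bool:
    """A list is archived by month when its vault directory holds a file named 'monthly'."""
    return (vault_dir / "monthly").is_file()


def month_window(day: date) -> Tuple[date, date]:
    """Return the first and last calendar day of the month containing `day`."""
    # monthrange() returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(day.year, day.month)[1]
    return day.replace(day=1), day.replace(day=last_day)
```

For a message dated 15 February 1999, month_window gives 1 February 1999 and 28 February 1999, so no day-31 shortcut is needed.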
Finally, someone previously asked on the gossip list why bounce.pl suffered from a broken pipe - I think that's when it's not run as root.

Let me know what you think,

Paul