here are some site files that i downloaded from sitescooper.org and updated a few months ago, but never got around to emailing them to this list.
secondly, i have included a script that i just created to grab "yesterday's" slashdot stories with comments.

about the site files: i think at the time i originally updated them, i did check my versions against cvs (though i didn't learn that the updated versions were in cvs rather than at sitescooper.org until AFTER i'd updated mine; oh well) and mine were more recent/specific. but i'm too lazy to check cvs today (i've had enough of site files for one day ;-), so who knows if these are redundant or of any use to anybody. ymmv.

the script is something i created for scooping slashdot because i got tired of how useless the regular site file is for me. let me explain. i primarily read slashdot for the comments/postings, not the news stories. in "real-time" (while postings/moderations are still being made) i read at a threshold of "4", but after everything has settled down (a day later), i read at "5". the problem with the standard slashdot site file is that at 7 am, when i'm running sitescooper before leaving for work, there are usually a few stories posted just before that time that haven't been up long enough to attract good postings (or the postings haven't received thorough moderation). and since stories are posted all throughout the day (and night), the most recent stories will always be lacking well-moderated comments. plus, if i scoop one day at 6 am and the next day at 8 am, then i miss the stories posted between 6 and 8 am (because first thing in the morning the slashdot front page only displays stories from roughly the last 24 hours).

but then i started to think, "what if instead of grabbing the slashdot front page, i grabbed 'yesterday's edition', which would include every story from yesterday and would have plenty of comments with ample moderation?" so, not knowing how to make the site file dynamic based on the current date (though i did find some primitive/basic time stuff in the crypto-gram site file), i created a template which gets processed by a script that also calls sitescooper.

maybe someone will find this beneficial. maybe someone can improve on this. maybe someone will have a good laugh. ;-) thanks for sitescooper and the site files.

ps i've attached the referenced site files/script as text files rather than archiving them (tarball, zip), as binary files are discouraged on some mailing lists, and attached text files can be read through geocrawler. and for those who think i'm insensitive to the bandwidth-challenged (by not compressing the files): i'm emailing this through a 26.4 dial-up connection. :-)

pps in the script i log everything (pipe it through tee) so that i can search the log files on a regular basis for "SITE WARNING", to see if a website has changed and broken the corresponding site file.
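(if anyone wants that "SITE WARNING" check as a one-liner, something along these lines should do it -- it just assumes the same log directory and file naming that the script below uses:

  grep -l "SITE WARNING" ~/doc/visor/sitescooper/logs/scoop_sites.*.log

grep -l simply lists the daily log files that contain a warning.)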
#!/bin/sh
# file locations
SITESCOOPER_DIR=~/doc/visor/sitescooper
LOG_FILE=$SITESCOOPER_DIR/logs/scoop_sites.`date +%Y%m%d`.log
SITES_DIR=$SITESCOOPER_DIR/sites
PDB_DIR=$SITESCOOPER_DIR/pdb
# start today's log file with a time-date stamp
echo `date` | tee $LOG_FILE
# create site file to scoop yesterday's slashdot stories
CONTENTS_DATE=`date --date="yesterday" +%Y%m%d`
# the extra backslashes escape the slashes in the story date so they don't
# terminate sed's s/// expressions below (see the note after the script)
STORY_DATE=`date --date="yesterday" +%y\\\\/%m\\\\/%d`
sed -e "s/<<CONTENTS_DATE>>/$CONTENTS_DATE/" -e "s/<<STORY_DATE>>/$STORY_DATE/" < $SITES_DIR/slashdot.site > ~/tmp/slashdot_$CONTENTS_DATE.site
# scoop & pluck sites
#sitescooper -fullrefresh -mplucker \
sitescooper -mplucker \
-filename "Site_YYYYMMDD" -prctitle "Site: YYYYMMDD" \
-install $PDB_DIR -sites \
~/tmp/slashdot_$CONTENTS_DATE.site \
$SITES_DIR/ars_technica.site \
$SITES_DIR/linuxtoday.site \
$SITES_DIR/debian_weekly_news.site \
$SITES_DIR/weekly_news.site \
$SITES_DIR/crypto_gram.site \
2>&1 | tee -a $LOG_FILE
# delete site file created to scoop yesterday's slashdot stories
rm -f ~/tmp/slashdot_$CONTENTS_DATE.site
exit 0
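(purely as an illustration of what the template processing produces: supposing "yesterday" were 2002-01-15, CONTENTS_DATE would be 20020115 and STORY_DATE would be 02\/01\/15, and the two substituted lines in the generated ~/tmp/slashdot_20020115.site should come out roughly as

  URL: http://slashdot.org/index.pl?issue=20020115&light=1&noboxes=1&noicons=1&threshold=5
  StoryURL: http://.*slashdot.org/article.pl\?sid=02/01/15/.*

the doubled-up backslashes in STORY_DATE are only there so the slashes survive as escaped delimiters inside sed's s/// expressions; sed drops them again when it writes the replacement. the template itself is the slashdot.site further down.)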
# converted to use Palm format site, URL thanks to
# http://members.bellatlantic.net/~blumax/plink.html !
#
URL: http://linuxtoday.com/indexpalm.php3
Name: Linux Today
Levels: 2

ContentsStart: -= Highlighted stories above -- Normal news below =-
ContentsEnd: -= Filtered \[less interesting\] news below =-

StoryStart: \(Tell your friends\!\)
StoryEnd: <LI><A HREF=\"/mailprint.php3\?ltsn=.*\">Mail this story
StoryURL: /news_palm.php3.*
StoryHeadline: size="6">\s*<B>(.*?)\s*</B>
ImageURL: http://linuxtoday.com/pics/.*
# Slashdot.site -- now including comments scored 5 or higher.
# TODO: strip out the so-called "funny" comments ;)
#
# Kornelis Sietsma <korny /at/ sietsma.com>: comments support
# jm: fixed again to use light mode throughout

#URL: http://slashdot.org/index.pl?light=1&noboxes=1&noicons=1&threshold=5
URL: http://slashdot.org/index.pl?issue=<<CONTENTS_DATE>>&light=1&noboxes=1&noicons=1&threshold=5
Name: Slashdot
Levels: 2

ContentsStart: </B></FONT> \]</P>
ContentsEnd: <P><P>\[ <FONT size=2><B>

StoryURL: http://.*slashdot.org/article.pl\?sid=<<STORY_DATE>>/.*
StoryStart: </B></FONT> \]</P>
StoryEnd: <P>\[ <FONT size=2><B>

# strip out the "login" and "related links" tables, they're irrelevant offline!
# added Feb 2 2000 jm
#
StoryHTMLPreProcess: {
  # s,<P> </TD><TD> </TD><TD VALIGN="TOP">.*?<INPUT TYPE="submit" NAME="op" VALUE="Reply">,,s;
  s,<P> </TD><TD> </TD><TD VALIGN="TOP">.*?We are not responsible for them in any way.,,s;
}

# Because slashdot has so many links allowing views of stories with different
# comment levels, formats, etc., we need a way to fix or block them here.
# Unfortunately it's a bit tricky so we need to use perl code. We could just
# ignore the comments, but I guess that's missing the point of slashdot ;)
# added May 18 2000 jm
#
URLProcess: {
  # fix the URL; trim out all comment settings and use our own.
  s{^(http://.*slashdot.org/article.pl\?sid=\d+/\d+/\d+/\d+).*}
   {$1\&light=1\&noboxes=1\&noicons=1\&mode=nested\&threshold=5}g;

  if (!m,^http://slashdot.org/index.pl.issue=<<CONTENTS_DATE>>\&light=1\&noboxes=1\&noicons=1, &&
      !/mode=nested\&threshold=5/)
  {
    undef $_; # has to include these two; block it if it does not
  }
}

# skip URLs that have been archived
#StorySkipURL: http://slashdot.org/interviews/\d+/\d+/\d+/\d+.shtml

StoryHeadline: <TITLE>Slashdot \| (.*?)</TITLE>
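(to see what that URLProcess block actually does to a story link, here's a quick test with a made-up sid -- it's the same perl substitution as above, just without the site-file wrapping:

  echo 'http://slashdot.org/article.pl?sid=02/01/15/1234567&mode=flat&threshold=0' | \
    perl -pe 's{^(http://.*slashdot.org/article.pl\?sid=\d+/\d+/\d+/\d+).*}{$1&light=1&noboxes=1&noicons=1&mode=nested&threshold=5}'

which should print the same url rewritten to light mode, nested comments, threshold 5. any link that isn't the issue contents page and doesn't end up with those settings gets undef'd, i.e. skipped.)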
