first, here are some site files that i downloaded from sitescooper.org
and updated a few months ago, but never got around to emailing to this
list.

second, i have included a script i just created to grab "yesterday's"
slashdot stories with comments.

when i originally updated the site files, i did check my versions
against cvs (though i didn't learn that the updated versions were in cvs
rather than at sitescooper.org until AFTER i had updated them; oh well),
and mine were more recent/specific.  but i'm too lazy to check cvs today
(i've had enough of site files for one day ;-), so who knows whether
these are redundant or of any use to anybody.  ymmv.

i created the slashdot script because i got tired of how useless the
regular site file is for me.  let me explain.  i primarily read slashdot
for the comments/postings, not the news stories.  in "real-time" (while
postings/moderations are still being made) i read at a threshold of "4",
but after everything has settled down (a day later), i read at "5".
well, the problem with the standard slashdot site file is that at 7 am,
when i'm running sitescooper before leaving for work, there are usually
a few stories posted just before then that haven't been up long enough
to attract good postings (or the postings haven't received thorough
moderation).  and since stories are posted all throughout the day (and
night), it will always be the case that the most recent stories are
lacking well-moderated comments.  plus, if i scoop one day at 6 am and
the next day at 8 am, then i miss the stories posted between 6 and 8 am
(because first thing in the morning the slashdot front page only
displays stories from roughly the last 24 hours).

but then i started to think, "what if instead of grabbing the slashdot
front-page, i grabbed 'yesterday's edition', which would include every
story from yesterday and would have plenty of comments with ample
moderation?"  so, not knowing how to make the site file dynamic based on
the current date (though i did find some primitive/basic time stuff in
the crypto-gram site file), i created a template that gets processed by
a script, which then calls sitescooper.
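the trick can be sketched in a few lines (a minimal self-contained demo;
the template file path is made up, but the placeholder name and the GNU
date invocation mirror the real script below):

```shell
# demo of the template trick: write a one-line template containing a
# <<CONTENTS_DATE>> placeholder, then let sed fill in yesterday's date
printf 'URL: http://slashdot.org/index.pl?issue=<<CONTENTS_DATE>>\n' \
    > /tmp/slashdot.template
CONTENTS_DATE=`date --date="yesterday" +%Y%m%d`
sed "s/<<CONTENTS_DATE>>/$CONTENTS_DATE/" < /tmp/slashdot.template
```

the real script does exactly this, only with two placeholders and the
full slashdot.site template.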

maybe someone will find this beneficial.  maybe someone can improve on
this.  maybe someone will have a good laugh. ;-)

thanks for sitescooper and the site files.

ps i've attached the referenced site files/script as text files rather
than archiving them (tarball, zip) as binary files are discouraged on
some mailing lists, and attached text files can be read through
geocrawler.  and for those who think i'm insensitive to the
bandwidth-challenged (by not compressing the files), i'm emailing this
through a 26.4 dial-up connection. :-)

pps in the script i log everything (piped through tee) so that i can
search the log files on a regular basis for "SITE WARNING" to see if a
website has changed and broken the corresponding site file.
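for example, a check like this lists the daily logs that hit a warning
(the directory and log contents here are made up for the demo; the real
logs live under the LOG_FILE path set in the script):

```shell
# demo: two fake daily logs, one clean and one with a warning;
# grep -l prints only the names of the files that matched
mkdir -p /tmp/scoop_logs_demo
printf 'scooped ok\n' > /tmp/scoop_logs_demo/scoop_sites.20000517.log
printf 'SITE WARNING: page layout changed\n' \
    > /tmp/scoop_logs_demo/scoop_sites.20000518.log
grep -l "SITE WARNING" /tmp/scoop_logs_demo/*.log
```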
#!/bin/sh

# file locations
SITESCOOPER_DIR=~/doc/visor/sitescooper
LOG_FILE=$SITESCOOPER_DIR/logs/scoop_sites.`date +%Y%m%d`.log
SITES_DIR=$SITESCOOPER_DIR/sites
PDB_DIR=$SITESCOOPER_DIR/pdb

# start today's log file with a time-date stamp (LOG_FILE is per-day,
# so plain tee starts it fresh; later output is appended with tee -a)
date | tee "$LOG_FILE"

# create a site file to scoop yesterday's slashdot stories from the
# template.  CONTENTS_DATE (YYYYMMDD) selects the back-issue page;
# STORY_DATE is yy/mm/dd with each slash backslash-escaped so it can be
# pasted into the StoryURL regexp (the quadruple backslash survives the
# backtick and word expansions as one literal backslash per slash)
CONTENTS_DATE=`date --date="yesterday" +%Y%m%d`
STORY_DATE=`date --date="yesterday" +%y\\\\/%m\\\\/%d`
sed -e "s/<<CONTENTS_DATE>>/$CONTENTS_DATE/" \
    -e "s/<<STORY_DATE>>/$STORY_DATE/" \
    < $SITES_DIR/slashdot.site > ~/tmp/slashdot_$CONTENTS_DATE.site

# scoop & pluck sites
#sitescooper -fullrefresh -mplucker \
sitescooper -mplucker \
    -filename "Site_YYYYMMDD" -prctitle "Site: YYYYMMDD" \
    -install $PDB_DIR -sites \
    ~/tmp/slashdot_$CONTENTS_DATE.site \
    $SITES_DIR/ars_technica.site \
    $SITES_DIR/linuxtoday.site \
    $SITES_DIR/debian_weekly_news.site \
    $SITES_DIR/weekly_news.site \
    $SITES_DIR/crypto_gram.site \
    2>&1 | tee -a $LOG_FILE

# delete site file created to scoop yesterday's slashdot stories
rm -f ~/tmp/slashdot_$CONTENTS_DATE.site

exit 0
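to keep the scoop time consistent from day to day (and avoid the
6 am / 8 am gap mentioned above), the script could be run from cron; a
crontab entry along these lines would do (the path is illustrative):

```
# m h dom mon dow  command
30 6 * * *  $HOME/doc/visor/sitescooper/scoop_sites.sh >/dev/null 2>&1
```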

Attachment: ars_technica.site
Description: application/gmc-link

Attachment: crypto_gram.site
Description: application/gmc-link

Attachment: debian_weekly_news.site
Description: application/gmc-link

# converted to use Palm format site, URL thanks to
# http://members.bellatlantic.net/~blumax/plink.html !
#
URL: http://linuxtoday.com/indexpalm.php3
  Name: Linux Today
  Levels: 2
  ContentsStart: -= Highlighted stories above -- Normal news below =-
  ContentsEnd: -= Filtered \[less interesting\] news below =-
  StoryStart: \(Tell your friends\!\)
  StoryEnd: <LI><A HREF=\"/mailprint.php3\?ltsn=.*\">Mail this story
  StoryURL: /news_palm.php3.*
  StoryHeadline: size="6">\s*<B>(.*?)\s*</B>
  ImageURL: http://linuxtoday.com/pics/.*
# Slashdot.site -- now including comments scored 5 or higher.
# TODO: strip out the so-called "funny" comments ;)
#
# Kornelis Sietsma <korny /at/ sietsma.com>: comments support
# jm: fixed again to use light mode throughout

#URL:           http://slashdot.org/index.pl?light=1&noboxes=1&noicons=1&threshold=5
URL:            http://slashdot.org/index.pl?issue=<<CONTENTS_DATE>>&light=1&noboxes=1&noicons=1&threshold=5
Name:           Slashdot
Levels:         2

ContentsStart:  </B></FONT> \]</P>
ContentsEnd:    <P><P>\[ <FONT size=2><B>

StoryURL:       http://.*slashdot.org/article.pl\?sid=<<STORY_DATE>>/.*
StoryStart:     </B></FONT> \]</P>
StoryEnd:       <P>\[ <FONT size=2><B>

# strip out the "login" and "related links" tables, they're irrelevant offline!
# added Feb  2 2000 jm 
#
StoryHTMLPreProcess: {
#  s,<P>&nbsp;</TD><TD>&nbsp;</TD><TD VALIGN="TOP">.*?<INPUT TYPE="submit" NAME="op" VALUE="Reply">,,s;
  s,<P>&nbsp;</TD><TD>&nbsp;</TD><TD VALIGN="TOP">.*?We are not responsible for them in any way.,,s;
}

# Because slashdot has so many links allowing views of stories with different
# comment levels, formats, etc., we need a way to fix or block them here.
# Unfortunately it's a bit tricky so we need to use perl code. We could just
# ignore the comments, but I guess that's missing the point of slashdot ;)
# added May 18 2000 jm
#
URLProcess: {
  # fix the URL; trim out all comment settings and use our own.
  s{^(http://.*slashdot.org/article.pl\?sid=\d+/\d+/\d+/\d+).*}
        {$1\&light=1\&noboxes=1\&noicons=1\&mode=nested\&threshold=5}g;
        
  if (!m,^http://slashdot.org/index.pl.issue=<<CONTENTS_DATE>>\&light=1\&noboxes=1\&noicons=1,
        && !/mode=nested\&threshold=5/)
  {
    undef $_;           # has to include these two; block it if it does not
  }
}

# skip URLs that have been archived
#StorySkipURL:   http://slashdot.org/interviews/\d+/\d+/\d+/\d+.shtml
StoryHeadline:  <TITLE>Slashdot \| (.*?)</TITLE>

Attachment: weekly_news.site
Description: application/gmc-link
