please excuse me while i curse evolution (for guessing mime-types and
encoding files).

let me try this again...

On Sun, 2002-05-19 at 23:00, unlisted wrote:
> here are some site files that i downloaded from sitescooper.org and
> updated a few months ago, but never got around to emailing them to this
> list.
> 
> secondly, i have included a script that i just created to grab
> "yesterday's" slashdot stories with comments.
> 
> i think at the time that i originally updated the site files, i did
> check my versions against cvs (though i didn't learn that the updated
> versions lived in cvs rather than at sitescooper.org until AFTER i had
> updated them; oh well) and mine were more recent/specific.  but i'm too
> lazy to check cvs today (i've had enough of site files for one day ;-),
> so who knows whether these are redundant or of any use to anybody.  ymmv.
> 
> the second is a script that i created for scooping slashdot because i
> got tired of how useless the regular site file is for me.  let me
> explain.  i primarily read slashdot for the comments/postings, not the
> news stories.  in "real time" (while postings/moderations are still
> being made) i read at a threshold of "4", but after everything has
> settled down (a day later), i read at "5".  well, the problem with the
> standard slashdot site file is that at 7 am when i'm running sitescooper
> before leaving for work, there are usually a few stories that are posted
> just before that time that haven't been up long enough to attract good
> postings (or the postings haven't received thorough moderation).  and
> since stories are posted all throughout the day (and night), it will
> always be the case that the most recent stories lack well-moderated
> comments.  plus, if i scoop one day at 6 am and the next day at 8 am,
> then i miss the stories posted between 6 and 8 am the previous day
> (because the slashdot front page only displays roughly the last 24
> hours' worth of stories first thing in the morning).
> 
> but then i started to think, "what if instead of grabbing the slashdot
> front page, i instead grabbed 'yesterday's edition', which would
> include every story from yesterday and would have plenty of comments
> with ample moderation?"  so, not knowing how to make the site file
> dynamic based on the current date (though i did find some
> primitive/basic time stuff in the crypto-gram site file), i created a
> template which gets processed by a script file that also calls
> sitescooper.
> 
> maybe someone will find this beneficial.  maybe someone can improve on
> this.  maybe someone will have a good laugh. ;-)
> 
> thanks for sitescooper and the site files.
> 
> ps i've attached the referenced site files/script as text files rather
> than archiving them (tarball, zip), since binary attachments are
> discouraged on some mailing lists and attached text files can be read
> through geocrawler.  and for those who think i'm insensitive to the
> bandwidth-challenged (by not compressing the files), i'm emailing this
> through a 26.4 kbps dial-up connection. :-)
> 
> pps in the script i log everything (pipe it to tee) so that i can search
> the log file on a regular basis for "SITE WARNING" to see if a website
> has changed and broken the corresponding site file.
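
for what it's worth, a quick way to do that check (just a sketch, using the
log directory from the script below) is something like:

    # list the daily logs in which sitescooper flagged a broken site file
    grep -l "SITE WARNING" ~/doc/visor/sitescooper/logs/*.log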

#!/bin/sh

# file locations
SITESCOOPER_DIR=~/doc/visor/sitescooper
LOG_FILE=$SITESCOOPER_DIR/logs/scoop_sites.`date +%Y%m%d`.log
SITES_DIR=$SITESCOOPER_DIR/sites
PDB_DIR=$SITESCOOPER_DIR/pdb

# start today's log file with a time-date stamp
echo `date` | tee $LOG_FILE

# create site file to scoop yesterday's slashdot stories
CONTENTS_DATE=`date --date="yesterday" +%Y%m%d`
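# the story date is yy/mm/dd with the slashes backslash-escaped so that they
# survive the sed substitution below (sed uses / as its delimiter)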
STORY_DATE=`date --date="yesterday" +%y\\\\/%m\\\\/%d`
sed -e "s/<<CONTENTS_DATE>>/$CONTENTS_DATE/" -e "s/<<STORY_DATE>>/$STORY_DATE/" < $SITES_DIR/slashdot.site > ~/tmp/slashdot_$CONTENTS_DATE.site

# scoop & pluck sites
#sitescooper -fullrefresh -mplucker \
sitescooper -mplucker \
    -filename "Site_YYYYMMDD" -prctitle "Site: YYYYMMDD" \
    -install $PDB_DIR -sites \
    ~/tmp/slashdot_$CONTENTS_DATE.site \
    $SITES_DIR/ars_technica.site \
    $SITES_DIR/linuxtoday.site \
    $SITES_DIR/debian_weekly_news.site \
    $SITES_DIR/weekly_news.site \
    $SITES_DIR/crypto_gram.site \
    2>&1 | tee -a $LOG_FILE

# delete site file created to scoop yesterday's slashdot stories
rm -f ~/tmp/slashdot_$CONTENTS_DATE.site

exit 0
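
if you want the script to run unattended each morning, a crontab entry along
these lines should do it (the ~/bin/scoop_sites.sh path is just an example;
point it at wherever the script above is saved):

    # scoop the sites every morning at 7 am (script path is an example)
    0 7 * * * $HOME/bin/scoop_sites.sh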
# Ars Technica sitescooper site file
URL: http://arstechnica.com/index.html
  Name: Ars Technica
  StoryStart: <!--  The following content will be replaced by the code that pulls in the "News" content for the center area of the home page. -->
  StoryEnd: <br><i><small>Powered by <a href="http://www.amphibianweb.com/coranto/"; target="_blank">Coranto</a></small></i><br>
  StoryDiff: 1
# Crypto-Gram sitescooper site file
URL: http://www.counterpane.com/crypto-gram.html
  Name: Crypto-Gram
  Levels: 2
  ContentsStart: <P class="black-text">Our <a href=".privacy">privacy statement</a> is below.
  ContentsEnd: <P class="black-text"><BR><STRONG class="black-bold-text"><a name="trans">Translations</a></STRONG>
  StoryStart: <html>
  StoryEnd: CRYPTO-GRAM is a free monthly newsletter
  StoryURL: /crypto-gram-[[YY]]([[MM]]|[[MM-1]]|[[MM-2]]|[[MM-3]])\.html
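# note: the [[YY]]/[[MM]]/[[MM-1]] tokens above are presumably sitescooper's
# date macros (two-digit year, current month, and the preceding months) --
# the "primitive/basic time stuff" referred to in the mail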
# fixed by Derek Glidden <dglidden /at/ illusionary.com>
# Debian Weekly News sitescooper site file
URL: http://www.debian.org/News/weekly/current/issue/
  Name: Debian Weekly News
  StoryStart: <H1>
  StoryEnd: To receive this newsletter weekly in your mailbox
  StoryURL: http://www.debian.org/News/weekly/current/issue/
# Linux Today sitescooper site file
# converted to use Palm format site, URL thanks to
# http://members.bellatlantic.net/~blumax/plink.html !
#
URL: http://linuxtoday.com/indexpalm.php3
  Name: Linux Today
  Levels: 2
  ContentsStart: -= Highlighted stories above -- Normal news below =-
  ContentsEnd: -= Filtered \[less interesting\] news below =-
  StoryStart: \(Tell your friends\!\)
  StoryEnd: <LI><A HREF=\"/mailprint.php3\?ltsn=.*\">Mail this story
  StoryURL: /news_palm.php3.*
  StoryHeadline: size="6">\s*<B>(.*?)\s*</B>
  ImageURL: http://linuxtoday.com/pics/.*
# Slashdot sitescooper site file
#
# Kornelis Sietsma <korny /at/ sietsma.com>: comments support
# jm: fixed again to use light mode throughout

#URL:           http://slashdot.org/index.pl?light=1&noboxes=1&noicons=1&threshold=4
URL:            http://slashdot.org/index.pl?issue=<<CONTENTS_DATE>>&light=1&noboxes=1&noicons=1&threshold=4
Name:           Slashdot
Levels:         2

ContentsStart:  </B></FONT> \]</P>
ContentsEnd:    <P><P>\[ <FONT size=2><B>

StoryURL:       http://.*slashdot.org/article.pl\?sid=<<STORY_DATE>>/.*
StoryStart:     </B></FONT> \]</P>
StoryEnd:       <P>\[ <FONT size=2><B>

# strip out the "login" and "related links" tables, they're irrelevant offline!
# added Feb  2 2000 jm 
#
StoryHTMLPreProcess: {
#  s,<P>&nbsp;</TD><TD>&nbsp;</TD><TD VALIGN="TOP">.*?<INPUT TYPE="submit" NAME="op" VALUE="Reply">,,s;
  s,<P>&nbsp;</TD><TD>&nbsp;</TD><TD VALIGN="TOP">.*?We are not responsible for them in any way.,,s;
}

# Because slashdot has so many links allowing views of stories with different
# comment levels, formats, etc., we need a way to fix or block them here.
# Unfortunately it's a bit tricky so we need to use perl code. We could just
# ignore the comments, but I guess that's missing the point of slashdot ;)
# added May 18 2000 jm
#
URLProcess: {
  # fix the URL; trim out all comment settings and use our own.
  s{^(http://.*slashdot.org/article.pl\?sid=\d+/\d+/\d+/\d+).*}
        {$1\&light=1\&noboxes=1\&noicons=1\&mode=nested\&threshold=4}g;
        
  if (!m,^http://slashdot.org/index.pl.issue=<<CONTENTS_DATE>>\&light=1\&noboxes=1\&noicons=1,
        && !/mode=nested\&threshold=4/)
  {
    undef $_;           # has to include these two; block it if it does not
  }
}

# skip URLs that have been archived
#StorySkipURL:   http://slashdot.org/interviews/\d+/\d+/\d+/\d+.shtml
StoryHeadline:  <TITLE>Slashdot \| (.*?)</TITLE>
# Linux Weekly News sitescooper site file
URL: http://www.lwn.net/
  Name: Linux Weekly News
  Levels: 2
  ContentsStart: <!-- Leading stuff goes here --->
  ContentsEnd: <td bgcolor=".ffffcc">&nbsp;</td></tr>\s*</table>

  StoryStart: </td> <td valign="top">
  StoryEnd: <td bgcolor=".ffffcc">&nbsp;</td></tr>\s*</table>

  StoryURL: http://.*lwn.net/[[YYYY]]/([[MM]]|[[MM-1]])\d\d/\S+.php3
  StoryURL: http://.*lwn.net/features/.*

  StoryHeadline: <h1><a name=".*">(.*?)</a></h1>

  StoryHTMLPreProcess: {
    s,<td valign="top" bgcolor=".ffffcc" width=150>\s*<img src="/images/sp.gif" height=1 width=150 alt=""><br><b>.*?</td></tr><tr><td bgcolor=".ffffcc">&nbsp;</td><td bgcolor=".ffffcc"><p align=right>,,s;
  }
