here's a site file for Linux Weekly News (lwn.net), specifically the
"Weekly Edition" (<http://lwn.net/current>, as linked to on the front
page).

note: the scoops can get pretty big (eg 285079 KB), as i scoop the
Weekly Edition and any "Full Story" links, but not "Comments" links.  i
do still get the comments at the end of "Full Story" pages, which i may
work on pruning off with StoryHTMLPreProcess, but there usually aren't
any comments so it's not a high priority.

if you don't want to follow "Full Story" links, so as to reduce the size
of the scoop, then just comment out the StoryFollowLinks directive in
the site file, though you'll still follow "Full Story" links that are
found on the first page.  i tried to point sitescooper to just the first
page and have it automagically follow the "Next page:" links (as
advertised), but it didn't work, and i haven't looked at the sitescooper
code to see exactly what it looks for and whether LWN's "Next page:"
links qualify.  is there a way to manually specify (eg with a directive)
what it looks for when following consecutive "pages"?

so to follow the "Next page:" links, i had to specify a StoryURL.  the
"Full Story" links fit that pattern, but the "Comments" links don't (the
two differ only by a trailing forward-slash).
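
to illustrate the trailing-slash point, here's a rough python sketch
that checks two made-up URLs against the StoryURL pattern from the site
file below.  the article number 12345 and the helper name is_story_url
are hypothetical, and i'm assuming the pattern has to match the whole
URL; i haven't checked how sitescooper actually anchors it.

```python
import re

# Pattern copied from the StoryURL line of the site file.
STORY_URL = r"http://lwn.net/Articles/\d+/"


def is_story_url(url: str) -> bool:
    """Return True if the whole URL matches the StoryURL pattern."""
    # Assumption: the pattern must match the entire URL, not just a prefix.
    return re.fullmatch(STORY_URL, url) is not None


# Hypothetical article number, for illustration only.
full_story = "http://lwn.net/Articles/12345/"  # "Full Story" style link
comments = "http://lwn.net/Articles/12345"     # "Comments" style link

print(is_story_url(full_story))  # True: ends with the trailing slash
print(is_story_url(comments))    # False: no trailing slash
```

if the match turns out not to be anchored at the end, the pattern would
still fail on the second URL, since the trailing slash is required
either way.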

just created this two or three weeks ago, and since the weekly edition
is only published once a week, i haven't had many editions/samples to
test the site file against.  heck, i've been too busy to even read the
scoops except for the first one, but at least the site file scoops
something (which is more than the old site file did).

anyways...
-- 

PLEASE REQUEST PERMISSION TO REDISTRIBUTE
   AUTHOR'S COMMENTS OR EMAIL ADDRESS.
# Linux Weekly News sitescooper site file
URL: http://lwn.net/current/
  Name: Linux Weekly News
  Levels: 2
  ContentsStart: <!-- template MiddleColumn -->
  ContentsEnd: <!-- Below ends the full table. -->

  StoryStart: <!-- template MiddleColumn -->
  StoryEnd: <!-- Below ends the full table. -->

  StoryURL: http://lwn.net/Articles/\d+/

  StoryHeadline: <tr><td class="Headline"><div class="C2HL"><b>(.*?)</b></div>

  StoryFollowLinks: 1
