Hi all,
I've just joined the list, so apologies if I'm repeating somebody else's work.
Recently the Globe and Mail (Canada's national newspaper) did a major site
redesign, and all of my sitescooper .site files stopped working.
I spent a little time this weekend and came up with the attached updates:
globe_and_mail_national.site - National news
globe_and_mail_columnists.site - Noted Columnists
globe_and_mail_toronto.site - Toronto news
globe_and_mail_thearts.site - Arts and Entertainment
One problem I've noticed is that the same stories sometimes get scooped
more than once. I think this is because the story URLs contain a lot of
parameters, which fools sitescooper into thinking that it hasn't seen a
URL when it actually has.
Is there a way around this? I'm thinking of some kind of URL
transformation hook that would run right before sitescooper does its cache
check. Then I could strip out all but the essential story info from the URL.
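To make the idea concrete, here is a rough sketch of the kind of
transformation I have in mind. It is written in Python purely as an
illustration and is not sitescooper code or an existing sitescooper hook.
The parameter names in ESSENTIAL_PARAMS, the example URLs, and the
function names are all made up for the example; the real story URLs may
use different names.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical: the query parameters that identify a story. The real
# parameter names used by the site are an assumption here.
ESSENTIAL_PARAMS = ("story_id", "hub")


def normalize_story_url(url, keep=ESSENTIAL_PARAMS):
    """Return url with only the `keep` query parameters, in sorted order."""
    parts = urlsplit(url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep
    ]
    # Sort so that parameter order in the original URL does not matter.
    pairs.sort()
    # Drop the fragment as well; it never identifies a different story.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), "")
    )


def unseen(urls, seen):
    """Yield URLs whose normalized form is not yet in `seen`.

    `seen` is a plain set standing in for the already-scooped cache.
    """
    for url in urls:
        key = normalize_story_url(url)
        if key not in seen:
            seen.add(key)
            yield url
```

With this, two URLs that differ only in session-style parameters or in
parameter order map to the same cache key, so the second one would be
skipped.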
BTW, to the authors of sitescooper: thanks for such a great program!
Michael
# The Globe and Mail is a general interest newspaper
# based in Toronto, Canada.
#
# This script scoops the stories presented in
# the paper's National Toronto section.
URL: http://www.globeandmail.com/generated/hubs/current/nationalToronto.html
Name: G&M Toronto
Levels: 2
ContentsStart: <!-- /fragments/nav/HubNav_National.html ends -->
ContentsEnd: Complete Index of Today's Print Headlines</b></font></a>
# StoryURL: http://www\.globeandmail\.com/servlet/\S*&hub=national
StoryURL: http://www\.globeandmail\.com/servlet\S*hub=national\S*
StoryStart: <!-- Full Story Header -->
StoryEnd: <!-- Full Story Footer -->
# This story processor slows things down a lot, but
# it removes the annoying text "PRINT EDITION" that
# appears above every story.
StoryPostProcess: {
s{^<.*><b>PRINT EDITION</b><.*>$}{}mg;
}
# The Globe and Mail is a general interest newspaper
# based in Toronto, Canada.
#
# This script scoops the National news stories
# presented on the paper's homepage.
URL: http://www.globeandmail.com/national/
Name: G&M National
Levels: 2
ContentsStart: <!-- /fragments/nav/HubNav_National.html ends -->
ContentsEnd: <b>Additional National Stories</b>
# Use the following if you want to include the "Additional National
# Stories" at the bottom of the page:
#
# ContentsEnd: <!-- /fragments/completeheadlineindex.html begins -->
# StoryURL: http://www\.globeandmail\.com/servlet/\S*&hub=national
StoryURL: http://www\.globeandmail\.com/servlet\S*hub=national\S*
StoryStart: <!-- Full Story Header -->
StoryEnd: <!-- Full Story Footer -->
# This story processor slows things down a lot, but
# it removes the annoying text "PRINT EDITION" that
# appears above every story.
StoryPostProcess: {
s{^<.*><b>PRINT EDITION</b><.*>$}{}mg;
}
# The Globe and Mail is a general interest newspaper
# based in Toronto, Canada.
#
# This script scoops the stories presented in
# the paper's "The Arts" section.
URL: http://www.globeandmail.com/thearts/
Name: G&M The Arts
Levels: 2
ContentsStart: <!-- /fragments/nav/HubNav_TheArts.html ends -->
ContentsEnd: <!-- /fragments/completeheadlineindex.html begins -->
# StoryURL: http://www\.globeandmail\.com/servlet/\S*&hub=thearts
StoryURL: http://www\.globeandmail\.com/servlet\S*hub=thearts\S*
StoryStart: <!-- Full Story Header -->
StoryEnd: <!-- Full Story Footer -->
# This story processor slows things down a lot, but
# it removes the annoying text "PRINT EDITION" that
# appears above every story.
StoryPostProcess: {
s{^<.*><b>PRINT EDITION</b><.*>$}{}mg;
}
# The Globe and Mail is a general interest newspaper
# based in Toronto, Canada.
#
# This script scoops the stories presented in
# the paper's National Columnists section.
URL: http://www.globeandmail.com/generated/hubs/current/nationalColumnists.html
Name: G&M Columnists
Levels: 2
ContentsStart: <!-- /fragments/nav/HubNav_National.html ends -->
ContentsEnd: Complete Index of Today's Print Headlines</b></font></a>
# StoryURL: http://www\.globeandmail\.com/servlet/\S*&hub=national
StoryURL: http://www\.globeandmail\.com/servlet\S*hub=national\S*
StoryStart: <!-- Full Story Header -->
StoryEnd: <!-- Full Story Footer -->
# This story processor slows things down a lot, but
# it removes the annoying text "PRINT EDITION" that
# appears above every story.
StoryPostProcess: {
s{^<.*><b>PRINT EDITION</b><.*>$}{}mg;
}
--
Michael Graham