Hi all,

I've just joined the list, so apologies if I'm repeating somebody else's work.

Recently the Globe and Mail (Canada's national newspaper) did a major site
redesign, and all of my sitescooper .site files stopped working.

I spent a little time this weekend and came up with the attached updates:

    globe_and_mail_national.site    - National news
    globe_and_mail_columnists.site  - Noted Columnists
    globe_and_mail_toronto.site     - Toronto news
    globe_and_mail_thearts.site     - Arts and Entertainment

One problem I've noticed is that sometimes the same stories get scooped
more than once.  I think this is because the story URLs contain a lot of
parameters, and this fools sitescooper into thinking that it hasn't seen
a URL when it actually has.

Is there a way around this?  I'm thinking of some kind of URL
transformation hook that would run right before sitescooper runs its cache
check.  Then I could strip out all but the essential story info from the URL.
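
To make the idea concrete, here is a rough sketch of the kind of
transformation I mean.  It is written in Python purely for illustration and
is not taken from sitescooper (which is Perl).  The parameter names "hub" and
"id", the example.com URLs, and the function names are all made up; the real
story URLs would need their own list of essential parameters.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of the query parameters that identify a story.
# The real parameter names on the site may well be different.
ESSENTIAL_PARAMS = ("hub", "id")


def normalize_story_url(url, keep=ESSENTIAL_PARAMS):
    """Return url with only the `keep` query parameters, sorted by name."""
    parts = urlsplit(url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep
    ]
    # Sort so that parameter order in the original URL does not matter.
    pairs.sort()
    # Drop the fragment as well; it never identifies a different story.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))


def already_scooped(url, seen):
    """Check the normalized URL against `seen` (a set), recording it if new."""
    key = normalize_story_url(url)
    if key in seen:
        return True
    seen.add(key)
    return False
```

With a hook like this, two URLs that differ only in session or tracking
parameters would map to the same cache key, so the second one would be
skipped instead of scooped again.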

BTW, to the authors of sitescooper: thanks for such a great program!

Michael

# The Globe and Mail is a general interest newspaper
# based in Toronto, Canada.
#
# This script scoops the stories presented in
# the paper's National Toronto section.

URL: http://www.globeandmail.com/generated/hubs/current/nationalToronto.html
  Name: G&M Toronto
  Levels: 2

  ContentsStart: <!-- /fragments/nav/HubNav_National.html ends -->
  ContentsEnd: Complete Index of Today's Print Headlines</b></font></a>

  # StoryURL: http://www\.globeandmail\.com/servlet/\S*&hub=national
  StoryURL: http://www\.globeandmail\.com/servlet\S*hub=national\S*

  StoryStart: <!-- Full Story Header -->
  StoryEnd: <!-- Full Story Footer -->


# This story processor slows things down a lot, but
# it removes the annoying text "PRINT EDITION" that
# appears above every story.

StoryPostProcess: {
    s{^<.*><b>PRINT EDITION</b><.*>$}{}mg;
}



# The Globe and Mail is a general interest newspaper
# based in Toronto, Canada.
#
# This script scoops the National news stories
# presented on the paper's homepage.


URL: http://www.globeandmail.com/national/
  Name: G&M National
  Levels: 2

  ContentsStart: <!-- /fragments/nav/HubNav_National.html ends -->
  ContentsEnd: <b>Additional National Stories</b>

  # Use the following if you want to include the "Additional National
  # Stories" at the bottom of the page:
  #
  # ContentsEnd: <!-- /fragments/completeheadlineindex.html begins -->

  # StoryURL: http://www\.globeandmail\.com/servlet/\S*&hub=national
  StoryURL: http://www\.globeandmail\.com/servlet\S*hub=national\S*

  StoryStart: <!-- Full Story Header -->
  StoryEnd: <!-- Full Story Footer -->


# This story processor slows things down a lot, but
# it removes the annoying text "PRINT EDITION" that
# appears above every story.

StoryPostProcess: {
    s{^<.*><b>PRINT EDITION</b><.*>$}{}mg;
}


# The Globe and Mail is a general interest newspaper
# based in Toronto, Canada.
#
# This script scoops the stories presented in
# the paper's "The Arts" section.

URL: http://www.globeandmail.com/thearts/
  Name: G&M The Arts
  Levels: 2

  ContentsStart: <!-- /fragments/nav/HubNav_TheArts.html ends -->
  ContentsEnd: <!-- /fragments/completeheadlineindex.html begins -->

  # StoryURL: http://www\.globeandmail\.com/servlet/\S*&hub=thearts
  StoryURL: http://www\.globeandmail\.com/servlet\S*hub=thearts\S*

  StoryStart: <!-- Full Story Header -->
  StoryEnd: <!-- Full Story Footer -->

# This story processor slows things down a lot, but
# it removes the annoying text "PRINT EDITION" that
# appears above every story.

StoryPostProcess: {
    s{^<.*><b>PRINT EDITION</b><.*>$}{}mg;
}

# The Globe and Mail is a general interest newspaper
# based in Toronto, Canada.
#
# This script scoops the stories presented in
# the paper's National Columnists section.

URL: http://www.globeandmail.com/generated/hubs/current/nationalColumnists.html
  Name: G&M Columnists
  Levels: 2

  ContentsStart: <!-- /fragments/nav/HubNav_National.html ends -->
  ContentsEnd: Complete Index of Today's Print Headlines</b></font></a>

  # StoryURL: http://www\.globeandmail\.com/servlet/\S*&hub=national
  StoryURL: http://www\.globeandmail\.com/servlet\S*hub=national\S*

  StoryStart: <!-- Full Story Header -->
  StoryEnd: <!-- Full Story Footer -->


# This story processor slows things down a lot, but
# it removes the annoying text "PRINT EDITION" that
# appears above every story.

StoryPostProcess: {
    s{^<.*><b>PRINT EDITION</b><.*>$}{}mg;
}



-- 
Michael Graham
[EMAIL PROTECTED]
