[scoop] RE: [scoop]New York Times Message 1242

Kennis Koldewyn Thu, 19 Jul 2001 12:23:16 -0700
----- Shaughan wrote -----
The New York Times site file you attached downloads as nothing but a single
# on Yahoo groups and it has been stripped off the Geocrawler archive.
Please e-mail it to me or repost it, preferably in the body of the
message---given the trouble with attachments. Thanks a lot.
----- End -----

Here you go.  Note that in the HTML file, each link should be on one very
long line (my email client will probably chop them up something terrible).
The same goes for the first ContentsURL line in the site file.

- Kennis


----- Start of new_york_times.html -----
<HTML>
<HEAD><TITLE>New York Times</TITLE></HEAD>

<BODY>
<UL>
  <LI><A HREF="http://www.nytimes.com/auth/chk_login?USERID=sitescooper&PASSWORD=sitescooper&is_continue=true&OQ=&URI=http://www.nytimes.com/pages/world/text/index.html">International</A></LI>
  <LI><A HREF="http://www.nytimes.com/auth/chk_login?USERID=sitescooper&PASSWORD=sitescooper&is_continue=true&OQ=&URI=http://www.nytimes.com/pages/national/text/index.html">National</A></LI>
  <LI><A HREF="http://www.nytimes.com/auth/chk_login?USERID=sitescooper&PASSWORD=sitescooper&is_continue=true&OQ=&URI=http://www.nytimes.com/pages/nyregion/text/index.html">N.Y. Region</A></LI>
  <LI><A HREF="http://www.nytimes.com/auth/chk_login?USERID=sitescooper&PASSWORD=sitescooper&is_continue=true&OQ=&URI=http://www.nytimes.com/pages/politics/text/index.html">Politics</A></LI>
  <LI><A HREF="http://www.nytimes.com/auth/chk_login?USERID=sitescooper&PASSWORD=sitescooper&is_continue=true&OQ=&URI=http://www.nytimes.com/pages/health/text/index.html">Health</A></LI>
  <LI><A HREF="http://www.nytimes.com/auth/chk_login?USERID=sitescooper&PASSWORD=sitescooper&is_continue=true&OQ=&URI=http://www.nytimes.com/pages/science/text/index.html">Science</A></LI>
  <LI><A HREF="http://www.nytimes.com/auth/chk_login?USERID=sitescooper&PASSWORD=sitescooper&is_continue=true&OQ=&URI=http://www.nytimes.com/pages-technology/text/index.html">Technology</A></LI>
  <LI><A HREF="http://www.nytimes.com/auth/chk_login?USERID=sitescooper&PASSWORD=sitescooper&is_continue=true&OQ=&URI=http://www.nytimes.com/pages/opinion/editorial/index.html">Editorials</A></LI>
  <LI><A HREF="http://www.nytimes.com/auth/chk_login?USERID=sitescooper&PASSWORD=sitescooper&is_continue=true&OQ=&URI=http://www.nytimes.com/pages/opinion/oped/index.html">Op-Ed</A></LI>
  <LI><A HREF="http://www.nytimes.com/auth/chk_login?USERID=sitescooper&PASSWORD=sitescooper&is_continue=true&OQ=&URI=http://www.nytimes.com/pages/opinion/letters/index.html">Letters</A></LI>
</UL>
</BODY>

</HTML>
----- End of new_york_times.html -----
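
If mail wrapping mangles the links, it may be easier to regenerate the HTML
file than to rejoin the lines by hand.  Below is a rough Python sketch of that
idea.  The login URL, credentials, and section paths are copied from the file
above; the names SECTIONS, login_url, and build_page are mine and purely
illustrative, not part of Sitescooper.

```python
# Sketch: rebuild new_york_times.html with each link on a single line.
# All URLs come from the file above; the helper names are hypothetical.

LOGIN = "http://www.nytimes.com/auth/chk_login"
CREDS = "USERID=sitescooper&PASSWORD=sitescooper&is_continue=true&OQ="

# (link text, path under http://www.nytimes.com/) in the original order.
SECTIONS = [
    ("International", "pages/world/text/index.html"),
    ("National", "pages/national/text/index.html"),
    ("N.Y. Region", "pages/nyregion/text/index.html"),
    ("Politics", "pages/politics/text/index.html"),
    ("Health", "pages/health/text/index.html"),
    ("Science", "pages/science/text/index.html"),
    ("Technology", "pages-technology/text/index.html"),
    ("Editorials", "pages/opinion/editorial/index.html"),
    ("Op-Ed", "pages/opinion/oped/index.html"),
    ("Letters", "pages/opinion/letters/index.html"),
]


def login_url(path):
    """Wrap a section path in the login redirect used by every link."""
    return f"{LOGIN}?{CREDS}&URI=http://www.nytimes.com/{path}"


def build_page():
    """Return the whole HTML file as one string, one link per line."""
    items = "\n".join(
        f'  <LI><A HREF="{login_url(path)}">{label}</A></LI>'
        for label, path in SECTIONS
    )
    return (
        "<HTML>\n<HEAD><TITLE>New York Times</TITLE></HEAD>\n\n"
        "<BODY>\n<UL>\n" + items + "\n</UL>\n</BODY>\n\n</HTML>\n"
    )


if __name__ == "__main__":
    print(build_page())
```

Redirecting the output to new_york_times.html should give a file whose links
are unbroken, whatever the mail client did to the copy above.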

----- Start of new_york_times.site -----
# New York Times
# Site file for Sitescooper (http://www.sitescooper.org/)
# Written by: Kennis Koldewyn <[EMAIL PROTECTED]>
# Last updated: 2001-01-17

URL: file://c:/Program Files/Sitescooper-3.1.0/new_york_times.html

Name: New York Times
Levels: 3

# Contents declarations:
ContentsStart: </NYT_HEADER
ContentsEnd: <NYT_FOOTER
ContentsURL: .*/pages/(world|national|nyregion|politics|health|science|opinion)/.*
ContentsURL: .*/pages-technology/.*

# Story declarations:
StoryStart: .*<NYT_HEADLINE
StoryEnd: </NYT_TEXT
StoryURL: .*/\d\d\d\d/\d\d/\d\d/.*
StoryToPrintableSub: s,(.*),\1\?printpage=yes,

# Story pre-processing:
StoryHTMLPreProcess: {
  # Remove lists of online links, inline tables, inline images, etc.:
  s,<NYT_AD.*?</NYT_AD>,,gis;
  s,<NYT_BANNER.*?</NYT_BANNER>,,gis;
  s,<NYT_INLINEBLURB.*?</?NYT_INLINEBLURB>,,gis;
  s,<NYT_INLINEIMAGE.*?</?NYT_INLINEIMAGE>,,gis;
  s,<NYT_INLINETABLE.*?</?NYT_INLINETABLE>,,gis;
  s,<NYT_LINKS.*?</NYT_LINKS>,,gis;
  s,<NYT_LINKS_ONSITE.*?</?NYT_LINKS_ONSITE>,,gis;
  s,<NYT_LINKS_OFFSITE.*?</?NYT_LINKS_OFFSITE>,,gis;

  # Remove other NYT-specific tags:
  s,<\/?NYT_.*?>,,gis;
}
----- End of new_york_times.site -----
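
For anyone who wants to sanity-check the patterns outside Sitescooper, here is
a small Python sketch of what the URL rules, the printable-page substitution,
and two of the block-removal rules appear to do.  It is only an approximation:
I am assuming the URL patterns must match the whole URL (hence re.fullmatch),
the story URL and HTML snippet in the usage note are made up, and classify,
to_printable, and strip_blocks are hypothetical names.

```python
import re

# Patterns copied from the site file above.
CONTENTS_URL = [
    r".*/pages/(world|national|nyregion|politics|health|science|opinion)/.*",
    r".*/pages-technology/.*",
]
STORY_URL = r".*/\d\d\d\d/\d\d/\d\d/.*"

# Two of the block-removal rules from StoryHTMLPreProcess (the "gis" ones).
BLOCK_RULES = [
    r"<NYT_INLINEIMAGE.*?</?NYT_INLINEIMAGE>",
    r"<NYT_LINKS_ONSITE.*?</?NYT_LINKS_ONSITE>",
]


def classify(url):
    """Return 'contents', 'story', or None for a URL.

    Assumption: a pattern has to match the entire URL.
    """
    if any(re.fullmatch(p, url) for p in CONTENTS_URL):
        return "contents"
    if re.fullmatch(STORY_URL, url):
        return "story"
    return None


def to_printable(url):
    """Mimic StoryToPrintableSub: append ?printpage=yes to the URL.

    count=1 mirrors the Perl substitution, which has no /g flag.
    """
    return re.sub(r"(.*)", r"\1?printpage=yes", url, count=1)


def strip_blocks(html):
    """Remove whole inline blocks, ignoring case and spanning newlines."""
    for rule in BLOCK_RULES:
        html = re.sub(rule, "", html, flags=re.IGNORECASE | re.DOTALL)
    return html
```

With a made-up story URL such as
http://www.nytimes.com/2001/07/19/world/19EXAMPLE.html, classify returns
"story" and to_printable appends ?printpage=yes; the section links in the HTML
file come back as "contents".  Calling strip_blocks on a snippet like
'Before<NYT_INLINEIMAGE version="1">...</NYT_INLINEIMAGE>After' leaves
'BeforeAfter'.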


_______________________________________________
Sitescooper-talk mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/sitescooper-talk