RE: [scoop] Parsing HTML files correctly

McIntosh, Jim Fri, 27 Apr 2001 10:59:11 -0700

Thanks for all the responses to my request. I don't seem to have resolved
the problem, though. I think I must be misunderstanding something fundamental
about sitescooper's operation.

I tried the following site file (guardfull.site)--->

URL: http://www.guardian.co.uk/guardian/todays_stories
Name: guardianfull
Levels: 3
ContentsURL : http://www.guardian.co.uk/.*/story/.*\.(htm|html)
StoryURL: http://www.guardian.co.uk/Print/.*\.(html|htm)

<--------------------------------------

and ran it with
perl sitescooper.pl -site site/guardfull.site -html

I expected the resulting HTML file to contain only pages that matched
http://www.guardian.co.uk/Print/.*\.(html|htm)

Instead I got many unwanted pages, including, for instance:

www.guardian.co.uk/uklatest/
www.guardian.co.uk
www.guardian.co.uk/gu_contacts/blah000.html

and only a couple of Print files.
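
As a sanity check outside sitescooper, the StoryURL pattern can be tested on its own against the pages listed above. The sketch below is only that: it assumes the pattern must match the whole URL, which may not be how sitescooper applies it. The http:// prefix on the unwanted pages and the Print URL (example.html) are made up for illustration, and the is_story helper is hypothetical.

```python
import re

# StoryURL pattern copied from the site file above.
STORY_URL = r"http://www.guardian.co.uk/Print/.*\.(html|htm)"


def is_story(url):
    # Assumption: the pattern has to match the entire URL.
    # sitescooper's real matching rules may differ.
    return re.fullmatch(STORY_URL, url) is not None


# The unwanted pages from the list above, with an assumed http:// prefix.
unwanted = [
    "http://www.guardian.co.uk/uklatest/",
    "http://www.guardian.co.uk",
    "http://www.guardian.co.uk/gu_contacts/blah000.html",
]

# Hypothetical Print URL, invented for this check only.
wanted = "http://www.guardian.co.uk/Print/example.html"

for url in unwanted + [wanted]:
    print(url, is_story(url))
```

If the unwanted pages fail this check, as they should under that assumption, then the regular expression itself is probably not what lets them through, and the cause is more likely in how the Levels, ContentsURL and StoryURL settings interact.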
