Thanks for all the responses to my request. I don't seem to have resolved
the problem though; I think I must be misunderstanding something fundamental
about sitescooper's operation.
I tried the following site file (guardfull.site)--->
URL: http://www.guardian.co.uk/guardian/todays_stories
Name: guardianfull
Levels: 3
ContentsURL : http://www.guardian.co.uk/.*/story/.*\.(htm|html)
StoryURL: http://www.guardian.co.uk/Print/.*\.(html|htm)
<--------------------------------------
and ran it with
perl sitescooper.pl -site site/guardfull.site -html
I expected the resulting HTML file to contain only pages that matched
http://www.guardian.co.uk/Print/.*\.(html|htm)
but instead I got many unwanted pages, including, for instance,
www.guardian.co.uk/uklatest/
www.guardian.co.uk
www.guardian.co.uk/gu_contacts/blah000.html
and only a couple of Print files.
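
To make my expectation concrete, here is a minimal Python sketch (not
sitescooper's own code) that applies the StoryURL pattern from the site
file to the pages listed above. It assumes the pattern has to match the
whole URL, which I have not confirmed is how sitescooper applies it. The
http:// prefixes on the unwanted pages and the Print URL are made up for
illustration.

```python
import re

# StoryURL pattern copied verbatim from guardfull.site.
STORY_URL = r"http://www.guardian.co.uk/Print/.*\.(html|htm)"


def is_story(url: str) -> bool:
    """Return True if the whole URL matches the StoryURL pattern.

    Assumption: the pattern is anchored at both ends (full match).
    """
    return re.fullmatch(STORY_URL, url) is not None


urls = [
    "http://www.guardian.co.uk/uklatest/",
    "http://www.guardian.co.uk",
    "http://www.guardian.co.uk/gu_contacts/blah000.html",
    "http://www.guardian.co.uk/Print/example.html",  # hypothetical Print page
]

for url in urls:
    # Print whether each page would count as a story under this reading.
    print("story" if is_story(url) else "skip ", url)
```

Under that reading only the Print page counts as a story, so the three
unwanted pages should not have ended up in the output.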
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jim McIntosh Phone:
Audio Power Amp Section (214) 480-2594
HPL Dept
Semiconductor Group
Texas Instruments
