[scoop] ContentsURL/StoryURL format

McIntosh, Jim Mon, 30 Apr 2001 13:32:01 -0700

Title: ContentsURL/StoryURL format

Thanks for all the responses to my request. I don't seem to have resolved
the problem, though; I think I must be misunderstanding something fundamental
about sitescooper's operation.

I tried the following site file (guardfull.site)--->

URL: http://www.guardian.co.uk/guardian/todays_stories
Name: guardianfull
Levels: 3
ContentsURL : http://www.guardian.co.uk/.*/story/.*\.(htm|html)
StoryURL: http://www.guardian.co.uk/Print/.*\.(html|htm)

<--------------------------------------

and ran it with
perl sitescooper.pl -site site/guardfull.site -html

I expected the resulting HTML file to contain only pages that matched
http://www.guardian.co.uk/Print/.*\.(html|htm)

Instead, I got many unwanted pages, including, for instance:

www.guardian.co.uk/uklatest/
www.guardian.co.uk
www.guardian.co.uk/gu_contacts/blah000.html

and only a couple of Print files.
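
As a sanity check on the patterns themselves, independent of sitescooper, here
is a minimal Python sketch that tests the two expressions from the site file
against the pages listed above. It is not sitescooper's own matching logic: it
assumes each pattern must match the whole URL, it assumes the unwanted pages
carry an http:// prefix, and the Print URL in the list is a hypothetical
example, not a real article address.

```python
import re

# Patterns copied verbatim from the site file above.
CONTENTS_URL = r"http://www.guardian.co.uk/.*/story/.*\.(htm|html)"
STORY_URL = r"http://www.guardian.co.uk/Print/.*\.(html|htm)"


def matches(pattern, url):
    # Assumption: a pattern has to match the entire URL, not just a prefix.
    return re.fullmatch(pattern, url) is not None


urls = [
    # The unwanted pages from the message, with an assumed http:// prefix.
    "http://www.guardian.co.uk/uklatest/",
    "http://www.guardian.co.uk",
    "http://www.guardian.co.uk/gu_contacts/blah000.html",
    # Hypothetical Print URL, made up for illustration only.
    "http://www.guardian.co.uk/Print/example.html",
]

# Map each URL to (matches ContentsURL, matches StoryURL).
results = {u: (matches(CONTENTS_URL, u), matches(STORY_URL, u)) for u in urls}

for url, (is_contents, is_story) in results.items():
    print(f"{url}: contents={is_contents} story={is_story}")
```

Under those assumptions, none of the three unwanted pages matches either
pattern, and only the hypothetical Print URL matches StoryURL. That suggests
the regular expressions are not what lets the extra pages through.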

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

        Jim McIntosh                             Phone:
        Audio Power Amp Section          (214) 480-2594
        HPL Dept
        Semiconductor Group
        Texas Instruments
