Re: [scoop] Parsing HTML files correctly

Justin Mason Fri, 27 Apr 2001 10:17:06 -0700
"David A. Desrosiers" said:

> Ok, clearly I have to spend a weekend and really understand how
> the back-end architecture of sitescooper works.

Thankfully, it's not too tricky!

Whereas most mirroring software gets pages to a specified link depth, and
doesn't differentiate the pages, sitescooper allows you to specify regular
expressions that signify that a page is at a certain depth.

For example, given a multi-section news site like this:

        http://foo.com/                 front page, links to:
        http://foo.com/business/        business news
        http://foo.com/food/            food news (?)
        http://foo.com/blah/            news about blah
        http://foo.com/blah/story512.html       a story page

the sitescooper model is to define regexps like this:

        URL: http://foo.com/
        ContentsURL: http://foo.com/(business|food|blah)/
        StoryURL: http://foo.com/\S+/story\d+.html

Typically each of those "levels" has a different page layout, so being
able to differentiate them like this means you can define what bits of the
page HTML to strip, based on what level the page is at.
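
To make the idea concrete, here is a minimal sketch in Python of classifying a URL by level. It is not sitescooper's actual code: the names LEVEL_PATTERNS and classify are made up for illustration, and the patterns are the ones from the example above with the literal dots escaped and the whole URL required to match, which is my assumption about the intended meaning.

```python
import re

# Hypothetical level table mirroring the URL / ContentsURL / StoryURL
# lines above. Dots are escaped so they match only a literal ".".
LEVEL_PATTERNS = [
    ("front", re.compile(r"http://foo\.com/")),
    ("contents", re.compile(r"http://foo\.com/(business|food|blah)/")),
    ("story", re.compile(r"http://foo\.com/\S+/story\d+\.html")),
]


def classify(url):
    """Return the level name whose pattern matches the whole URL, or None."""
    for level, pattern in LEVEL_PATTERNS:
        # fullmatch means the front-page pattern cannot also claim
        # longer URLs that merely start with http://foo.com/
        if pattern.fullmatch(url):
            return level
    return None


if __name__ == "__main__":
    for url in (
        "http://foo.com/",
        "http://foo.com/food/",
        "http://foo.com/blah/story512.html",
        "http://foo.com/about/",
    ):
        print(url, "->", classify(url))
```

Once a page has a level name attached, choosing which bits of HTML to strip is just a lookup keyed on that name.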

--j.

_______________________________________________
Sitescooper-talk mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/sitescooper-talk