Re: AW: [scoop] sitescooper 3.1.0 released

Justin Mason Wed, 09 May 2001 14:37:20 -0700
[EMAIL PROTECTED] said:

> After many experiments I still have not figured out how to scoop a site I am
> interested in.
> 
> the base url is: www.pro-linux.de
> on that page there is only one link of interest:
> www.pro-linux.de/news/old/index.html
> I want to start from there.
> 
> This page has links to content URLs named by a short form of the month. They
> are grouped in directories ordered by year. Examples are:
> http://www.pro-linux.de/news/old/2001/Apr.html
> or http://www.pro-linux.de/news/old/2000/Dez.html
> 
> the page www.pro-linux.de/news/old/index.html is growing every month by one
> link (the new month).
> 
> like on the current page for may
> (http://www.pro-linux.de/news/old/2001/Mai.html) there are links
> to the stories, for example: http://www.pro-linux.de/news/2001/2999.html
> 
> the monthly pages grow every day. 
> 
> My idea is to have a site file with 3 levels: start with
> www.pro-linux.de/news/old/index.html, get the issue pages like
> http://www.pro-linux.de/news/old/2001/Mai.html (and next month
> http://www.pro-linux.de/news/old/2001/Jun.html too, and so on), and finally
> the story pages linked from the issue pages.
> 
> I created the following site file:
> 
> # This is a sitescooper site file. see http://sitescooper.tsx.org/
> # by Michael Tepperis-von der Ohe, Version 0.1, 20.04.2001
> URL: http://www.pro-linux.de/news/old/index.html
>   Name: pro-linux
>   Levels: 3
>   IssueFollowLinks: 1
>   IssueURL: http://www.pro-linux.de/news/old/index.html
>   IssueURL: http://www.pro-linux.de/news/old/2001/Apr\.html
>   IssueURL: http://www.pro-linux.de/news/old/2001/Mai\.html
>   StoryURL: http://www.pro-linux.de/news/2001/\d+\.html             

Try this one, it seems to work:

  # This is a sitescooper site file. see http://sitescooper.tsx.org/
  # by Michael Tepperis-von der Ohe, Version 0.1, 20.04.2001
  URL: http://www.pro-linux.de/news/old/index.html
    Name: pro-linux
    Levels: 3
    ContentsURL: http://www.pro-linux.de/news/old/2001/[A-Z]\S+\.html
    StoryURL: http://www.pro-linux.de/news/2001/\d+\.html             


Note the "ContentsURL" instead of "IssueURL". "Issue" pages are all
at level 3, whereas you want to pick up the pages "in the middle", which
in sitescooper terms are "Contents" pages.

The other change is to the pattern in the ContentsURL line, which replaces
the per-month IssueURL lines: it now uses [A-Z]\S+\.html for the file name,
i.e. any file name that starts with a capital letter, contains no
whitespace, and ends in ".html".
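
To see which of the URLs mentioned above these two patterns would pick up,
here is a rough check in Python. It is only a sketch: the classify() helper
is made up for illustration, and it assumes a pattern has to match the
whole URL, which may not be exactly how sitescooper applies it.

```python
import re

# The two patterns from the site file above, copied verbatim.
CONTENTS = re.compile(r"http://www.pro-linux.de/news/old/2001/[A-Z]\S+\.html")
STORY = re.compile(r"http://www.pro-linux.de/news/2001/\d+\.html")


def classify(url):
    """Hypothetical helper: say which pattern, if any, matches the whole URL."""
    if CONTENTS.fullmatch(url):
        return "contents"
    if STORY.fullmatch(url):
        return "story"
    return "ignored"


if __name__ == "__main__":
    for url in (
        "http://www.pro-linux.de/news/old/index.html",
        "http://www.pro-linux.de/news/old/2001/Apr.html",
        "http://www.pro-linux.de/news/old/2001/Mai.html",
        "http://www.pro-linux.de/news/old/2000/Dez.html",
        "http://www.pro-linux.de/news/2001/2999.html",
    ):
        print(classify(url), url)
```

The 2000/Dez.html page is ignored because "2001" is hard-coded in the
pattern.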

Another enhancement would be to use the magic pattern [[YYYY]] instead
of "2001". Sitescooper will expand this to the current year.
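
Roughly, the effect of that expansion can be pictured like this in Python.
This is not sitescooper's own code; expand_year() is a hypothetical helper
that just substitutes a four-digit year for the [[YYYY]] placeholder, to
show what the resulting pattern would look like.

```python
import datetime


def expand_year(pattern, year=None):
    """Hypothetical helper: replace [[YYYY]] with a four-digit year.

    Defaults to the current year when no year is given.
    """
    if year is None:
        year = datetime.date.today().year
    return pattern.replace("[[YYYY]]", f"{year:04d}")


if __name__ == "__main__":
    # With the year pinned to 2001 this gives back the pattern used above.
    print(expand_year(r"http://www.pro-linux.de/news/[[YYYY]]/\d+\.html", 2001))
```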

--j.

_______________________________________________
Sitescooper-talk mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/sitescooper-talk