[EMAIL PROTECTED] said:
> After many experiments I still have not figured out how to scoop a site
> I am interested in.
>
> the base url is: www.pro-linux.de
> on that page there is only one link of interest:
> www.pro-linux.de/news/old/index.html
> I want to start from there.
>
> This page has links to content URLs named by a short form of the month.
> They are grouped in directories ordered by year. Examples are:
> http://www.pro-linux.de/news/old/2001/Apr.html
> or http://www.pro-linux.de/news/old/2000/Dez.html
>
> the page www.pro-linux.de/news/old/index.html is growing every month by one
> link (the new month).
>
> On each monthly page, such as the current one for May
> (http://www.pro-linux.de/news/old/2001/Mai.html), there are links
> to the stories, for example: http://www.pro-linux.de/news/2001/2999.html
>
> the monthly pages grow every day.
>
> My idea is to have a site file with 3 levels: start at
> www.pro-linux.de/news/old/index.html, get the issue pages like
> http://www.pro-linux.de/news/old/2001/Mai.html (and next month
> http://www.pro-linux.de/news/old/2001/Jun.html too, and so on), and
> finally get the story pages linked from the issue pages.
>
> I created the following site file:
>
> # This is a sitescooper site file. see http://sitescooper.tsx.org/
> # by Michael Tepperis-von der Ohe, Version 0.1, 20.04.2001
> URL: http://www.pro-linux.de/news/old/index.html
> Name: pro-linux
> Levels: 3
> IssueFollowLinks: 1
> IssueURL: http://www.pro-linux.de/news/old/index.html
> IssueURL: http://www.pro-linux.de/news/old/2001/Apr\.html
> IssueURL: http://www.pro-linux.de/news/old/2001/Mai\.html
> StoryURL: http://www.pro-linux.de/news/2001/\d+\.html

Try this one, it seems to work:

# This is a sitescooper site file. see http://sitescooper.tsx.org/
# by Michael Tepperis-von der Ohe, Version 0.1, 20.04.2001
URL: http://www.pro-linux.de/news/old/index.html
Name: pro-linux
Levels: 3
ContentsURL: http://www.pro-linux.de/news/old/2001/[A-Z]\S+\.html
StoryURL: http://www.pro-linux.de/news/2001/\d+\.html

Note the "ContentsURL" instead of "IssueURL". "Issue" pages are all
at level 3, whereas you want to pick up pages "in the middle", which
in sitescooper terms are "Contents" pages.

The only other change is to the URL used in the ContentsURL line.
It now uses [A-Z]\S+\.html for the file pattern, i.e. any filename
that starts with a capital letter, contains no whitespace, and ends in
".html".
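
To sanity-check that file pattern outside sitescooper, here is a minimal
Python sketch. It assumes that Python's re module handles this simple
pattern the same way sitescooper's Perl regexps do, and that the pattern
has to match the whole URL. The helper name is made up for illustration.

```python
import re

# The ContentsURL pattern from the site file above.
CONTENTS_URL = r"http://www.pro-linux.de/news/old/2001/[A-Z]\S+\.html"

def is_contents_page(url):
    """Hypothetical helper: True if the whole URL matches the pattern.

    Assumption: the pattern is applied to the complete URL.
    """
    return re.fullmatch(CONTENTS_URL, url) is not None

for url in [
    "http://www.pro-linux.de/news/old/2001/Apr.html",  # monthly page
    "http://www.pro-linux.de/news/old/2001/Mai.html",  # monthly page
    "http://www.pro-linux.de/news/old/index.html",     # start page
    "http://www.pro-linux.de/news/2001/2999.html",     # story page
]:
    print(url, is_contents_page(url))
```

Under those assumptions only the two monthly pages match. The start page
and the story page do not.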

Another enhancement would be to use the magic pattern [[YYYY]] instead
of "2001" in both URL patterns. Sitescooper will expand this to the
current year.

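
To illustrate what that expansion amounts to, here is a rough Python
sketch. It only mimics the effect described above and is not
sitescooper's own code. The function name and the plain string
replacement are assumptions made for illustration.

```python
import datetime

def expand_year(pattern, today=None):
    """Hypothetical stand-in for the [[YYYY]] expansion.

    Replaces the literal text [[YYYY]] with the four-digit year of
    `today`, which defaults to the current date.
    """
    today = today or datetime.date.today()
    return pattern.replace("[[YYYY]]", "%04d" % today.year)

# The two patterns from the site file, with "2001" swapped for [[YYYY]].
contents = r"http://www.pro-linux.de/news/old/[[YYYY]]/[A-Z]\S+\.html"
story = r"http://www.pro-linux.de/news/[[YYYY]]/\d+\.html"

# Pin the date to May 2001 so the result matches the hand-written file.
may_2001 = datetime.date(2001, 5, 1)
print(expand_year(contents, may_2001))
print(expand_year(story, may_2001))
```

With the date pinned to 2001, both lines come out identical to the
hard-coded patterns in the site file.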
--j.