I've found Sitescooper pretty unreliable when it comes to identifying new
content, and would like to know what I'm doing wrong. Here are two examples.
First, the James Lileks Daily Bleat column. I'm using the following .site
file for that.
URL: http://www.lileks.com/bleats/index.html
Name: Lileks Daily Bleat
Level1Diff: 1
No matter what I enter for the Diff value, Sitescooper will download that
column every day, even though it doesn't change on some days. I've played
with Level1Diff: 0, and IssueDiff, ContentsDiff, StoryDiff--none of them
seem to have any effect.
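One way to rule out the page itself changing in some small way every day (a date stamp or a rotating ad, say) is to compare saved copies from two different days outside of Sitescooper. The following is only a minimal Python sketch under that assumption. The function names `fingerprint` and `has_changed` are made up for illustration, and this is not how Sitescooper itself decides what is new.

```python
import hashlib
import re


def fingerprint(html: str) -> str:
    """Return a SHA-256 hex digest of the page with runs of whitespace collapsed."""
    normalized = re.sub(r"\s+", " ", html).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(old_html: str, new_html: str) -> bool:
    """Return True if two snapshots differ after whitespace normalization."""
    return fingerprint(old_html) != fingerprint(new_html)
```

If two saved copies of the Bleat page from days with no new column still compare as changed, something on the page other than the column text is probably different from day to day.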
I encounter the opposite problem with Jon Carroll's column. Here's the site
file I'm using there:
URL: http://www.sfgate.com/columnists/carroll/
Name: Today's Jon Carroll
Levels: 2
ContentsStart: BEGIN COLUMN RESULTS HERE
ContentsEnd: </TD></TR>
Sitescooper will correctly isolate the one link I'm interested in, but then
it shows that link as being old, when in fact it's a new column.
I tried using the -refresh switch and even THAT won't work. Here's the
output of my command:
C:\sitescooper>perl sitescooper.pl -site jon_carroll.site -refresh -doc
Reading configuration from "C:\sitescooper\sitescooper.cf".
Using site choices from "C:\sitescooper\tmp\site_choices.txt".
Restricting to sites: jon_carroll.site
Checking for availability of the "diff.exe" command...
Bad command or file name
SITE START: now scooping site "jon_carroll.site".
Reading level-2 front page: http://www.sfgate.com/columnists/carroll/
Found 1 links, examining them.
Reading: http://www.sfgate.com/cgi-bin/article.cgi?file=/chronicle/archive/2001/12/31/DD104631.DTL
2001-Dec-31: Today's Jon Carroll: no new stories, ignoring.
SITE END: done scooping site "jon_carroll.site".
Finished!
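The output above shows "Bad command or file name" right after the check for "diff.exe", which suggests no diff program was found on the PATH. Whether that has anything to do with the column being treated as old is only a guess. A small Python sketch for checking the PATH independently of Sitescooper might look like this (the helper name `find_tool` is hypothetical):

```python
import shutil


def find_tool(name: str):
    """Return the full path to an executable called `name` on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool("diff")
    print(path if path else "diff not found on PATH")
```

On Windows, `shutil.which("diff")` also tries the extensions listed in PATHEXT, so it would find a `diff.exe` if one were installed and on the PATH.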
On a related note, I'm somewhat confused as to how Sitescooper designates
the issue, contents and story links. On a three-level site, it would appear
that the first level is designated the issue level, the second the contents
level, and the third the story level. But perhaps I'm wrong here--that
assumption doesn't always seem to work.
So how does Sitescooper decide which are the issue, contents and story links
on a two-level site (like the Jon Carroll page referenced above) or a
one-level site (like the Bleat, above)?
Can I simply forget about issue-contents-story and substitute Level1,
Level2, Level3, etc.? Even there, I find that Sitescooper doesn't always
seem to be consistent about designating what's level one, what's level two, etc.
--
Mitch Wagner