I've found Sitescooper pretty unreliable when it comes to identifying new 
content, and would like to know what I'm doing wrong. Here are two examples.

First, the James Lileks Daily Bleat column. I'm using the following .site 
file for that.

      URL: http://www.lileks.com/bleats/index.html
      Name: Lileks Daily Bleat
      Level1Diff: 1

No matter what I enter for the Diff value, Sitescooper will download that 
column every day, even though it doesn't change on some days. I've played 
with Level1Diff: 0, and IssueDiff, ContentsDiff, StoryDiff--none of them 
seem to have any effect.

I encounter the opposite problem with Jon Carroll's column. Here's the site 
file I'm using there:

         URL: http://www.sfgate.com/columnists/carroll/
         Name: Today's Jon Carroll
         Levels: 2
         ContentsStart: BEGIN COLUMN RESULTS HERE
         ContentsEnd: </TD></TR>

Sitescooper correctly isolates the one link I'm interested in, but then 
reports that link as old, when in fact it's a new column.

I tried using the -refresh switch and even THAT won't work. Here's the 
output of my command:

         C:\sitescooper>perl sitescooper.pl -site jon_carroll.site -refresh -doc
         Reading configuration from "C:\sitescooper\sitescooper.cf".
         Using site choices from "C:\sitescooper\tmp\site_choices.txt".
         Restricting to sites: jon_carroll.site
         Checking for availability of the "diff.exe" command...
         Bad command or file name
         SITE START: now scooping site "jon_carroll.site".
         Reading level-2 front page: http://www.sfgate.com/columnists/carroll/
         Found 1 links, examining them.
         Reading: http://www.sfgate.com/cgi-bin/article.cgi?file=/chronicle/archive/2001/12/31/DD104631.DTL
         2001-Dec-31: Today's Jon Carroll: no new stories, ignoring.
         SITE END: done scooping site "jon_carroll.site".
         Finished!

On a related note, I'm somewhat confused as to how SiteScooper designates 
the issue, contents and story links. On a three-level site, it would appear 
that the first level is designated the issue level, the second the contents 
level, and the third the story level. But perhaps I'm wrong here--that 
assumption doesn't always seem to work.

So how does Sitescooper decide which links are the issue, contents and story 
links on a two-level site (like the Jon Carroll page referenced above) or a 
one-level site (like the Bleat, above)?

Can I simply forget about issue-contents-story and substitute Level1, 
Level2, Level3, etc.? Even there, I find that Sitescooper isn't always 
consistent about which page it designates as level 1, which as level 2, etc.


-- 
Mitch Wagner


_______________________________________________
Sitescooper-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/sitescooper-talk
