Re: [scoop] problems identifying new content

Justin Mason Tue, 01 Jan 2002 20:24:07 -0800


Mitch Wagner said:


> First, the James Lileks Daily Bleat column. I'm using the following .site 
> file for that.
> 
>       URL: http://www.lileks.com/bleats/index.html
>       Name: Lileks Daily Bleat
>       Level1Diff: 1
> 
> No matter what I enter for the Diff value, Sitescooper will download that 
> column every day, even though it doesn't change on some days. I've played 
> with Level1Diff: 0, and IssueDiff, ContentsDiff, StoryDiff--none of them 
> seem to have any effect.

I would suggest using StoryDiff in this case. Diff problems can be
difficult to figure out, as they rely on the content changing.

Another recommendation is to use StoryStart and StoryEnd to isolate just
the content you want; otherwise extraneous content might trigger a "page
is different" condition, when the interesting stuff in fact has not
changed.
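
To illustrate why isolating the story matters, here is a minimal Python
sketch. It is not Sitescooper's own code (Sitescooper is written in Perl),
and the sample pages, the `<!-- begin -->` / `<!-- end -->` markers, and the
helper names `isolate` and `fingerprint` are all made up for the example.
It only shows the general idea: compare the region between a start and an
end pattern rather than the whole page.

```python
import hashlib
import re


def isolate(page, start_pat, end_pat):
    """Return the text between the first start_pat match and the next
    end_pat match. Falls back to the rest of the page (or the whole page)
    when a marker is missing. Hypothetical helper, for illustration only."""
    start = re.search(start_pat, page)
    if not start:
        return page
    rest = page[start.end():]
    end = re.search(end_pat, rest)
    return rest[:end.start()] if end else rest


def fingerprint(text):
    """Hash the text so two fetches can be compared cheaply."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Two invented fetches of the same page: the column is identical, but the
# ad and the timestamp around it differ.
monday = "<div>Ad 123</div><!-- begin -->Same column text<!-- end --><p>Served 10:01</p>"
tuesday = "<div>Ad 456</div><!-- begin -->Same column text<!-- end --><p>Served 09:47</p>"

start_pat, end_pat = r"<!-- begin -->", r"<!-- end -->"

# Whole-page comparison: extraneous content makes the page look "new".
whole_changed = fingerprint(monday) != fingerprint(tuesday)

# Isolated comparison: only the story region is compared.
story_changed = (fingerprint(isolate(monday, start_pat, end_pat))
                 != fingerprint(isolate(tuesday, start_pat, end_pat)))
```

In this sketch the whole-page comparison reports a change while the
isolated comparison does not, which is the "page is different" false
alarm described above.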

Good site btw ;)

>          SITE START: now scooping site "jon_carroll.site".
>          Reading level-2 front page: http://www.sfgate.com/columnists/carroll/
>          Found 1 links, examining them.
>          Reading: http://www.sfgate.com/cgi-bin/article.cgi?file=/chronicle/archive/2001/12/31/DD104631.DTL
>          2001-Dec-31: Today's Jon Carroll: no new stories, ignoring.
>          SITE END: done scooping site "jon_carroll.site".

This *should* be OK. Hmm.

Could you try the "bleeding edge" development code?  I can't reproduce
this problem with the current code.

> On a related note, I'm somewhat confused as to how SiteScooper designates 
> the issue, contents and story links. On a three-level site, it would appear 
> that the first level is designated the issue level, the second the contents 
> level, and the third the story level. But perhaps I'm wrong here--that 
> assumption doesn't always seem to work.

Other way around:

        Level3 == Issue
        Level2 == Contents
        Level1 == Story

Level4, Level5, etc. do not have an "English" name, because they are very
rare (thankfully).

> So how does SiteScooper decide what's the issue, contents and story links 
> on a two-level site (like the Jon Carroll page referenced above) or a 
> one-level site (like the Bleat, above)?

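First, you tell it how many levels to expect (with Levels, obviously);
it then counts down in the tree from that number, so the front page is
the highest-numbered level and the last page reached is always Level1,
the story.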
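
The counting-down rule can be sketched in a few lines of Python. This is
an illustration of the naming scheme described in this message, not
Sitescooper's implementation; the function name `level_names` and the
"LevelN" fallback label for unnamed levels are assumptions made for the
example.

```python
# Names taken from the mapping given earlier in this message.
LEVEL_NAMES = {1: "Story", 2: "Contents", 3: "Issue"}


def level_names(levels):
    """List (number, name) pairs from the front page down to the story.

    The front page gets the highest number (the Levels value) and the
    count runs down to 1. Levels above 3 have no English name, so this
    sketch labels them "LevelN" (an assumed placeholder)."""
    return [(n, LEVEL_NAMES.get(n, "Level%d" % n))
            for n in range(levels, 0, -1)]


one_level = level_names(1)    # e.g. the Bleat: front page is the story
two_level = level_names(2)    # e.g. Jon Carroll: level-2 front page
three_level = level_names(3)
```

So on a two-level site the front page is the Contents level (which is why
the log above says "Reading level-2 front page"), and on a one-level site
the front page is itself the Story level.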

> Can I simply forget about issue-contents-story and substitute Level1, 
> Level2, Level3, etc.? Even there, I find that SiteScooper doesn't always 
> seem to be consistent about designating what's level one, what's level 2, etc.

Yep.  But bear in mind that if the patterns for those level URLs are very
open, you can wind up with a story page being misidentified as an extra
Contents page, etc.  To avoid this, make the patterns more restrictive.
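
A small Python sketch of the difference, using the Jon Carroll URLs from
the log above. The regular expressions and the `classify` helper are
invented for illustration, and whether Sitescooper anchors or matches its
URL patterns in exactly this way is an assumption; the point is only that
an open pattern can claim a URL for more than one level.

```python
import re


def classify(url, patterns):
    """Return the name of every level whose pattern matches the whole URL.
    Hypothetical helper: more than one name means the URL is ambiguous."""
    return [name for name, pat in patterns if re.fullmatch(pat, url)]


story_url = ("http://www.sfgate.com/cgi-bin/article.cgi"
             "?file=/chronicle/archive/2001/12/31/DD104631.DTL")

# Very open patterns: anything on the site matches both levels.
open_patterns = [
    ("Contents", r"http://www\.sfgate\.com/.*"),
    ("Story", r"http://www\.sfgate\.com/.*"),
]

# More restrictive patterns: each level matches only its own URL shape.
tight_patterns = [
    ("Contents", r"http://www\.sfgate\.com/columnists/carroll/"),
    ("Story", r"http://www\.sfgate\.com/cgi-bin/article\.cgi"
              r"\?file=/chronicle/archive/\d{4}/\d{2}/\d{2}/\w+\.DTL"),
]

ambiguous = classify(story_url, open_patterns)
unambiguous = classify(story_url, tight_patterns)
```

With the open patterns the story URL is claimed by both levels; with the
tighter ones it is claimed by the Story level only.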

--j.

_______________________________________________
Sitescooper-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/sitescooper-talk
