"Dwight D. McKay" said:
> I'm trying to create a site file for First Monday
> (http://www.firstmonday.org) and I've run into a problem. Sitescooper
> retrieves nothing from the site. If I grab the current issue page with
> wget and examine it, I find that the page has carriage returns rather than
> line feeds at the end of each line.
> Is there a way around this problem?
Funny, I was thinking of doing a site file for it as well; they have
some good writing up there!
I ran into the same problem, and was a little stumped, so I left it for a
while; I later ran into a similar problem site, and worked around that,
and I would theorise the firstmonday problem is similar.
First of all, the carriage return-vs-line feed issue should not cause any
trouble; sitescooper acts like a good HTML-displaying user agent and
treats all of them as just plain whitespace.
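To illustrate the whitespace point, here is a minimal sketch in Python (not sitescooper's own Perl code; the helper name collapse_whitespace is made up for this example). It shows that once runs of whitespace are collapsed, a page with carriage-return line endings yields the same text as one with line feeds.

```python
import re

def collapse_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, CR, LF) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# The same fragment with CR-only and LF-only line endings.
cr_page = "<p>First\rMonday</p>\r"
lf_page = "<p>First\nMonday</p>\n"

# Both normalise to the same string, so the line-ending style makes no difference.
print(collapse_whitespace(cr_page) == collapse_whitespace(lf_page))
```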
OK -- I've taken a look... here are the details.
Basically, sitescooper uses the LWP library to get the URLs, and it also
sets a maximum size for downloaded objects of 2 megs. LWP implements
this using the HTTP "Range" header, and the web server firstmonday.org
uses, WebSTAR, interprets this incorrectly.
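For readers unfamiliar with the "Range" header, here is a rough sketch in Python of what a size-limited request looks like. It is an illustration only, not LWP's code: the function name build_limited_request is hypothetical, and the exact byte range shown is an assumption about how such a limit is usually expressed. The request is built but never sent.

```python
import urllib.request

MAX_SIZE = 2 * 1024 * 1024  # the 2-meg limit, in bytes

def build_limited_request(url, max_size=MAX_SIZE):
    """Build (but do not send) a GET request for bytes 0 through max_size."""
    # Assumption: the size limit is expressed as a byte range starting at offset 0.
    return urllib.request.Request(url, headers={"Range": f"bytes=0-{max_size}"})

req = build_limited_request("http://firstmonday.org/issues/current_issue/")
print(req.get_header("Range"))
```

A server that handles the header properly would normally answer with either the requested portion or the whole document; the trouble described here comes from a server that does neither.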
As a result, to scoop this site, you need to comment out, or delete, this line
in "lib/Sitescooper/Main.pm":
$self->{useragent}->max_size (1024*1024*2); # 2-meg file limit
Then it works fine!
I've checked in this site file, BTW:
URL: http://firstmonday.org/issues/current_issue/
Name: First Monday
Description: a peer-reviewed journal on the internet
Levels: 2
TableRender: flatten
StoryURL: http://firstmonday.org/issues/issue\S+/\S+/(index.html|)
StoryURL: http://firstmonday.org/issues/current_issue/\S+/(index.html|)
ImageURL: .*/img/.*\.gif
ImageScaleToMaxWidth: 150
AuthorName: Dwight D. McKay and Justin Mason
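If you want to check which links the two StoryURL patterns above would pick up, a quick sketch like this one in Python can help. It assumes the patterns have to match the whole URL, which may not be exactly how sitescooper applies them, and the candidate URLs are made-up examples rather than real article addresses.

```python
import re

# The two StoryURL patterns from the site file, copied verbatim.
STORY_URL_PATTERNS = [
    r"http://firstmonday.org/issues/issue\S+/\S+/(index.html|)",
    r"http://firstmonday.org/issues/current_issue/\S+/(index.html|)",
]

def is_story_url(url):
    """Return True if the whole URL matches any of the StoryURL patterns."""
    return any(re.fullmatch(pattern, url) for pattern in STORY_URL_PATTERNS)

# Hypothetical URLs, for illustration only.
candidates = [
    "http://firstmonday.org/issues/issue6_2/example/index.html",
    "http://firstmonday.org/issues/current_issue/example/",
    "http://firstmonday.org/issues/current_issue/",  # the index page itself
]
for url in candidates:
    print(url, is_story_url(url))
```

Under that assumption the first two candidates count as stories, while the index page does not, since both patterns require an article directory below the issue.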
cheers,
--j.
_______________________________________________
Sitescooper-talk mailing list
http://lists.sourceforge.net/lists/listinfo/sitescooper-talk