"Dwight D. McKay" said:

> I'm trying to create a site file for First Monday
> (http://www.firstmonday.org) and I've run into a problem. Sitescooper
> retrieves nothing from the site. If I grab the current issue page with
> wget and examine it, I find that the page has carriage returns rather than
> line feeds at the end of each line.
> Is there a way around this problem?

Funny, I was thinking of doing a site file for it as well; they have
some good writing up there!

I ran into the same problem, and was a little stumped, so I left it for a
while; I later ran into a similar problem site, and worked around that,
and I would theorise the firstmonday problem is similar.

First of all, the carriage return-vs-line feed issue should not cause any
trouble; sitescooper acts like a good HTML-displaying user agent and
treats all of them as just plain whitespace.

OK -- I've taken a look... here are the details.

Basically, sitescooper uses the LWP library to fetch URLs, and it sets
a maximum size of 2 megs for downloaded objects.  LWP implements this
limit using the HTTP "Range" header, and WebSTAR, the web server that
firstmonday.org uses, interprets that header incorrectly.
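
To see the misbehaviour for yourself, you could send the same kind of
ranged request by hand.  This is only a rough sketch: range_header_for
and probe are made-up names, and it assumes the header looks like
"bytes=0-N", with N one less than the size limit.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical helper: build the Range value for a given byte limit,
# assuming the form "bytes=0-<limit - 1>".
sub range_header_for {
    my ($limit) = @_;
    my $last = $limit > 0 ? $limit - 1 : 0;
    return "bytes=0-$last";
}

# Hypothetical helper: fetch $url with an explicit Range header and
# return the status line and the number of body bytes received.
sub probe {
    my ($url, $limit) = @_;
    my $ua  = LWP::UserAgent->new;    # no max_size set here
    my $req = HTTP::Request->new(GET => $url);
    $req->header(Range => range_header_for($limit));
    my $res = $ua->request($req);
    return ($res->status_line, length($res->content));
}

# Only touches the network when a URL is given on the command line.
if (@ARGV) {
    my ($status, $bytes) = probe($ARGV[0], 2 * 1024 * 1024);
    print "$status ($bytes bytes)\n";
}
```

Comparing the result with a plain wget of the same URL should show
whether the server copes with the Range header.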

As a result, to scoop this site, you need to comment out, or delete, this line
in "lib/Sitescooper/Main.pm":

  $self->{useragent}->max_size (1024*1024*2); # 2-meg file limit

Then it works fine!
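
If you would rather not edit the module every time, one option is to
make the limit conditional.  A minimal sketch, not how Sitescooper does
it: the SCOOP_NO_MAX_SIZE environment variable and the apply_size_limit
helper are invented for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: apply the 2-meg limit unless $skip is true.
sub apply_size_limit {
    my ($ua, $skip) = @_;
    $ua->max_size(1024 * 1024 * 2) unless $skip;
    return $ua;
}

# SCOOP_NO_MAX_SIZE is an invented variable name, not a Sitescooper option.
my $ua = apply_size_limit(LWP::UserAgent->new, $ENV{SCOOP_NO_MAX_SIZE});
```

With no limit set, LWP has no reason to ask for a byte range, at the
cost of losing the protection against very large downloads.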


I've checked in this site file, BTW:


  URL: http://firstmonday.org/issues/current_issue/
    Name: First Monday
    Description: a peer-reviewed journal on the internet

    Levels: 2
    TableRender: flatten

    StoryURL: http://firstmonday.org/issues/issue\S+/\S+/(index.html|)
    StoryURL: http://firstmonday.org/issues/current_issue/\S+/(index.html|)
    ImageURL: .*/img/.*\.gif
    ImageScaleToMaxWidth: 150

    AuthorName: Dwight D. McKay and Justin Mason
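
For anyone adapting the patterns, here is a quick way to try them
outside sitescooper.  Rough sketch only: the sample URLs are made up,
and it assumes a StoryURL pattern has to match the whole URL.

```perl
use strict;
use warnings;

# The two StoryURL patterns from the site file above, copied as-is.
my @story_patterns = (
    'http://firstmonday.org/issues/issue\S+/\S+/(index.html|)',
    'http://firstmonday.org/issues/current_issue/\S+/(index.html|)',
);

# Hypothetical helper: true if $url matches any pattern in full.
sub is_story_url {
    my ($url) = @_;
    for my $pat (@story_patterns) {
        return 1 if $url =~ /^(?:$pat)$/;
    }
    return 0;
}

# Made-up URLs, for illustration only.
for my $url (
    'http://firstmonday.org/issues/issue6_3/example/index.html',
    'http://firstmonday.org/issues/current_issue/example/',
    'http://firstmonday.org/issues/current_issue/',
) {
    printf "%-62s %s\n", $url, is_story_url($url) ? 'story' : 'not a story';
}
```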


cheers,

--j.

_______________________________________________
Sitescooper-talk mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/sitescooper-talk