Hi RSSers and sitescoopers --

I developed Sitescooper a few years back ( http://sitescooper.org/ ),
which scrapes news sites, blogs etc. and renders them down to Palm-format
output.  I haven't been using it much myself recently -- I've been getting
more into RSS and reading updates by mail (using rss2mail), instead of
syncing them to my Palm and reading them there.

Recently, I've been running into blogs without decent RSS feeds (i.e. short
or missing descriptions or content:encoded parts).
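
To make "short or missing" concrete, here is a rough Python sketch of how
one might flag the thin items in a feed.  It is only an illustration, not
part of sitescooper (which is written in Perl): the thin_items name, the
200-character threshold and the sample feed are all made up for the
example.  It also assumes un-namespaced <item> elements, as in RSS 0.91
and 2.0, so it would need adjusting for RSS 1.0 (RDF) feeds.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the RSS "content" module, which defines content:encoded.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"


def thin_items(rss_text, min_chars=200):
    """Return titles of items that have no content:encoded element and
    whose description is missing or shorter than min_chars.

    min_chars is an arbitrary, hypothetical threshold for this sketch.
    """
    root = ET.fromstring(rss_text)
    thin = []
    for item in root.iter("item"):
        description = (item.findtext("description") or "").strip()
        encoded = (item.findtext("{%s}encoded" % CONTENT_NS) or "").strip()
        if not encoded and len(description) < min_chars:
            thin.append(item.findtext("title") or "(untitled)")
    return thin


# A made-up feed: one teaser-only item, one full-text item, one bare item.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <description>Sample feed</description>
    <item>
      <title>Teaser only</title>
      <description>Read more...</description>
    </item>
    <item>
      <title>Full text</title>
      <description>Short summary</description>
      <content:encoded><![CDATA[<p>The whole story body.</p>]]></content:encoded>
    </item>
    <item>
      <title>No description</title>
    </item>
  </channel>
</rss>
"""

if __name__ == "__main__":
    print(thin_items(SAMPLE))  # prints ['Teaser only', 'No description']
```

A feed where most items come back from a check like this is the kind I
mean by "without decent RSS".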

As a result, it occurred to me that Sitescooper could do with an RSS
output mode, which would deal with (a) getting around crappy RSS
0.91-style feeds, and (b) the sites that don't have RSS output at all
(although that's stepping on NewsIsFree's toes a little ;).  Sitescooper
is also a handy way to scrape into RSS, given that it has

  - (a) lots of site descriptions which should mostly work (although a few
    are suffering bit rot now),

  - (b) the .site file format -- a simple format for rules on how to
    scrape "stories" from news sites effectively,
    
  - (c) has good caching mechanisms, and 
  
  - (d) pretty good support for weirdness like HTTP redirects and
    authentication.

So anyway -- after some hacking, the CVS version of sitescooper now
supports scraping into RSS 2.0.  Some fruits of this can be seen at
http://sitescooper.org/rss/ .  Each .xml is accompanied by the relevant
.site file.
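
For anyone who hasn't looked at RSS 2.0 with full story text, here is a
minimal Python sketch of the general shape of such output.  It is not
sitescooper's actual code (sitescooper is Perl), and the stories_to_rss
function and the "title" / "url" / "summary" / "html" story fields are
hypothetical names chosen for the example.  The idea is simply that each
scraped story becomes an <item> carrying a short description plus the
full body in content:encoded.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the RSS "content" module, which defines content:encoded.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("content", CONTENT_NS)


def stories_to_rss(site_title, site_url, stories):
    """Build an RSS 2.0 document (as a string) from scraped stories.

    Each story is a dict with hypothetical keys: title, url, summary, html.
    """
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Scraped from " + site_url
    for story in stories:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = story["title"]
        ET.SubElement(item, "link").text = story["url"]
        ET.SubElement(item, "description").text = story["summary"]
        # Full story markup; ElementTree escapes it, so readers see HTML.
        ET.SubElement(item, "{%s}encoded" % CONTENT_NS).text = story["html"]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    feed = stories_to_rss(
        "Example News",
        "http://example.com/",
        [{
            "title": "First story",
            "url": "http://example.com/1.html",
            "summary": "A one-line teaser.",
            "html": "<p>The full text of the first story.</p>",
        }],
    )
    print(feed)
```

The real output at the URL above is the thing to look at if you want to
see what sitescooper itself produces.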

I don't think it's quite ready for a release just yet, but I thought I'd
let you all know about it and get some feedback ;)

--j.


_______________________________________________
Sitescooper-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/sitescooper-talk
