Hi David -- the linux/slashdot.site would be a good guide, it does pretty much all of those things! I definitely recommend that you take a look.

> I need to remove 7, 8, and 12 from the output. How
> can I do this? Is there a way to 'foreach' through the articles,
> discarding/including the pieces I want?

Not really :( What we've done is find 2 patterns in the HTML, e.g. let's
imagine comments stating "start of section block" and "end of section
block", and use StoryHTMLPreProcess: (or ContentsHTMLPreProcess:) to remove
everything between them, using s/start.*?end//gs regexps. Works well
enough...

> Secondly, is there a way to restrict (ala exclusionlist.txt in Plucker)
> the urls which should not be traversed, by name?

StorySkipURL: is the pattern you want. Multiple ones can be specified.

For really complex cases, you can actually *rewrite* the URLs you find into
the ones you want to scrape. slashdot.site does this dramatically, by
removing the default comment settings from any story URLs we follow and
inserting our own.

> Once that part is done, I need to pre-process the HTML (can we have a
> -preprocess=$script option, so I can pass it through a script and change
> things around in the stream?) and begin formatting the output a bit
> cleaner than the original site has.

StoryHTMLPreProcess: or ContentsHTMLPreProcess: will do this. (IMO it's
best to keep that logic within the site file, so it doesn't have outside
dependencies...)

So for example, see this site file below. I'm imagining there are helpful
comments in the HTML, but if there aren't, you're just going to have to pick
out bits of distinctive markup. (but then that's HTML scraping for ya after
all ;)

URL: http://site.com/
Levels: 2 #??
ContentsStart: -- end of right-justified nav menu --
ContentsEnd: -- start of big fill-in form --
ContentsHTMLPreProcess: {
  # trim out 7 and 12
  s/<!-- start of section block.*?end of section block -->//gs;
  s/<!-- start of reply-to block.*?end of reply-to block -->//gs;
  # reformat crappy HTML here, too, if you want. e.g. "slashdot.site"
  # uses this to remove "The Fine Print" about comments etc.
}
StoryURL: http://site.com/stories/.*
StorySkipURL: http://site.com/person/.*

_______________________________________________
Sitescooper-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/sitescooper-talk
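
If you want to try the s/start.*?end//gs idea on a saved page before putting it in a site file, here is a rough standalone sketch in Python (not sitescooper code, just the same regexp technique). The sample HTML, the marker names and the strip_blocks() helper are all made up for illustration; it assumes the page really does contain comment markers like the ones imagined above. re.DOTALL plays the role of Perl's /s flag, and the non-greedy .*? keeps each match from swallowing the text between two separate blocks.

```python
import re

# Hypothetical sample page. The comment markers mirror the ones imagined
# in the site file; a real site will have its own distinctive markup.
SAMPLE_HTML = """<p>keep A</p>
<!-- start of section block -->
<p>drop 7</p>
<!-- end of section block -->
<p>keep B</p>
<!-- start of section block -->
<p>drop 8</p>
<!-- end of section block -->
<p>keep C</p>
<!-- start of reply-to block -->
<p>drop 12</p>
<!-- end of reply-to block -->
<p>keep D</p>
"""

# (start marker, end marker) pairs, one per kind of block to remove.
BLOCK_MARKERS = [
    ("start of section block", "end of section block"),
    ("start of reply-to block", "end of reply-to block"),
]


def strip_blocks(html, markers=BLOCK_MARKERS):
    """Remove every '<!-- start ... end -->' span for each marker pair."""
    for start, end in markers:
        # Equivalent in spirit to: s/<!-- START.*?END -->//gs
        pattern = re.compile(
            "<!-- " + re.escape(start) + ".*?" + re.escape(end) + " -->",
            re.DOTALL,  # let '.' match newlines, like Perl's /s
        )
        html = pattern.sub("", html)  # sub() replaces all matches, like /g
    return html


if __name__ == "__main__":
    print(strip_blocks(SAMPLE_HTML))
```

With a greedy .* instead of .*?, the first "section block" match would run all the way to the last end marker and take "keep B" with it, which is why the non-greedy form matters when a block repeats on the page.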
