I have a need to use Sitescooper to slice and dice a fairly complicated
site (to be revealed later), extracting only the core content in the
middle. The layout of the site looks like the gimped-up image attached
to this message. In the image (each <td> is numbered), you'll see that
sections 9, 10, 11, and 6 are the ones I need, and this pattern occurs
in each article. Right now I have the article chunk chopped out (that
was fairly easy), but now I need to remove 7, 8, and 12 from the output.
How can I do this? Is there a way to 'foreach' through the articles,
discarding/including the pieces I want? The articles in the attached PNG
repeat over and over, and their content varies per hour/per day. The
bright red you see in the image is the background; I hope it helps
visually separate which parts are articles/etc.
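Roughly, the kind of thing I have in mind is the sketch below (entirely
outside Sitescooper, and the article/cell markers are placeholders taken
from my mockup; the real site obviously has its own markup):

    #!/usr/bin/env python
    # Rough sketch of the 'foreach over articles' idea. The assumption that
    # each article is its own <table>, and the cell numbering, both come
    # from my mockup (Julienne.png), not from the real site.
    import re
    import sys

    KEEP = {6, 9, 10, 11}   # cell positions I want to keep per article

    html = sys.stdin.read()

    # Treat each <table>...</table> block as one article (an assumption).
    for article in re.findall(r'<table\b.*?</table>', html, re.S | re.I):
        cells = re.findall(r'<td\b.*?</td>', article, re.S | re.I)
        for i, cell in enumerate(cells, start=1):
            if i in KEEP:    # discard 7, 8, 12 and anything else unwanted
                sys.stdout.write(cell + "\n")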
Secondly, is there a way to restrict, by name, which URLs should not be
traversed (a la exclusionlist.txt in Plucker)? On this site, when I use
Levels: 2 to get the replies to the articles, it goes *REALLY* deep and
pulls the actual profile page of the replying party (their "homepage" in
the forum). I only want the replies, not the replying party's page (that
page is linked from 11 in the image, while 10 contains the link to the
replies, e.g. '7 Replies to this Article'). I hope that made sense. So
I'd like to traverse to a depth of 2 from the main page, which follows
into 10, but ignore the depth of 2 for items linked from 11.
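What I'm picturing is a Plucker-style exclusion list: a file of regexes,
where any link matching a pattern is simply never followed, no matter
the depth. A toy version of that check (the file name, format, and
patterns here are my own invention, not anything Sitescooper reads
today):

    #!/usr/bin/env python
    # Toy version of the exclusion-list idea. File name, format, and the
    # example pattern below are my own invention, not a real feature.
    import re

    def load_exclusions(path="exclusionlist.txt"):
        # One regex per line; blank lines and #-comments are ignored.
        patterns = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    patterns.append(re.compile(line))
        return patterns

    def should_follow(url, exclusions):
        # Follow a link only if no exclusion pattern matches it.
        return not any(p.search(url) for p in exclusions)

An exclusionlist.txt containing a line like 'profile\.php' (or whatever
the forum actually uses for its member pages) would keep the spider out
of the repliers' homepages while still following the '7 Replies to this
Article' links from 10.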
I also notice that -parallel does not work in 3.1.2-1, which ships with
Debian unstable, nor in the snapshot from the website as of today
(sitescooper-full). I'll keep poking around in the site_samples
directory to see if I can figure out how to do this.
Once that part is done, I need to pre-process the HTML (can we have a
-preprocess=$script option, so I can pass each page through a script and
change things around in the stream?) and begin formatting the output a
bit more cleanly than the original site does.
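By "pass it through a script" I mean a plain stdin-to-stdout filter that
-preprocess=$script would pipe each fetched page through before
Sitescooper does its own slicing. Something along these lines (the
substitutions are made up, just to show the shape of it):

    #!/usr/bin/env python
    # The sort of stdin-to-stdout filter I'd want -preprocess=$script to
    # run. The substitutions are invented examples, not the real cleanup.
    import re
    import sys

    html = sys.stdin.read()

    # Strip the bright-red table background so it doesn't leak through.
    html = re.sub(r'\s+bgcolor="#?ff0000"', '', html, flags=re.I)

    # Drop presentational <font> tags and the like before reformatting.
    html = re.sub(r'</?font\b[^>]*>', '', html, flags=re.I)

    sys.stdout.write(html)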
I'll keep fumbling with it. Attacking a huge, complicated problem like
this is probably not the best way to teach myself Sitescooper, but it
gives me some good experience with the innards of Sitescooper.
Thanks.
/d
Attachment: Julienne.png (PNG image)
