Barnett, Rodney wrote:
>
>
>> -----Original Message-----
>> From: Matthias Vill [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, August 23, 2007 1:54 AM
>> To: [email protected]
>> Subject: Re: -R and HTML files
>>
>> Micah Cowan wrote:
>>> Josh Williams wrote:
>>>> On 8/22/07, Micah Cowan <[EMAIL PROTECTED]> wrote:
>>>>> What would be the appropriate behavior of -R then?
>>>> I think the default option should be to download the html files to
>>>> parse the links, but it should discard them afterwards if
>>>> they do not match the acceptance list.
>>> Heh, that _is_ the current default. But I'm not convinced
>>> that's what the naïve user is going to expect the default
>>> to be. Especially since the manpage doesn't mention it, and
>>> the info page only mentions it if you dig into the details
>>> section.
>>>
>>> OTOH, it has a history, so choosing to change it is not a
>>> small decision.
>>>
>> To me, downloading HTML files that match rejection patterns
>> makes no sense.
>> Of course, there is the case where you want "the whole
>> site, but", let's say, you don't want any of the pictures
>> because they are too big.
>
> In my case, there's a web site that contains a lot of text and
> PDF files that I need to monitor so that I can process the new
> or changed ones. The HTML pages merely reflect the directory
> structure. I don't want them, but they have to be traversed to
> get to the files I do want.
OK, then maybe we need a special parse-and-delete filter.
Even in your case, I believe you don't want to follow links to
off-PDF-tree HTML files, which will contain pictures and links to even
more HTML files you don't need. Downloading and parsing all of them is
quite an overhead if they reside under another path, I would guess.
So maybe something like
-R *outside* -C *listing-html* -A *pdfs*
(-C meaning consider-for-links-only, and being one of the few
single-character options left)
would help. This could still be extended to a mime:url-pattern form,
which I guess would be really useful for some wikis that don't append
type-specific extensions to their paths.
Also, -C could default to the -R value to stay compatible with
previous versions, and something like "-C -" would hard-reject everything in -R.
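
To make the intended semantics a bit more concrete, here is a rough
sketch in Python of how the three pattern lists could interact. None of
this is existing wget code: the function name classify, the three result
strings, and the precedence (consider before reject before accept) are
my assumptions for illustration only, and unlike real wget the sketch
does not check whether a file is actually HTML.

```python
from fnmatch import fnmatchcase


def classify(url, reject=(), consider=None, accept=()):
    """Hypothetical decision for one URL under the proposed -R/-C/-A options.

    Returns 'parse-and-delete', 'skip', or 'keep'.
    """
    if consider is None:
        # Proposed default: -C falls back to the -R patterns, so rejected
        # pages are still fetched for their links and deleted afterwards.
        consider = reject
    elif consider == "-":
        # Proposed "-C -": nothing is fetched for links only, so everything
        # matching -R is hard-rejected.
        consider = ()

    def matches(patterns):
        return any(fnmatchcase(url, p) for p in patterns)

    if matches(consider):      # assumed precedence: -C wins over -R
        return "parse-and-delete"
    if matches(reject):
        return "skip"
    if accept and not matches(accept):
        return "skip"
    return "keep"


rules = dict(reject=["*outside*"], consider=["*listing-html*"], accept=["*pdfs*"])
print(classify("http://example.com/listing-html/index.html", **rules))
print(classify("http://example.com/outside/page.html", **rules))
print(classify("http://example.com/pdfs/report.pdf", **rules))
```

With the patterns from the example line, the listing page is parsed and
then deleted, the page under "outside" is never fetched, and the PDF is
kept. Whether -C should really win over -R when both match is open for
discussion.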
I hope that's a better suggestion now...
Matthias