> -----Original Message-----
> From: Matthias Vill [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 23, 2007 7:41 AM
> To: [email protected]
> Subject: Re: -R and HTML files
>
> Barnett, Rodney schrieb:
> >
> >
> >> -----Original Message-----
> >> From: Matthias Vill [mailto:[EMAIL PROTECTED]
> >> Sent: Thursday, August 23, 2007 1:54 AM
> >> To: [email protected]
> >> Subject: Re: -R and HTML files
> >>
> >> Micah Cowan schrieb:
> >>> Josh Williams wrote:
> >>>> On 8/22/07, Micah Cowan <[EMAIL PROTECTED]> wrote:
> >>>>> What would be the appropriate behavior of -R then?
> >>>> I think the default option should be to download the
> >>>> html files to parse the links, but it should discard
> >>>> them afterwards if they do not match the acceptance list.
> >>> Heh, that _is_ the current default. But I'm not convinced that's
> >>> what the naïve user is going to expect the default to be.
> >>> Especially since the manpage doesn't mention it, and the info
> >>> page only mentions it if you dig into the details section.
> >>>
> >>> OTOH, it has a history, so choosing to change it is not a small
> >>> decision.
> >>>
> >> To me, downloading HTML files which match rejection
> >> patterns makes no sense.
> >> Of course, there is the case where you want "the whole
> >> site, but", let's say, you don't want any of the pictures
> >> because they are too big.
> >
> > In my case, there's a web site that contains a lot of text and PDF
> > files that I need to monitor so that I can process the new or
> > changed ones. The HTML pages merely reflect the directory
> > structure. I don't want them, but they have to be traversed to get
> > to the files I do want.
>
> OK, then maybe we need a special parse-and-delete filter.
> Even in your case, I believe you don't want to follow links
> to some of-pdf-tree HTML files, which will contain pictures
> and links to even more HTML files you don't need.
> Downloading & parsing all of them is quite an overhead if
> they reside in another path, I would guess.
I'm not sure what "of-pdf-tree" means, but I think the gist of
your statement is that it may be useful to control not only
which files are kept, but also which files are traversed. I
would definitely agree with that, especially given Micah's
comment about adding the ability to parse new types of files for
links. In the particular situation I mentioned, I certainly
wouldn't want wget parsing any PDFs or CSS files for more links.
I can also see that I might want to skip certain sub-directories
(i.e., want to avoid parsing certain HTML files).
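For reference, the behavior being discussed can be sketched with wget's existing recursive-accept options; the URL and directory names below are placeholders, and -np/-X are just my additions to illustrate the pruning wget already offers:

```shell
# Current default (as noted above): HTML pages are downloaded and
# parsed for links, then deleted because they don't match -A.
wget -r -l inf -np -A pdf http://example.com/docs/

# -X/--exclude-directories already prunes whole subtrees from
# traversal, which covers part of the "skip certain sub-directories"
# case (directory names are hypothetical):
wget -r -np -A pdf -X /docs/old,/docs/images http://example.com/docs/
```

The gap the thread is circling is that -R/-A only control what is *kept*, while -X is the only existing knob for what is *traversed*.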
> So maybe
> -R *outside* -C *listing-html* -A *pdfs* {-C meaning
> consider-for-links-only (and being one of the few single-chars left)}
>
> would help, and this could still be extended in a
> mime:url-pattern way, which I guess would be really useful for
> some wikis that don't append special type endings to their paths.
>
> Also, -C could default to the -R value to provide
> compatibility with previous versions, and something like -C -
> would hard-reject everything in -R.
>
> I hope that's a better suggestion now...
Looks like it accommodates my situation. Thanks.
Rodney