Re: [Bug-wget] Check external reference, but don't process further

2018-11-27 Thread Darshit Shah
Hi Fernando,

As far as I'm aware, there is no way to limit the recursion depth only on
foreign hosts. Something like this would definitely be a lot easier to do using
Wget2, which offers a few more powerful tools than Wget does. Wget2's alpha is
currently available in the Debian repositories and Arch Linux's AUR.

If you'd still like to continue using Wget, one way to pull this off would be
to have Wget print its debug output and then parse that to extract all the URIs
on foreign hosts. You can then have a second invocation of Wget test for
their existence. An example of doing this would be:

$ wget -r --spider -d example.com 2>&1 \
    | grep -B1 "This is not the same hostname as the parent's" \
    | grep "Deciding whether to enqueue" \
    | sed 's/.*\"\(.*\)\"\./\1/g' \
    | wget --spider -i -

(Wget writes its debug output to stderr, hence the 2>&1 before the first pipe.)

Of course, you may want to modify this to meet your own needs, but the general
idea should work for you.
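
The extraction step of that pipeline can also be kept as a small shell
function, so it can be tried against a saved debug log before pointing a second
Wget at the results. This is only a sketch: extract_foreign_urls is a
hypothetical helper name, and it assumes the debug lines look exactly like the
two strings the greps above search for, which may differ between Wget versions.

```shell
#!/bin/sh
# Sketch: read a Wget debug log on stdin and print the URLs that Wget
# declined to follow because they live on a different host.
# extract_foreign_urls is a hypothetical name, not part of Wget.
extract_foreign_urls() {
    # Keep each "not the same hostname" line plus the line just before it,
    # then keep only the preceding "Deciding whether to enqueue" line,
    # strip everything but the quoted URL, and drop duplicates.
    grep -B1 "This is not the same hostname as the parent's" \
        | grep "Deciding whether to enqueue" \
        | sed 's/.*"\(.*\)"\./\1/' \
        | sort -u
}
```

It could then be used roughly like this (example.com is a placeholder):

$ wget -r --spider -d example.com 2>&1 | extract_foreign_urls | wget --spider -i -

The sort -u is an addition to the original pipeline, so that a foreign URL
linked from many pages is only checked once.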

* Fernando Gont  [181127 13:08]:
> Folks,
> 
> I'm using wget in a script to check for broken links in a web site,
> which uses the "--spider" mode.
> 
> I'd like wget to operate in recursive mode for pages in the target
> domain, but not for pages in other hosts/sites.
> 
> That is, if I'm crawling www.example.com, I'd like wget to process all
> pages in that domain recursively. However, if there's a link to an
> external site, I just want wget to check that URL, but not process that
> external reference recursively.
> 
> "-D" would seem to prevent checking external references, so I cannot use
> it. And "--level" would mean that pages on external sites may still be
> processed recursively.
> 
> Any advice on how to implement this?
> 
> Thanks!
> 
> Cheers,
> Fernando
> 
> 
> 
> 
> -- 
> Fernando Gont
> SI6 Networks
> e-mail: fg...@si6networks.com
> PGP Fingerprint:  31C6 D484 63B2 8FB1 E3C4 AE25 0D55 1D4E 7492

-- 
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6




[Bug-wget] Check external reference, but don't process further

2018-11-27 Thread Fernando Gont
Folks,

I'm using wget in a script to check for broken links in a web site,
which uses the "--spider" mode.

I'd like wget to operate in recursive mode for pages in the target
domain, but not for pages in other hosts/sites.

That is, if I'm crawling www.example.com, I'd like wget to process all
pages in that domain recursively. However, if there's a link to an
external site, I just want wget to check that URL, but not process that
external reference recursively.

"-D" would seem to prevent checking external references, so I cannot use
it. And "--level" would mean that pages on external sites may still be
processed recursively.

Any advice on how to implement this?

Thanks!

Cheers,
Fernando




-- 
Fernando Gont
SI6 Networks
e-mail: fg...@si6networks.com
PGP Fingerprint:  31C6 D484 63B2 8FB1 E3C4 AE25 0D55 1D4E 7492