how to mirror just a portion of a website ?

2003-11-16 Thread Josh Brooks

Generally, I mirror an entire web site with:

wget --tries=inf -nH --no-parent --random-wait -r -l inf --convert-links
--html-extension www.example.com

But, that is if I am mirroring an _entire_ web site - where the URL looks
like:

www.example.com

BUT, how can I mirror a URL that looks like:

http://www.example.com/~user/dir/

and get everything starting with ~user/dir/ and everything underneath it,
but nothing above it - for instance, if there was a link back to
~user/otherdir/ I would not want to get that.

So basically, I want to mirror ~user/dir/ and below, and follow nothing
else - how can I do that ?
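Putting the pieces together, presumably something like this is the starting
point - the key being that --no-parent keys off the starting directory, so
links back up to ~user/ (including ~user/otherdir/) should be refused:

```shell
# Hypothetical invocation (example.com URL is a placeholder):
# --no-parent   : never ascend above the starting directory
# trailing "/"  : without it wget treats "dir" as a file and
#                 --no-parent would key off ~user/ instead
wget --tries=inf -nH --no-parent --random-wait -r -l inf \
     --convert-links --html-extension \
     http://www.example.com/~user/dir/
```

Note the trailing slash on the URL - it matters for where --no-parent draws
the line.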

thanks.



Major, and seemingly random problems with wget 1.8.2

2003-10-07 Thread Josh Brooks

Hello,

I have noticed very unpredictable behavior from wget 1.8.2 - specifically
I have noticed two things:

a) sometimes it does not follow all of the links it should

b) sometimes wget will follow links to other sites and URLs - when the
command line used should not allow it to do that.


Here are the details.


First, sometimes when you attempt to download a site with -k -m
(--convert-links and --mirror) wget will not follow all of the links and
will skip some of the files!

I have no idea why it does this with some sites and doesn't do it with
other sites.  Here is an example that I have reproduced on several systems
- all with 1.8.2:

# wget -k -m http://www.zorg.org/vsound/
--17:09:32--  http://www.zorg.org/vsound/
   => `www.zorg.org/vsound/index.html'
Resolving www.zorg.org... done.
Connecting to www.zorg.org[213.232.100.31]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

[ <=>                                ] 12,235        53.82K/s

Last-modified header missing -- time-stamps turned off.
17:09:32 (53.82 KB/s) - `www.zorg.org/vsound/index.html' saved [12235]


FINISHED --17:09:32--
Downloaded: 12,235 bytes in 1 files
Converting www.zorg.org/vsound/index.html... 2-6
Converted 1 files in 0.03 seconds.


What is the problem here ?  When I run the exact same command line with
wget 1.6, I get this:


# wget -k -m http://www.zorg.org/vsound/
--11:10:06--  http://www.zorg.org/vsound/
   => `www.zorg.org/vsound/index.html'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

0K - .. .

Last-modified header missing -- time-stamps turned off.
11:10:07 (71.12 KB/s) - `www.zorg.org/vsound/index.html' saved [12235]

Loading robots.txt; please ignore errors.
--11:10:07--  http://www.zorg.org/robots.txt
   => `www.zorg.org/robots.txt'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 404 Not Found
11:10:07 ERROR 404: Not Found.

--11:10:07--  http://www.zorg.org/vsound/vsound.jpg
   => `www.zorg.org/vsound/vsound.jpg'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 27,629 [image/jpeg]

0K - .. .. ..   [100%]

11:10:08 (51.49 KB/s) - `www.zorg.org/vsound/vsound.jpg' saved
[27629/27629]

--11:10:09--  http://www.zorg.org/vsound/vsound-0.2.tar.gz
   => `www.zorg.org/vsound/vsound-0.2.tar.gz'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 108,987 [application/x-tar]

0K - .. .. .. .. .. [ 46%]
   50K - .. .. .. .. .. [ 93%]
  100K - .. [100%]

11:10:12 (46.60 KB/s) - `www.zorg.org/vsound/vsound-0.2.tar.gz' saved
[108987/108987]

--11:10:12--  http://www.zorg.org/vsound/vsound-0.5.tar.gz
   => `www.zorg.org/vsound/vsound-0.5.tar.gz'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 116,904 [application/x-tar]

0K - .. .. .. .. .. [ 43%]
   50K - .. .. .. .. .. [ 87%]
  100K - .. [100%]

11:10:14 (60.44 KB/s) - `www.zorg.org/vsound/vsound-0.5.tar.gz' saved
[116904/116904]

--11:10:14--  http://www.zorg.org/vsound/vsound
   => `www.zorg.org/vsound/vsound'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 3,365 [text/plain]

0K - ...[100%]

11:10:14 (3.21 MB/s) - `www.zorg.org/vsound/vsound' saved [3365/3365]

Converting www.zorg.org/vsound/index.html... done.

FINISHED --11:10:14--
Downloaded: 269,120 bytes in 5 files
Converting www.zorg.org/vsound/index.html... done.


See ?  It gets the links inside of index.html, mirrors those links, and
converts them - just like it should.  Why does 1.8.2 have a problem with
this site ?  Other sites are handled just fine by 1.8.2 with the same
command line... it makes no sense that wget 1.8.2 has problems with
particular web sites.

This is incorrect behavior - and if you try the same URL with 1.8.2 you
can reproduce the same results.




The second problem, and I cannot currently give you an example to try
yourself but _it does happen_, is if you use this command line:

wget --tries=inf -nH --no-parent
--directory-prefix=/usr/data/www.explodingdog.com --random-wait -r -l inf
--convert-links --html-extension --user-agent="Mozilla/4.0 (compatible;
MSIE 6.0; AOL 7.0; Windows NT 5.1)" www.example.com

At first it will act normally, just going over the site in question, but
sometimes, you will come back to the terminal and see it grabbing all
sorts of pages from totally different sites (!)  I have seen this happen

Re: Major, and seemingly random problems with wget 1.8.2

2003-10-07 Thread Josh Brooks

Thank you for the great response.  It is much appreciated - see below...

On Tue, 7 Oct 2003, Hrvoje Niksic wrote:

 www.zorg.org/vsound/ contains this markup:

 <META NAME="ROBOTS" CONTENT="NOFOLLOW">

 That explicitly tells robots, such as Wget, not to follow the links in
 the page.  Wget respects this and does not follow the links.  You can
 tell Wget to ignore the robot directives.  For me, this works as
 expected:

 wget -km -e robots=off http://www.zorg.org/vsound/

Perfect - thank you.
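(For anyone wanting this permanently, the -e flag executes a .wgetrc-style
command, so the same setting should also work from the config file - a
sketch, assuming a per-user ~/.wgetrc:)

```shell
# Make "robots = off" the default for every wget run,
# instead of passing -e robots=off each time.
echo 'robots = off' >> ~/.wgetrc
```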


  At first it will act normally, just going over the site in question, but
  sometimes, you will come back to the terminal and see it grabbing all
  sorts of pages from totally different sites (!)

 The only way I've seen it happen is when it follows a redirection to a
 different site.  The redirection is followed because it's considered
 to be part of the same download.  However, further links on the
 redirected site are not (supposed to be) followed.

Ok, is there a way to tell wget not to follow redirects, so it will never
do that at all ?  Basically I am looking for a way to tell wget "don't
ever get anything with a different FQDN than what I started you with."
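Short of a dedicated no-redirect switch (1.8.2 does not appear to have
one), the closest thing seems to be the domain acceptance list - a sketch,
with example.com standing in for the real site:

```shell
# -D/--domains is an acceptance list of hosts to follow.  Links to any
# other FQDN are rejected; whether 1.8.2 also applies this check to
# server redirects is the open question.
wget -r -l inf -nH --no-parent --convert-links \
     --domains=www.example.com http://www.example.com/
```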

thanks.



subscribe wget

2003-10-05 Thread Josh Brooks