Incorrect 'beautification' of URL?

2002-03-05 Thread Philipp Thomas

When requesting a URL like http://tmp.logix.cz/slash.xp , wget rewrites
this to http://tmp.logix.cz/slash.xp/. All browsers I tested (Opera 6b1,
Mozilla 0.9.8, Konqueror 2.9.2) pass this URL on as given.

So the question is why wget (1.8.1) does what it does, and whether this
behaviour can be switched off.
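
For what it's worth, the rewriting can be observed in wget's debug output,
which prints the request that is actually sent (a quick way to confirm the
behaviour; the exact debug format differs between wget versions):

 wget --debug http://tmp.logix.cz/slash.xp

The request line in that output shows the path wget really asked for, which
can be compared against the URL as typed.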

Philipp

-- 
Philipp Thomas [EMAIL PROTECTED]
SuSE Linux AG, Deutscherrnstr. 15-19, D-90429 Nuremberg, Germany

HPUX and sane have never been uttered in the same sentence
without accompanying negatives.
-- Richard Henderson on gcc ml



Change in behaviour between 1.7 and 1.8.1

2002-03-05 Thread Philipp Thomas

When you issue

 wget --recursive --level=1 --reject=.html www.suse.de

wget 1.7 really omits downloading all the .html files except index.html
(which is needed for --recursive), but wget 1.8.1 also downloads all .html
files that are referenced from index.html and deletes them immediately.

It is clear that the .html files are needed to find the next level of files
when downloading recursively, but they should be omitted when the recursion
depth is limited and the limit has been reached.
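
To make the difference concrete, here is the same invocation with the
behaviour each version was observed to show (a summary of the report above,
not a claim about other versions):

 # wget 1.7: .html files other than index.html are never requested.
 # wget 1.8.1: they are downloaded first and deleted afterwards, which
 # wastes bandwidth once the --level limit has been reached.
 wget --recursive --level=1 --reject=.html www.suse.de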

Philipp

-- 
Philipp Thomas [EMAIL PROTECTED]
SuSE Linux AG, Deutscherrnstr. 15-19, D-90429 Nuremberg, Germany



Re: bOY YOUR PROGRAMME IS GOOD.

2002-03-05 Thread n


Alan Eldridge wrote:

 On Tue, Mar 05, 2002 at 05:40:22AM +, [EMAIL PROTECTED] wrote:

 --- end of quoted message ---

 See what it looks like as text? Please do not post to the mailing list
 in html. It's rude, and I, for one, will neither read nor answer a post
 in html except for a one-time note like this.

Jawohl Herr Obersturmbandfuehrer!!

Ve did not know zatt ze group-members kan nott rrread da HTML.  
Ve suggest you accept the MicroSoft moronized vorrld.  

 Thanks. If you repost as text, I'll take a look at your question.

Alright!

Hi!

First of all, thanks for your greatest of programmes

... a suggestion, or a hassle, you decide:

I wanted to download only the *.ht* files from an ftp-server, including 
subdirectories, but couldn't figure out how to get wget to follow
ftp-directory-listings.

I tried a lot ... starting with

 wget -N -r ftp://myname:[EMAIL PROTECTED]/*.htm*

to

 wget -N -m ftp://myname:[EMAIL PROTECTED]/*.htm* --retr-symlinks -r -l 1 -x --follow-ftp

and I gave up.  Maybe I'll make


 Also, please post to [EMAIL PROTECTED] Cc: changed to there.

I don't know what that means, but I happily comply.

oops..  

--
cheers,
Norb

To prevent terrorism by dropping bombs on Iraq is such an obvious idea that I
can't think why no one has thought of it before. It's so simple. If only the
UK had done something similar in Northern Ireland, we wouldn't be in the mess
we are in today.
-- Terry Jones, 17 Feb 2002

http://www.zmag.org/content/TerrorWar/JonesTerror.cfm



Re: bOY YOUR PROGRAMME IS GOOD.

2002-03-05 Thread Alan Eldridge

On Tue, Mar 05, 2002 at 03:28:39PM +, [EMAIL PROTECTED] wrote:

I wanted to download only the *.ht* files from an ftp-server, including 
subdirectories, but couldn't figure out how to get wget to follow
ftp-directory-listings.

I tried a lot ... starting with

 wget -N -r ftp://myname:[EMAIL PROTECTED]/*.htm*

to

 wget -N -m ftp://myname:[EMAIL PROTECTED]/*.htm* --retr-symlinks -r -l 1 -x --follow-ftp


There *is* an option to turn on globbing for ftp retrieval (that is
what wildcard-expansion is called in Unixland), but because of the way
it works, it's not going to do what you want.
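
For reference, globbing means putting the wildcard into the FTP URL itself
and letting wget expand it against the server's directory listing; quoting
the URL keeps the local shell from expanding it first. A sketch, with
placeholder credentials:

 wget 'ftp://myname:password@host/*.htm*'

This matches files in that one directory only, which is why it will not
help with subdirectories.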

There is also an accept option, which looks like -A '*.ht*' for your
case. That won't work either, because the subdirectory names don't
match, so you won't recursively descend the directory tree.
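
Spelled out, that accept form would be something like the following sketch
(placeholder credentials again); it is shown only to illustrate the problem
just described:

 # The subdirectory names never match '*.ht*', so recursion stops at the
 # top level and the files below are never seen.
 wget -r -N -A '*.ht*' 'ftp://myname:password@host/'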

So, you've got to approach it backwards, using the reject option (-R
pat[,pat...]). What you need to do is figure out what you *don't* want, and
specify that as a pattern list to -R.

E.g., to retrieve everything but jpg or gif files, you would use the option:

 -R '*.jpg,*.gif'

When you are using -R to reject things, you want to tell it to just
retrieve everything (then the patterns given to -R become your
filter). To do this, specify the URL with a trailing slash, rather
than using a glob pattern. IOW, use the URL:

ftp://myname:[EMAIL PROTECTED]/

Last little thing: you don't need to specify -N *and* -m. The -m
option implies -N.
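
Putting the pieces together, a combined invocation would look something like
the sketch below (placeholder credentials; the reject list has to be adapted
to whatever the server actually holds besides the .ht* files):

 # Mirror the whole tree, filtering out unwanted patterns as they arrive.
 wget -m --retr-symlinks -R '*.jpg,*.gif' 'ftp://myname:password@host/'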

 Also, please post to [EMAIL PROTECTED] Cc: changed to there.

I don't know what that means, but I happily comply.

That's the main mailing list address.

-- 
Alan Eldridge
Dave's not here, man.



Re: Incorrect 'beautification' of URL?

2002-03-05 Thread Andre Majorel

On 2002-03-05 11:41 +0100, Philipp Thomas wrote:

 When requesting a URL like http://tmp.logix.cz/slash.xp , wget rewrites
 this to http://tmp.logix.cz/slash.xp/. All browsers I tested (Opera 6b1,
 Mozilla 0.9.8, Konqueror 2.9.2) pass this URL on as given.
 
 So the question is, why wget (1.8.1) does what it does

Presumably because the author thought that both URLs are
equivalent. To my surprise, RFC 1945 seems to agree with you. It
says:

   URI            = ( absoluteURI | relativeURI ) [ "#" fragment ]

   absoluteURI    = scheme ":" *( uchar | reserved )

   relativeURI    = net_path | abs_path | rel_path

   net_path       = "//" net_loc [ abs_path ]
   abs_path       = "/" rel_path
   rel_path       = [ path ] [ ";" params ] [ "?" query ]

   path           = fsegment *( "/" segment )
   fsegment       = 1*pchar
   segment        = *pchar

Which I understand to mean that a segment can be empty, which
in turn could be interpreted as saying that the trailing slashes
in the slash.xp URL are significant.
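
One way to check what the server really does with each variant, without
wget's canonicalization getting in the way, is to issue the requests by
hand (a sketch; it assumes netcat is installed and the server answers
plain HTTP/1.0):

 printf 'GET /slash.xp HTTP/1.0\r\n\r\n' | nc tmp.logix.cz 80
 printf 'GET /slash.xp/ HTTP/1.0\r\n\r\n' | nc tmp.logix.cz 80

If the two responses differ, the trailing slash really does name a distinct
resource and the rewriting loses information.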

That said, setting up a web site to rely on empty path segments
strikes me as a creative way of looking for problems. :-) Why is
it important to you?

-- 
André Majorel URL:http://www.teaser.fr/~amajorel/
std::disclaimer (Not speaking for my employer);



Re: Change in behaviour between 1.7 and 1.8.1

2002-03-05 Thread Guentcho Skordev

Hello,

On Tue, Mar 05, 2002 at 11:48:40AM +0100, Philipp Thomas wrote:
 When you issue
  wget --recursive --level=1 --reject=.html www.suse.de
 wget 1.7 really omits downloading all the .html files except index.html
 (which is needed for --recursive), but wget 1.8.1 also downloads all .html
 files that are referenced from index.html and deletes them immediately.
[...]
 but they should be omitted when the recursion depth is limited and the
 limit has been reached.

I have noticed the same behaviour with wget 1.5.3 and Debian (potato).

Bye
Guentcho