Incorrect 'beautification' of URL?
When requesting a URL like http://tmp.logix.cz/slash.xp , wget shortens this to http://tmp.logix.cz/slash.xp/. All browsers I tested (Opera 6b1, Mozilla 0.9.8, Konqueror 2.9.2) pass this URL as given. So the question is why wget (1.8.1) does what it does, and how to possibly switch off this behaviour.

Philipp

--
Philipp Thomas [EMAIL PROTECTED]
SuSE Linux AG, Deutscherrnstr. 15-19, D-90429 Nuremberg, Germany

HPUX and sane have never been uttered in the same sentence without
accompanying negatives. -- Richard Henderson on gcc ml
Change in behaviour between 1.7 and 1.8.1
When you issue

  wget --recursive --level=1 --reject=.html www.suse.de

wget 1.7 really omits downloading all the .html files except index.html (which is needed for --recursive), but wget 1.8.1 also downloads all .html files that are referenced from index.html and deletes them immediately. It is clear that the .html files are needed to find the next level of files when downloading recursively, but they should be omitted when the recursion depth is limited and the limit has been reached.

Philipp

--
Philipp Thomas [EMAIL PROTECTED]
SuSE Linux AG, Deutscherrnstr. 15-19, D-90429 Nuremberg, Germany
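The behaviour Philipp asks for can be sketched as a depth-limited crawl in Python. This is a hypothetical illustration of the *intended* logic, not wget's actual source: a page that matches the reject list is still fetched while it is needed to discover deeper links, but at the depth limit there is nothing left to discover, so it should simply be skipped rather than downloaded and deleted.

```python
from collections import deque

def crawl(start, links_of, max_depth, reject):
    """Breadth-first crawl sketch (hypothetical, not wget's code).

    Fetch a page only if it survives the reject filter *or* we still
    need it to discover deeper links. At max_depth nothing more can be
    discovered, so rejected pages are skipped outright -- the wget 1.7
    behaviour -- instead of downloaded and deleted (wget 1.8.1).
    """
    fetched = []
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        url, depth = queue.popleft()
        needed_for_links = depth < max_depth  # must parse it for more links?
        if reject(url) and not needed_for_links:
            continue  # skip: rejected and no longer needed for recursion
        fetched.append(url)
        if needed_for_links:
            for link in links_of(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return fetched

# With --level=1 and --reject=.html: index.html is fetched because it is
# needed for recursion, but .html pages at the depth limit are skipped.
links = {"index.html": ["a.html", "b.pdf"], "a.html": [], "b.pdf": []}
print(crawl("index.html", lambda u: links.get(u, []), 1,
            lambda u: u.endswith(".html")))  # ['index.html', 'b.pdf']
```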
Re: bOY YOUR PROGRAMME IS GOOD.
Alan Eldridge wrote:
> On Tue, Mar 05, 2002 at 05:40:22AM +, [EMAIL PROTECTED] wrote:
> --- end of quoted message ---
> See what it looks like as text? Please do not post to the mailing list
> in html. It's rude, and I, for one, will neither read nor answer a post
> in html except for a one-time note like this.

Jawohl Herr Obersturmbandfuehrer!! Ve did not know zatt ze group-members kan nott rrread da HTML. Ve suggest you accept the MicroSoft moronized vorrld. Thanks.

> If you repost as text, I'll take a look at your question.

Alright! Hi! First of all, thanks for your greatest of programmes ... a suggestion, or a hassle, you decide: I wanted to download only the *.ht* files from an ftp-server, including subdirectories, but couldn't figure out how to get wget to follow ftp-directory-listings. I tried a lot ... starting with

  wget -N -r ftp://myname:[EMAIL PROTECTED]/*.htm*

to

  wget -N -m ftp://myname:[EMAIL PROTECTED]/*.htm* --retr-symlinks -r -l 1 -x --follow-ftp

and I gave up. Maybe I'll make

> Also, please post to [EMAIL PROTECTED] Cc: changed to there.

I don't know what that means, but I happily comply. oops..

--
cheers, Norb

To prevent terrorism by dropping bombs on Iraq is such an obvious idea
that I can't think why no one has thought of it before. It's so simple.
If only the UK had done something similar in Northern Ireland, we
wouldn't be in the mess we are in today.
Terry Jones 17Feb2002 http://www.zmag.org/content/TerrorWar/JonesTerror.cfm
Re: bOY YOUR PROGRAMME IS GOOD.
On Tue, Mar 05, 2002 at 03:28:39PM +, [EMAIL PROTECTED] wrote:
> I wanted to download only the *.ht* files from an ftp-server, including
> subdirectories, but couldn't figure out how to get wget to follow
> ftp-directory-listings. I tried a lot ... starting with
>   wget -N -r ftp://myname:[EMAIL PROTECTED]/*.htm*
> to
>   wget -N -m ftp://myname:[EMAIL PROTECTED]/*.htm* --retr-symlinks -r -l 1 -x --follow-ftp

There *is* an option to turn on globbing for ftp retrieval (that is what wildcard expansion is called in Unixland), but because of the way it works, it's not going to do what you want.

There is also an accept option, which looks like -A*.ht* for your case. That won't work either, because the subdirectory names don't match, so you won't recursively descend the directory tree.

So, you've got to approach it backwards, using the reject option (-R pat[,pat...]). What you need to do is figure out what you *don't* want, and specify that as a pattern list to -R. E.g., to retrieve everything but jpg or gif files, you would use the option -R*.jpg,*.gif.

When you are using -R to reject things, you want to tell it to just retrieve everything (then the patterns given to -R become your filter). To do this, specify the URL with a trailing slash, rather than using a glob pattern. IOW, use the URL: ftp://myname:[EMAIL PROTECTED]/

Last little thing: you don't need to specify -N *and* -m. The -m option implies -N.

> > Also, please post to [EMAIL PROTECTED]
> Cc: changed to there. I don't know what that means, but I happily comply.

That's the main mailing list address.

--
Alan Eldridge
Dave's not here, man.
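The accept/reject logic Alan describes can be sketched in Python with fnmatch. This mirrors the spirit of wget's -A/-R wildcard matching only; the `accepted` helper is hypothetical, not wget's actual code:

```python
from fnmatch import fnmatch

def accepted(name, accept=None, reject=None):
    """Hypothetical sketch of wget-style -A/-R filtering on one name.

    accept/reject are comma-separated glob pattern lists,
    e.g. "*.jpg,*.gif".
    """
    if accept and not any(fnmatch(name, p) for p in accept.split(",")):
        return False  # an accept list excludes everything it doesn't match
    if reject and any(fnmatch(name, p) for p in reject.split(",")):
        return False  # a reject list excludes only what it matches
    return True

# -A "*.ht*" matches .htm/.html files but not a directory name like
# "docs" -- which is why an accept list alone stops the recursive descent:
print(accepted("index.html", accept="*.ht*"))      # True
print(accepted("docs", accept="*.ht*"))            # False
# With -R you list what you *don't* want and let everything else through:
print(accepted("logo.gif", reject="*.jpg,*.gif"))  # False
print(accepted("page.htm", reject="*.jpg,*.gif"))  # True
```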
Re: Incorrect 'beautification' of URL?
On 2002-03-05 11:41 +0100, Philipp Thomas wrote:
> When requesting a URL like http://tmp.logix.cz/slash.xp , wget shortens
> this to http://tmp.logix.cz/slash.xp/. All browsers I tested (Opera 6b1,
> Mozilla 0.9.8, Konqueror 2.9.2) pass this URL as given. So the question
> is, why wget (1.8.1) does what it does

Presumably because the author thought that both URLs are equivalent. To my surprise, RFC 1945 seems to agree with you. It says:

  URI         = ( absoluteURI | relativeURI ) [ "#" fragment ]
  absoluteURI = scheme ":" *( uchar | reserved )
  relativeURI = net_path | abs_path | rel_path
  net_path    = "//" net_loc [ abs_path ]
  abs_path    = "/" rel_path
  rel_path    = [ path ] [ ";" params ] [ "?" query ]
  path        = fsegment *( "/" segment )
  fsegment    = 1*pchar
  segment     = *pchar

Which I understand to mean that a segment can be empty, which in turn could be interpreted as stating that the trailing slashes in slash.xp are significant.

That said, setting up a web site to rely on empty path segments strikes me as a creative way of looking for problems. :-)

Why is it important to you?

--
André Majorel <URL:http://www.teaser.fr/~amajorel/>
std::disclaimer (Not speaking for my employer);
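The grammar above allows a segment to be empty, so a trailing slash is part of the path rather than decoration. A quick check with Python's urllib.parse (a modern illustration of the same point, not a tool available in this thread's era) shows the two forms really are distinct URLs:

```python
from urllib.parse import urlsplit

# The same URL from the report, with and without a trailing slash.
a = urlsplit("http://tmp.logix.cz/slash.xp")
b = urlsplit("http://tmp.logix.cz/slash.xp/")

print(a.path)            # /slash.xp
print(b.path)            # /slash.xp/
print(a.path == b.path)  # False: the trailing (empty) segment is significant
```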
Re: Change in behaviour between 1.7 and 1.8.1
Hello,

On Tue, Mar 05, 2002 at 11:48:40AM +0100, Philipp Thomas wrote:
> When you issue wget --recursive --level=1 --reject=.html www.suse.de
> wget 1.7 really omits downloading all the .html files except index.html
> (which is needed for --recursive), but wget 1.8.1 also downloads all
> .html files that are referenced from index.html and deletes them
> immediately. [...] but they should be omitted when the recursion depth
> is limited and the limit has been reached.

I have noticed the same behaviour with Wget 1.5.3 and Debian (potato).

Bye,
Guentcho