Re: Escaping semicolons
Phil Endecott wrote:
> There is not much to go on in terms of specifications. The closest is
> RFC1738, which includes BNF for a file: URI. However it is ten years
> old, so whether it reflects current practice I do not know. But it
> does not allow ; in file: URIs. I conclude from this that wget should
> be replacing ; with its %3b escape sequence.

I think you're confusing what wget is required to do with URLs entered on the command line and what it chooses to do with the resulting files that it saves. If an unencoded name of a retrieved resource cannot be stored on the local file system, wget encodes it to create a valid name.

> Tony Lewis wrote:
> > I use semicolons in CGI URIs to separate parameters. (Ampersand is
> > more often used for this, but semicolon is also allowed and has the
> > advantage that there is no need to escape it in HTML.)
>
> There is no need to escape ampersands either. Tony, are you
> suggesting that this is legal HTML?
>
> <a href="http://foo.foo/foo.cgi?p1=v1&p2=v2">Foo</a>
>
> I'm fairly confident that you need to escape the & to make it valid,
> i.e.
>
> <a href="http://foo.foo/foo.cgi?p1=v1&amp;p2=v2">Foo</a>

Just out of curiosity, did you try to implement your theory and see what happens? If you did, you would see that the first version works and the second does not.

By the way, the correct URI encoding of ampersand is %26, not &amp;. The latter encoding is used for ampersands in HTML markup.

With regard to whether ampersand needs to be encoded, you're misreading the RFC:

> Many URL schemes reserve certain characters for a special meaning:
> their appearance in the scheme-specific part of the URL has a
> designated semantics. If the character corresponding to an octet is
> reserved in a scheme, the octet must be encoded. The characters ";",
> "/", "?", ":", "@", "=" and "&" are the characters which may be
> reserved for special meaning within a scheme. No other characters may
> be reserved within a scheme.
>
> Usually a URL has the same interpretation when an octet is
> represented by a character and when it is encoded. However, this is
> not true for reserved characters: encoding a character reserved for a
> particular scheme may change the semantics of a URL.

The RFC says that you have to escape reserved characters if the character appears in the name of the resource you're trying to retrieve. That is, if you're trying to retrieve a file named a&b.txt, you refer to that file as a%26b.txt in the URL because you're using the ampersand for a non-reserved purpose. If you're using a reserved character for the purpose for which it has been reserved (in this case, separating parameters), you do NOT want to encode it.

The URL you proposed (after correcting the encoding of the ampersand) is requesting a resource (probably a file) whose name is foo.cgi?p1=v1&p2=v2. It is NOT requesting that the script foo.cgi be executed with argument p1 having a value of v1 and p2 having a value of v2.

Hope that helps.

Tony
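Tony's distinction between a reserved ampersand and a literal one can be illustrated with Python's urllib.parse. This is a present-day sketch of the point, not something from the original exchange:

```python
from urllib.parse import parse_qs, quote

# '&' used for its reserved purpose (separating parameters) stays
# unencoded, and the query parses as two parameters
separator = parse_qs("p1=v1&p2=v2")

# percent-encoding the same octet changes the semantics: there is now
# a single parameter p1 whose value contains a literal '&'
literal = parse_qs("p1=v1%26p2=v2")

# a literal ampersand in a resource name (the a&b.txt example above)
# is written %26 in the URL
encoded_name = quote("a&b.txt")
```

Here parse_qs and quote are just stand-ins for what a server-side CGI parser and a URL-producing client would do with the same octets.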
Re: Escaping semicolons
Dear All,

There are now two threads here so I'm splitting it into two messages.

Phil> I conclude from this that wget should be replacing ; with
Phil> its %3b escape sequence.

Tony> I think you're confusing what wget is required to do with
Tony> URLs entered on the command line and what it chooses to do
Tony> with the resulting files that it saves. If an unencoded name
Tony> of a retrieved resource cannot be stored on the local file
Tony> system, wget encodes it to create a valid name.

I'm using wget in recursive mode with -k to modify links in the downloaded pages so that they correctly point to their also-downloaded neighbours. The starting URI that I give on the command line doesn't contain any odd characters. To be useful, i.e. to download a copy of a site that just works when I visit it using a file: URI, what wget needs to do is:

1. Ensure that link URIs and filenames are consistent, i.e. any changes it makes to one are also made to the other.
2. Ensure that filenames are legal as far as the operating system is concerned.
3. Ensure that link URIs are legal as far as the web browser is concerned.

As I see it, the difficulty with semicolons is that they are legal in filenames, but they are only legal in a limited way in URIs. For example, the following URI is legal:

http://foo.com/foo.cgi?p1=v1;p2=v2

Say wget finds this URI in a link while it is downloading some other file in the same directory. It will convert it to a relative URI, and replace the ? with an @ (presumably because ? is not allowed in filenames), giving this URI:

foo.cgi@p1=v1;p2=v2

Unfortunately this is not a valid URI according to RFC1738, which only allows ; after a ?. When Mozilla encounters this URI the link fails to work. By experimenting I find that changing the link URIs to use %3b while leaving unencoded semicolons in the filenames works. Whether this is the right thing to do I am uncertain.
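The consistency requirements above can be sketched as a pair of toy helpers. These names and rules are hypothetical simplifications of mine, not wget's actual code; the point is only that the filename transform and the link rewrite must compose:

```python
def local_filename(uri_tail):
    # make the name legal for the file system: model wget's behaviour
    # of replacing '?' with '@' when saving (simplified)
    return uri_tail.replace("?", "@")

def rewritten_link(filename):
    # the rewritten relative link must point at exactly that file, but
    # with ';' percent-encoded so the URI stays legal (RFC 1738 allows
    # ';' only after '?')
    return filename.replace(";", "%3b")

name = local_filename("foo.cgi?p1=v1;p2=v2")
link = rewritten_link(name)
# name keeps a raw ';' (fine in a filename); link carries %3b, which
# is the combination found to work in Mozilla
```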
I am hoping that there is someone reading this who has already visited this problem and can offer some insight.

Regards,

--Phil.
Re: Escaping semicolons (actually Ampersands)
(2) There are now two threads going on here so I'm splitting it into two messages.

Phil> Tony, are you suggesting that this is legal HTML?
Phil>
Phil> <a href="http://foo.foo/foo.cgi?p1=v1&p2=v2">Foo</a>
Phil>
Phil> I'm fairly confident that you need to escape the & to make it
Phil> valid, i.e.
Phil>
Phil> <a href="http://foo.foo/foo.cgi?p1=v1&amp;p2=v2">Foo</a>

Tony> Just out of curiosity, did you try to implement your theory
Tony> and see what happens? If you did, you would see that the first
Tony> version works and the second does not.

Just because something works, it doesn't mean it's right! You can feed all sorts of rubbish into a web browser and it will display it correctly (unmatched tags etc. etc.). Part of this is to maintain compatibility with old HTML specs; other bits are to cope with duff web pages. (I'm curious to know in what way the second version failed for you, though.)

Tony> By the way, the correct URI encoding of ampersand is %26,
Tony> not &amp;. The latter encoding is used for ampersands in
Tony> HTML markup.

But in this case I am talking HTML, hence the <a> tags in those examples. My point was simply that ; is preferable to & as a CGI parameter separator because there is no need to escape it in HTML, which saves a bit of effort. I'm not talking about URI-escaping it. Here is a complete example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head><title>Ampersand test</title></head>
<body>
<p><a href="http://foo.com/foo.cgi?p1=v1&p2=v2">link1</a></p>
<p><a href="http://foo.com/foo.cgi?p1=v1&amp;p2=v2">link2</a></p>
</body>
</html>

Feed this to the W3C validator (http://validator.w3.org/check) and it will complain about the & in line 6, saying "The most common cause of this error is unencoded ampersands in URLs." It refers to http://www.htmlhelp.com/tools/validator/problems.html#amp for an explanation.

Thanks for your feedback Tony, but what I'm really interested in is getting semicolons to work!

Regards,

--Phil.
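The two escaping layers being distinguished in this thread, HTML entity escaping versus URI percent-encoding, can be demonstrated with Python's standard library. This is only an illustration of the point, not part of the thread:

```python
from html import escape
from urllib.parse import quote

href = "http://foo.com/foo.cgi?p1=v1&p2=v2"

# inside HTML markup, '&' must be written as the entity '&amp;';
# the browser un-escapes it before making the request
html_attr = escape(href, quote=False)

# inside the URI itself, a *literal* ampersand (one that is not a
# parameter separator) would instead be percent-encoded
pct = quote("&", safe="")
```

A semicolon separator sidesteps the first layer entirely, since ';' needs no entity escape in HTML, which is exactly the stated reason for preferring it.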
wget patch (default basic auth, ssl proxy with auth retry, auth retry keep-alive)
i originally sent the following email with patch 1.5 weeks ago to [EMAIL PROTECTED], but as i haven't received a reply nor has cvs been updated since that time (Hrvoje Niksic may be on vacation or busy), i'm sending this to the users' list as someone might be interested in it (as i'm sure i'm not the only one that needs it). the attached patch has been compiled and (briefly) tested under both linux and windows (mingw).

thanks for wget. been using it since '99, so i am glad to pay a small tribute in the spirit of open source by providing this patch. and thanks to whoever maintains the mingw makefile, as it is appreciated (practically and philosophically) to be able to build an open source application on windows with an open source tool chain. (open source applications that can only be built on windows using a non-free tool chain, msvc, are stupid and self-defeating.)

- description of attached patch (against cvs, as of 24 hours ago)...

* SECURITY *

if you want, ignore the commenting out of the basic-auth-by-default code. as a security conscious individual, i'd rather not share my username and password in cleartext (basic) when digest might be needed, or even worse, when no authorization might be needed at all. it requires twice as many connections? that's what persistent connections are for. persistent connections even eliminate the delay of creating a second connection (well, with one of my patches below). and as all that ever gets transmitted that first time is the 401 authorization response, it's not like it's a huge waste of bandwidth. but feel free to ignore it, as from the comments in the code it's a conscious/deliberate decision someone has made.

* FATAL BUG *

when retrying with authorization (goto retry_with_auth), the anchor retry_with_auth is after the proxy code, so if the connection needs to be made through a proxy (again), wget fails as it tries to connect directly to the http server.
conn = u
we're using a proxy, so set conn = proxy
connect to conn (proxy)
set conn = u
connect to conn (http server)
authorization fails, so retry
connect to conn (http server) <- connection attempt times out because we have to use the proxy

so what i did was: after the authorization fails, if we are using a proxy, i reset conn = proxy.

* BUG *

in the CLOSE_FINISH macro, if fd == pconn.socket, then we call invalidate_persistent(), which calls fd_close(pconn.socket), but fd (which was equal to the now closed pconn.socket) is not set to -1 to signify it was closed.

scenario:
1a. make a proxy connect (for ssl) and fail digest authorization
1b. fd_close(fd) within CLOSE_FINISH macro
2a. make connection again and compute and pass digest authorization
2b. mark connection as persistent
3a. reuse persistent connection but fail digest authorization
3b. invalidate_persistent() within CLOSE_FINISH macro, BUT fd NOT RESET TO -1
4a. try to make connection again, but since sock < 0 is false, we try to reuse the now invalid fd

* FEATURE *

i noticed in the debug output that if an auth failed (on first connection), then the socket was closed even though we were about to retry the connection (this time computing auth). so i changed the code so that we don't close the socket under those circumstances. (i personally think that as long as the server replies with keep-alive we should make the connection persistent. an auth failure doesn't directly affect whether a tcp connection can be reused.) so i moved the CLOSE_FINISH(sock) call to after the goto retry_with_auth. but the socket was still not being reused even though the http server was marking the connection keep-alive. then i saw that it was because the content length was -1. just as an auth failure doesn't determine tcp connection reuse, neither should content length. so i commented out the contlen test for keep-alive.
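the proxy retry bug reduces to a toy model of the control flow (hypothetical python mirroring the trace above, not the actual C):

```python
def connect_target(origin, proxy, retrying, patched):
    """models which host wget opens a socket to."""
    conn = origin
    if retrying and not patched:
        # buggy path: the retry_with_auth label sits *after* the proxy
        # setup, so conn still points at the origin server
        return conn
    if proxy is not None:
        conn = proxy  # the patch re-runs this assignment on retry too
    return conn

first = connect_target("http-server", "proxy-host", retrying=False, patched=False)
buggy = connect_target("http-server", "proxy-host", retrying=True, patched=False)
fixed = connect_target("http-server", "proxy-host", retrying=True, patched=True)
# first and fixed go through the proxy; the buggy retry tries the
# origin directly, which is the timeout described above
```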
about my commentary on keep-alive and wget's conditions for using it: i understand that if an auth fails, most likely other requests to the same server will fail, so there's no use in keeping the socket open. but TECHNICALLY, there's no justification to close it. an auth failure does not affect a computer's ability to maintain a tcp connection for longer than a single http request. so, i understand the (obvious) argument for it, i just don't think it's a good argument. but maybe i'm missing something non-obvious.

-

thanks for wget. i am glad that i could do a little bit to improve on this already great application.

corey wright
--
[EMAIL PROTECTED]

default_basic_auth_-_ssl_proxy_connect_auth_retry_-_auth_retry_keep_alive.patch
Description: Binary data