Re: Escaping semicolons

2004-06-27 Thread Tony Lewis
Phil Endecott wrote:

 There is not much to go on in terms of specifications.  The closest is
 RFC1738, which includes BNF for a file: URI.  However, it is ten years
 old, so whether it reflects current practice I do not know.  But it does
 not allow ";" in file: URIs.

 I conclude from this that wget should be replacing ";" with its %3b escape
 sequence.

I think you're confusing what wget is required to do with URLs entered on
the command line and what it chooses to do with the resulting files that it
saves. If the unencoded name of a retrieved resource cannot be stored on the
local file system, wget encodes it to create a valid name.

 Tony Lewis wrote:
   I use semicolons in CGI URIs to separate parameters.  (Ampersand
   is more often used for this, but semicolon is also allowed and
   has the advantage that there is no need to escape it in HTML.)
 
  There is no need to escape ampersands either.

 Tony, are you suggesting that this is legal HTML?

   <a href="http://foo.foo/foo.cgi?p1=v1&p2=v2">Foo</a>

 I'm fairly confident that you need to escape the & to make it valid, i.e.

   <a href="http://foo.foo/foo.cgi?p1=v1&amp;p2=v2">Foo</a>

Just out of curiosity, did you try to implement your theory and see what
happens? If you did, you would see that the first version works and the
second does not.

By the way, the correct URI encoding of ampersand is %26, not &amp;. The
latter encoding is used for ampersands in HTML markup.

With regard to whether ampersand needs to be encoded, you're misreading the
RFC:

   Many URL schemes reserve certain characters for a special meaning:
   their appearance in the scheme-specific part of the URL has a
   designated semantics. If the character corresponding to an octet is
   reserved in a scheme, the octet must be encoded.  The characters ";",
   "/", "?", ":", "@", "=" and "&" are the characters which may be
   reserved for special meaning within a scheme. No other characters may
   be reserved within a scheme.

   Usually a URL has the same interpretation when an octet is
   represented by a character and when it is encoded. However, this is
   not true for reserved characters: encoding a character reserved for a
   particular scheme may change the semantics of a URL.

The RFC says that you have to escape a reserved character if that character
appears in the name of the resource you're trying to retrieve. That is, if
you're trying to retrieve a file named a&b.txt, you refer to that file as
a%26b.txt in the URL because you're using the ampersand for a non-reserved
purpose.

If you're using a reserved character for the purpose for which it has been
reserved (in this case, separating parameters), you do NOT want to encode
it. The URL you proposed (after correcting the encoding of the ampersand) is
requesting a resource (probably a file) whose name is foo.cgi?p1=v1&p2=v2.
It is NOT requesting that the script foo.cgi be executed with argument p1
having a value of v1 and p2 having a value of v2.
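
To make the distinction concrete, compare these two (made-up) URLs:

   http://foo.foo/a%26b.txt             (fetches the file named "a&b.txt")
   http://foo.foo/foo.cgi?p1=v1&p2=v2   (the "&" separates two CGI parameters)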

Hope that helps.

Tony



Re: Escaping semicolons

2004-06-27 Thread Phil Endecott
Dear All,

There are now two threads here so I'm splitting it into two messages.

Phil I conclude from this that wget should be replacing ";" with
Phil its %3b escape sequence.

Tony I think you're confusing what wget is required to do with
Tony URLs entered on the command line and what it chooses to do
Tony with the resulting files that it saves. If the unencoded name
Tony of a retrieved resource cannot be stored on the local file
Tony system, wget encodes it to create a valid name.

I'm using wget in recursive mode with -k to modify links in the downloaded pages so 
that they correctly point to their also-downloaded neighbours.  The starting URI that 
I give on the command line doesn't contain any odd characters.
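
For concreteness, the invocation is something along these lines (the URL is
just a placeholder):

  wget -r -k http://foo.com/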

To be useful - i.e. to download a copy of a site that just works when I visit it using 
a file: URI - what wget needs to do is:

1. Ensure that link URIs and filenames are consistent, i.e. any changes it makes to 
one should also be made to the other.
2. Ensure that filenames are legal as far as the operating system is concerned.
3. Ensure that link URIs are legal as far as the web browser is concerned.

As I see it the difficulty with semicolons is that they are legal in filenames, but 
they are only legal in a limited way in URIs.  For example, the following URI is legal:

  http://foo.com/foo.cgi?p1=v1;p2=v2

Say wget finds this URI in a link while it is downloading some other file in the same 
directory.  It will convert it to a relative URI, and replace the "?" with an "@" 
(presumably because "?" is not allowed in filenames), giving this URI:

  foo.cgi@p1=v1;p2=v2

Unfortunately this is not a valid URI according to RFC1738, which only allows ";" 
after a "?".  When Mozilla encounters this URI the link fails to work.

By experimenting I find that changing the link URIs to use %3b while leaving unencoded 
semicolons in the filenames will work.  Whether this is the right thing to do I am 
uncertain.  I am hoping that there is someone reading this who has already visited this 
problem and can offer some insight.
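
In other words, for the example above the combination that works is:

  filename on disk:  foo.cgi@p1=v1;p2=v2
  link in the page:  <a href="foo.cgi@p1=v1%3bp2=v2">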

Regards,

--Phil.


Re: Escaping semicolons (actually Ampersands)

2004-06-27 Thread Phil Endecott
(2) There are now two threads going on here so I'm splitting it into two messages.

Phil Tony, are you suggesting that this is legal HTML?
Phil   <a href="http://foo.foo/foo.cgi?p1=v1&p2=v2">Foo</a>
Phil I'm fairly confident that you need to escape the & to make it
Phil valid, i.e.
Phil   <a href="http://foo.foo/foo.cgi?p1=v1&amp;p2=v2">Foo</a>

Tony Just out of curiosity, did you try to implement your theory
Tony and see what happens? If you did, you would see that the first
Tony version works and the second does not.

Just because something works doesn't mean it's right!  You can feed all sorts of 
rubbish into a web browser and it will display it correctly (unmatched tags, etc.).  
Part of this is to maintain compatibility with old HTML specs; other bits are 
to cope with duff web pages.  (I'm curious to know in what way the second version 
failed for you, though.)

Tony By the way, the correct URI encoding of ampersand is %26,
Tony not &amp;. The latter encoding is used for ampersands in
Tony HTML markup.

But in this case I am talking HTML, hence the <a> tags in those examples.  My point 
was simply that ";" is preferable to "&" as a CGI parameter separator because there is 
no need to escape it in HTML, which saves a bit of effort.  I'm not talking about 
URI-escaping it.

Here is a complete example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head><title>Ampersand test</title></head>
<body>
<p><a href="http://foo.com/foo.cgi?p1=v1&p2=v2">link1</a></p>
<p><a href="http://foo.com/foo.cgi?p1=v1&amp;p2=v2">link2</a></p>
</body>
</html>

Feed this to the W3C validator (http://validator.w3.org/check) and it will complain 
about the "&" in line 6, saying "The most common cause of this error is unencoded 
ampersands in URLs".  It refers to 
http://www.htmlhelp.com/tools/validator/problems.html#amp for an explanation.

Thanks for your feedback Tony, but what I'm really interested in is getting semicolons 
to work!

Regards,

--Phil.



wget patch (default basic auth, ssl proxy with auth retry, auth retry keep-alive)

2004-06-27 Thread Corey Wright
i originally sent the following email (with patch) 1.5 weeks ago to
[EMAIL PROTECTED], but as i haven't received a reply nor has cvs
been updated since that time (Hrvoje Niksic may be on vacation or busy),
i'm sending this to the users' list as someone might be interested in it
(as i'm sure i'm not the only one that needs it).

the attached patch has been compiled and (briefly) tested under both
linux and windows (mingw).

thanks for wget.  been using it since '99, so i am glad to pay a small
tribute in the spirit of open source by providing this patch.

and thanks to whoever maintains the mingw makefile, as it is appreciated
(practically and philosophically) to be able to build an open source
application on windows with an open source tool chain.  (open source
applications that are only able to be built on windows using a non-free
tool chain, msvc, are stupid and self-defeating.)

-

description of attached patch (against cvs, as of 24 hours ago)...

* SECURITY *

if you want, ignore the commenting out of the basic auth by default
code.  as a security conscious individual, i'd rather not share my
username and password in cleartext (basic) when digest might be needed,
or even worse, when no authorization might be needed at all.  it
requires twice as many connections?  that's what persistent connections
are for.  persistent connections even eliminate the delay of creating a
second connection (well, with one of my patches below).  and as all that
ever gets transmitted that first time is the 401 authorization response,
it's not like it's a huge waste of bandwidth.

but feel free to ignore it, as the comments in the code make it clear
that this is a conscious/deliberate decision someone has made.

* FATAL BUG *

when retrying with authorization (goto retry_with_auth), the anchor
retry_with_auth is after the proxy code, so if the connection needs to be
made through a proxy (again), wget fails as it tries to connect directly
to the http server.

conn = u
we're using a proxy, so set conn = proxy
connect to conn (proxy)
set conn = u
connect to conn (http server)
authorization fails, so retry
connect to conn (http server)
connection attempt times out because we have to use the proxy

so what i did: after the authorization fails, if we are using a
proxy, i reset conn = proxy.
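
roughly, as a sketch (conn and proxy are the variables from the trace
above; the surrounding gethttp code is paraphrased, not quoted):

  retry_with_auth:
    /* ...existing retry setup... */

    /* the original request went through a proxy, so point conn back at
       the proxy before reconnecting; otherwise the retry opens a direct
       connection to the http server and times out. */
    if (proxy)
      conn = proxy;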

* BUG *

in the CLOSE_FINISH macro, if fd == pconn.socket, then we call
invalidate_persistent, which calls fd_close(pconn.socket), but fd (which
was equal to the now-closed pconn.socket) is not set to -1 to signify
that it was closed.

scenario:
1a. make a proxy connect (for ssl) and fail digest authorization
1b. fd_close(fd) within CLOSE_FINISH macro
2a. make connection again and compute and pass digest authorization
2b. mark connection as persistent
3a. reuse persistent connection but fail digest authorization
3b. invalidate_persistent() within CLOSE_FINISH macro, BUT fd NOT
RESET TO -1
4a. try to make connection again, but since sock < 0 is
false, we try to reuse the now-invalid fd
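
the fix is a one-liner in the macro; as a sketch (everything here except
fd, pconn.socket, invalidate_persistent, and fd_close is a guess at the
surrounding code):

  #define CLOSE_FINISH(fd) do {                 \
    if (pconn_active && (fd) == pconn.socket)   \
      invalidate_persistent ();                 \
    else                                        \
      fd_close (fd);                            \
    /* signify closure so the sock < 0 test in step 4a reconnects */ \
    (fd) = -1;                                  \
  } while (0)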

* FEATURE *

i noticed in the debug output that if an auth failed (on first
connection), then the socket was closed even though we were about to
retry the connection (this time computing auth).  so i changed the code
so that we didn't close the socket under those circumstances.  (i
personally think that as long as the server replies with keep-alive we
should make it persistent.  an auth failure doesn't directly affect
whether a tcp connection can be reused.)

so i moved the CLOSE_FINISH(sock) call to after the goto
retry_with_auth.  but the socket was still not being reused even though
the http server was marking the connection keep-alive.  then i saw
that it was because the content length was -1.  just as an auth failure
doesn't determine tcp connection reuse, neither should content length. 
so i commented out the contlen test for keep-alive.
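
in code terms the change amounts to something like this (contlen is the
real variable; the rest is a placeholder for the actual test in http.c):

  /* before: an unknown content length forced the connection closed */
  keep_alive = response_has_keep_alive && contlen != -1;

  /* after: let the server's keep-alive header decide on its own */
  keep_alive = response_has_keep_alive;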

about my commentary on keep-alive and wget's conditions for using it: i
understand that if an auth fails, most likely other requests to the same
server will fail, so there's no use in keeping the socket open.  but
TECHNICALLY, there's no justification to close it.  an auth failure does
not affect a computer's ability to maintain a tcp connection for longer
than a single http request.  so, i understand the (obvious) argument for
it, i just don't think it's a good argument.  but maybe i'm missing
something non-obvious.

-

thanks for wget.  i am glad that i could do a little bit to improve on
this already great application.

corey wright
-- 
[EMAIL PROTECTED]


default_basic_auth_-_ssl_proxy_connect_auth_retry_-_auth_retry_keep_alive.patch
Description: Binary data