Re: bug in escaped filename calculation?

2007-10-04 Thread Micah Cowan
Hrvoje Niksic wrote:

> In the long run, supporting something like IRIs is surely the right
> thing to go for, but I have a feeling that we'll be stuck with the
> current messy URLs for quite some time to come.  So Wget simply needs
> to adapt to the current circumstances.  If the locale includes "UTF-8"
> in any shape or form, it is perfectly safe to assume that it's valid
> to create UTF-8 file names.  Of course, we don't know if a particular
> URL path sequence is really meant to be UTF-8, but there should be no
> harm in allowing valid UTF-8 sequences to pass through.  In other
> words, the default "quote control" policy could simply be smarter
> about what "control" means.

That's true. I had been thinking I'd deal with it all at once, but
there's no reason we couldn't adjust what "control characters" means
based on the locale today. Still, I think it's a low-priority issue
(given that there are workarounds), so I may save it to address in
one lump.
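
For the record, that locale check is cheap. Here's a minimal sketch
(the function name is made up; nl_langinfo is standard POSIX, and the
caller must have run setlocale first):

  #include <langinfo.h>
  #include <locale.h>
  #include <strings.h>  /* strcasecmp (POSIX) */

  /* Hypothetical helper: report whether the active locale's codeset
     is UTF-8 under its common spellings.  Call
     setlocale (LC_CTYPE, "") once at startup so that nl_langinfo
     reflects the user's environment. */
  static int
  locale_is_utf8 (void)
  {
    const char *cs = nl_langinfo (CODESET);
    return cs != NULL
           && (strcasecmp (cs, "UTF-8") == 0
               || strcasecmp (cs, "UTF8") == 0);
  }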

BTW, there's a related discussion at
https://savannah.gnu.org/bugs/index.php?20863, though that one is
regarding translating between the current locale and Unicode (for
command-line arguments) and back again (for file names).

> One consequence would be that Wget creates differently-named files in
> different locales, but it's probably a reasonable price to pay for not
> breaking an important expectation.  Another consequence would be
> making users open to IDN homograph attacks, but I don't know if that's
> a problem in the context of creating file names (an IDN homograph
> attack is normally understood as misrepresenting whom you are
> communicating with).

Aren't we already open to this? That is, suppose someone directs us
to www.microsoft.com where the "o" of "soft" has been replaced by its
Cyrillic look-alike, and our DNS server happens to respect IDNs
represented literally (instead of translated into the ASCII
"punycode" format, as they will be once we support IDNs properly).
That "o" in UTF-8 would be 0xD0 0xBE, and so wouldn't get
percent-encoded on the way in.

One way of dealing with this, when we _do_ translate to punycode,
would be to keep the punycode version when creating the "hostname"
directory. That could be ugly in practice, though, especially for
non-Latin domain names.

The best way of dealing with homographs, though, is to only use IRIs
from trusted sources (usually: type them in).

> It could be made to recognize UTF-8 character
> sequences in UTF-8 locales and exempt valid UTF-8 sequences from
> being treated as "control" characters.  Bytes that don't form valid
> UTF-8 would still be subject to all the usual checks, and
> non-canonical (overlong) UTF-8 sequences would be "rejected" (by
> condemning their byte values to being escaped as %..).  This is
> not much work for someone who understands the basics of UTF-8.

Right. If the high bit isn't set, it's ASCII; if it is set, the lead
byte's high bits tell you how many bytes the sequence has, and every
continuation byte that follows must have the form 10xxxxxx.
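
To make that concrete, here's a minimal sketch of a validity check
(none of this is from the wget sources; the function name is made up):

  #include <stddef.h>

  /* Hypothetical helper: return the length (1-4) of the valid UTF-8
     sequence starting at p (at most len bytes available), or 0 if
     the bytes there are not valid UTF-8. */
  static size_t
  utf8_sequence_length (const unsigned char *p, size_t len)
  {
    size_t n, i;

    if (len == 0)
      return 0;
    if (p[0] < 0x80)                  /* 0xxxxxxx: plain ASCII */
      return 1;
    else if ((p[0] & 0xE0) == 0xC0)   /* 110xxxxx: 2-byte sequence */
      n = 2;
    else if ((p[0] & 0xF0) == 0xE0)   /* 1110xxxx: 3-byte sequence */
      n = 3;
    else if ((p[0] & 0xF8) == 0xF0)   /* 11110xxx: 4-byte sequence */
      n = 4;
    else
      return 0;                       /* stray continuation byte */

    if (n > len)
      return 0;                       /* truncated at end of input */
    for (i = 1; i < n; i++)
      if ((p[i] & 0xC0) != 0x80)      /* must be 10xxxxxx */
        return 0;

    /* Reject the non-canonical (overlong) encodings mentioned above:
       a 2-byte sequence must encode at least U+0080, and so on. */
    if (n == 2 && p[0] < 0xC2)
      return 0;
    if (n == 3 && p[0] == 0xE0 && p[1] < 0xA0)
      return 0;
    if (n == 4 && p[0] == 0xF0 && p[1] < 0x90)
      return 0;
    return n;
  }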

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/



Re: bug in escaped filename calculation?

2007-10-04 Thread Hrvoje Niksic
Micah Cowan <[EMAIL PROTECTED]> writes:

> It is actually illegal to specify byte values outside the range of
> ASCII characters in a URL, but it has long been historical practice
> to do so anyway. In most cases, the intended meaning was one of the
> Latin character sets (usually Latin-1), so Wget's behavior was right
> for its time.

Your explanation is spot-on.  I would only add that Wget's
interpretation of what is a "control" character is not so much geared
toward Latin-1 as it is toward maximum safety.  Originally I planned
to simply encode *all* file name characters outside the 32-127 range,
but in practice it was very annoying (not to mention US-centric) to
encode perfectly valid Latin-1/2/3/... characters as %xx.  Since the
codes 128-159 *are* control characters in those charsets, ones that
can mess up your screen and that you wouldn't want displayed by
default, I decided to encode them by default, but allow for a way to
turn that off in case someone uses a different charset.
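
As a concrete sketch of that policy (the names are made up for
illustration, and treating DEL as a control is my assumption; this
is not the actual wget code), the default quoting test amounts to:

  /* Hypothetical helper: should byte c be escaped as %XX in file
     names?  Bytes 0-31 (and, assumed here, 127/DEL) are the ASCII
     control characters; 128-159 are the C1 controls in the ISO
     Latin charsets, escaped by default but exempted when the user
     turns the restriction off (opt_nocontrol). */
  static int
  char_is_quoted_control (unsigned char c, int opt_nocontrol)
  {
    if (c < 32 || c == 127)       /* C0 controls and DEL */
      return 1;
    if (c >= 128 && c <= 159)     /* C1 controls in ISO Latin */
      return !opt_nocontrol;
    return 0;
  }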

In the long run, supporting something like IRIs is surely the right
thing to go for, but I have a feeling that we'll be stuck with the
current messy URLs for quite some time to come.  So Wget simply needs
to adapt to the current circumstances.  If the locale includes "UTF-8"
in any shape or form, it is perfectly safe to assume that it's valid
to create UTF-8 file names.  Of course, we don't know if a particular
URL path sequence is really meant to be UTF-8, but there should be no
harm in allowing valid UTF-8 sequences to pass through.  In other
words, the default "quote control" policy could simply be smarter
about what "control" means.

One consequence would be that Wget creates differently-named files in
different locales, but it's probably a reasonable price to pay for not
breaking an important expectation.  Another consequence would be
making users open to IDN homograph attacks, but I don't know if that's
a problem in the context of creating file names (an IDN homograph
attack is normally understood as misrepresenting whom you are
communicating with).

For those who want to hack on this, the place to look at is
url.c:append_uri_pathel; that strangely-named function takes a path
element (a directory name or file name component of the URL) and
appends it to the file name.  It takes care not to ever use ".." as a
path component and to respect the --restrict-file-names setting as
specified by the user.  It could be made to recognize UTF-8 character
sequences in UTF-8 locales and exempt valid UTF-8 sequences from
being treated as "control" characters.  Bytes that don't form valid
UTF-8 would still be subject to all the usual checks, and
non-canonical (overlong) UTF-8 sequences would be "rejected" (by
condemning their byte values to being escaped as %..).  This is
not much work for someone who understands the basics of UTF-8.
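
A sketch of what that exemption could look like, reusing the
hypothetical helpers sketched elsewhere in this thread
(locale_is_utf8, utf8_sequence_length, char_is_quoted_control); none
of these names come from the real url.c:

  #include <stdio.h>
  #include <stddef.h>
  #include <string.h>

  /* Quote one path element from [b, e) into out (assumed large
     enough: worst case three bytes out per byte in, plus a NUL),
     letting valid multi-byte UTF-8 pass through in UTF-8 locales. */
  static void
  quote_pathel_utf8_aware (const unsigned char *b,
                           const unsigned char *e, char *out)
  {
    int utf8_locale = locale_is_utf8 ();
    while (b < e)
      {
        size_t n = utf8_locale
                   ? utf8_sequence_length (b, (size_t) (e - b)) : 0;
        if (n > 1)
          {
            memcpy (out, b, n);   /* valid UTF-8: copy untouched */
            out += n;
            b += n;
          }
        else if (char_is_quoted_control (*b, 0))
          {
            sprintf (out, "%%%02X", *b);  /* escape byte as %XX */
            out += 3;
            b++;
          }
        else
          *out++ = (char) *b++;
      }
    *out = '\0';
  }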


Re: bug in escaped filename calculation?

2007-10-04 Thread Micah Cowan
Brian Keck wrote:
> Hello,
> 
> I'm wondering if I've found a bug in the excellent wget.
> I'm not asking for help, because it turned out not to be the reason
> one of my scripts was failing.
> 
> The possible bug is in the derivation of the filename from a URL which
> contains UTF-8.
> 
> The case is:
> 
>   wget http://en.wikipedia.org/wiki/%C3%87atalh%C3%B6y%C3%BCk
> 
> Of course these are all ascii characters, but underlying it are
> 3 nonascii characters, whose UTF-8 encoding is:
> 
>   hex    octal    name
>   -----  -------  ---------
>   C3 87  303 207  C-cedilla
>   C3 B6  303 266  o-umlaut
>   C3 BC  303 274  u-umlaut
> 
> The file created has a name that's almost, but not quite, a valid UTF-8
> bytestring ... 
> 
>   ls *y*k | od -tc
>   000 303   %   8   7   a   t   a   l   h 303 266   y 303 274   k  \n
> 
> I.e., the o-umlaut & u-umlaut UTF-8 encodings occur in the bytestring,
> but the UTF-8 encoding of C-cedilla has its 2nd byte replaced by the
> 3-byte string "%87".

Using --restrict-file-names=nocontrol will do what you want, in this
instance.
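
For example, against the URL you reported:

  wget --restrict-file-names=nocontrol \
       http://en.wikipedia.org/wiki/%C3%87atalh%C3%B6y%C3%BCk

should leave all three two-byte UTF-8 sequences intact in the created
filename.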

> I'm guessing this is not intended.  

Actually, it is (more-or-less).

Realize that Wget really has no way to tell whether you're giving it
UTF-8 or one of the ISO Latin charsets; it tends to assume the
latter. It also, by default, will not create filenames with control
characters in them. In the ISO Latin charsets, bytes in the range
0x80-0x9F are control characters, which is why Wget left %87 escaped
(0x87 falls into that range) but not the others (0xC3, 0xB6, and
0xBC don't).

It is actually illegal to specify byte values outside the range of
ASCII characters in a URL, but it has long been historical practice
to do so anyway. In most cases, the intended meaning was one of the
Latin character sets (usually Latin-1), so Wget's behavior was right
for its time.

There is now a standard for representing Unicode in URLs; the result
is called an IRI (Internationalized Resource Identifier). Conforming
correctly to this standard would require that Wget be sensitive to
the context and encoding of the documents in which it finds URLs; in
the case of filenames and command arguments, it would probably also
require sensitivity to the current locale as determined by
environment variables. Wget is simply not equipped to handle IRIs or
encoding issues at the moment, so until it is, a proper fix will not
be in place. Addressing these is considered a "Wget 2.0"
(next-generation Wget functionality) priority, and probably won't be
done for a year or two, given that the number of developers involved
with Wget, if you add up all the part-time helpers (including me), is
probably still less than one full-time dev. :)

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/



Re: bug in escaped filename calculation?

2007-10-04 Thread Micah Cowan
Josh Williams wrote:
> On 10/4/07, Brian Keck <[EMAIL PROTECTED]> wrote:
>> I would have sent a fix too, but after finding my way through http.c &
>> retr.c I got lost in url.c.
> 
> You and me both. A lot of the code needs to be rewritten... there's
> a lot of spaghetti code in there. I hope Micah chooses to do a
> complete rewrite for version 2 so I can get my hands dirty and
> understand the code better.

Currently, I'm planning on refactoring what exists, as needed, rather
than going for a complete rewrite. This will be driven by unit-tests, to
try to ensure that we do not lose functionality along the way. This
involves more work overall, but IMO has these key advantages:

 * as mentioned, it's easier to prevent functionality loss,
 * we will be able to use the work as it's written, instead of waiting
many months for everything to be finished (especially with the current
number of developers), and
 * AIUI, the wording of employer copyright assignment releases may not
apply to new works that are not _preexisting_ as GPL works. This means
that, if a rewrite ended up using no code whatsoever from the original
work (not likely, but...), there could be legal issues.

After 1.11 is released (or possibly before), one of my top priorities is
to clean up the gethttp and http_loop functions to a degree where they
can be much more readily read and understood (and modified!). This is
important to me because so far (in my
probably-not-statistically-significant 3 months as maintainer) a
majority of the trickier fixes have been in those two functions. Some of
these fixes seem to frequently introduce bugs of their own, and I spend
more time than seems right in trying to understand the code there, which
is why these particular functions are prime targets for refactoring. :)

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/



Re: bug in escaped filename calculation?

2007-10-04 Thread Josh Williams
On 10/4/07, Brian Keck <[EMAIL PROTECTED]> wrote:
> I would have sent a fix too, but after finding my way through http.c &
> retr.c I got lost in url.c.

You and me both. A lot of the code needs to be rewritten... there's
a lot of spaghetti code in there. I hope Micah chooses to do a
complete rewrite for version 2 so I can get my hands dirty and
understand the code better.


bug in escaped filename calculation?

2007-10-04 Thread Brian Keck

Hello,

I'm wondering if I've found a bug in the excellent wget.
I'm not asking for help, because it turned out not to be the reason
one of my scripts was failing.

The possible bug is in the derivation of the filename from a URL which
contains UTF-8.

The case is:

  wget http://en.wikipedia.org/wiki/%C3%87atalh%C3%B6y%C3%BCk

Of course these are all ascii characters, but underlying it are
3 nonascii characters, whose UTF-8 encoding is:

  hex    octal    name
  -----  -------  ---------
  C3 87  303 207  C-cedilla
  C3 B6  303 266  o-umlaut
  C3 BC  303 274  u-umlaut

The file created has a name that's almost, but not quite, a valid UTF-8
bytestring ... 

  ls *y*k | od -tc
  000 303   %   8   7   a   t   a   l   h 303 266   y 303 274   k  \n

I.e., the o-umlaut & u-umlaut UTF-8 encodings occur in the bytestring,
but the UTF-8 encoding of C-cedilla has its 2nd byte replaced by the
3-byte string "%87".

I'm guessing this is not intended.  

I would have sent a fix too, but after finding my way through http.c &
retr.c I got lost in url.c.

Brian Keck