Re: bug in escaped filename calculation?
Hrvoje Niksic wrote:
> In the long run, supporting something like IRIs is surely the right
> thing to go for, but I have a feeling that we'll be stuck with the
> current messy URLs for quite some time to come. So Wget simply needs
> to adapt to the current circumstances. If the locale includes "UTF-8"
> in any shape or form, it is perfectly safe to assume that it's valid
> to create UTF-8 file names. Of course, we don't know if a particular
> URL path sequence is really meant to be UTF-8, but there should be no
> harm in allowing valid UTF-8 sequences to pass through. In other
> words, the default "quote control" policy could simply be smarter
> about what "control" means.

That's true. I had been thinking I'd just deal with it all together, but there's no reason why we couldn't adjust what "control characters" means based on the locale today. Still, I think it's a low-priority enough issue (given that there are workarounds) that I may save it to address all in one lump.

BTW, there's a related discussion at https://savannah.gnu.org/bugs/index.php?20863, though that one is regarding translating between the current locale and Unicode (for command-line arguments) and back again (for file names).

> One consequence would be that Wget creates differently-named files in
> different locales, but it's probably a reasonable price to pay for not
> breaking an important expectation. Another consequence would be
> making users open to IDN homograph attacks, but I don't know if that's
> a problem in the context of creating file names (IDN spoofing is
> normally defined as misrepresenting who you communicate with).

Aren't we already open to this? That is, if someone directs us to www.microsoft.com, where the "o" of "soft" is replaced by its look-alike in Cyrillic, and our DNS server happens to respect IDNs represented literally (instead of translated into the ASCII "punycode" format, as they will be when we support IDNs properly), that "o" in UTF-8 would be 0xD0 0xBE, and so wouldn't get percent-encoded on the way in.

One way of dealing with this, when we _do_ translate to punycode, would be to keep the punycode version for creation of the "hostname" directory. Though that could be ugly in practice, at least for non-latin domain names. The best way of dealing with homographs, though, is to only use IRIs from trusted sources (usually: type them in).

> It could be made to recognize UTF-8 character
> sequences in UTF-8 locales and exempt valid UTF-8 chars from being
> treated as "control" characters. Invalid UTF-8 chars would still pass
> all the checks, and non-canonical UTF-8 sequences would be "rejected"
> (by condemning their byte values to being escaped as %..). This is
> not much work for someone who understands the basics of UTF-8.

Right. If the high bit isn't set, it's ASCII; if it is set, then you can tell by context which high bits ought to be set in its neighbors.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
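The UTF-8 rule sketched above (the lead byte's high bits say how many continuation bytes must follow) is straightforward to encode. Here is a minimal validity check as an illustrative sketch; this is not Wget's url.c code, and the function name is made up for the example:

```c
#include <stddef.h>

/* Illustrative sketch only -- not Wget source.  Returns the length
   (1-4) of a valid, canonical (shortest-form) UTF-8 sequence starting
   at p, or 0 if the bytes do not form one. */
static int
utf8_sequence_length (const unsigned char *p, size_t avail)
{
  int len, i;
  unsigned long cp;

  if (avail == 0)
    return 0;
  if (p[0] < 0x80)
    return 1;                          /* plain ASCII */
  else if ((p[0] & 0xE0) == 0xC0)
    { len = 2; cp = p[0] & 0x1F; }
  else if ((p[0] & 0xF0) == 0xE0)
    { len = 3; cp = p[0] & 0x0F; }
  else if ((p[0] & 0xF8) == 0xF0)
    { len = 4; cp = p[0] & 0x07; }
  else
    return 0;                          /* stray continuation or bad lead */

  if ((size_t) len > avail)
    return 0;
  for (i = 1; i < len; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return 0;                      /* missing continuation byte */
      cp = (cp << 6) | (p[i] & 0x3F);
    }

  /* Reject non-canonical (overlong) encodings, surrogates, and
     code points beyond U+10FFFF. */
  if ((len == 2 && cp < 0x80)
      || (len == 3 && cp < 0x800)
      || (len == 4 && cp < 0x10000)
      || cp > 0x10FFFF
      || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  return len;
}
```

With a check like this, the quoting logic could pass whole valid sequences through while continuing to %-escape lone bytes in the 0x80-0x9F range, exactly as described above.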
Re: bug in escaped filename calculation?
Micah Cowan <[EMAIL PROTECTED]> writes:

> It is actually illegal to specify byte values outside the range of
> ASCII characters in a URL, but it has long been historical practice
> to do so anyway. In most cases, the intended meaning was one of the
> latin character sets (usually latin1), so Wget was right to do as it
> does, at that time.

Your explanation is spot-on. I would only add that Wget's interpretation of what is a "control" character is not so much geared toward Latin 1 as it is geared toward maximum safety. Originally I planned to simply encode *all* file name characters outside the 32-127 range, but in practice it was very annoying (not to mention US-centric) to encode perfectly valid Latin 1/2/3/... characters as %xx. Since the codes 128-159 *are* control characters (in those charsets) that can mess up your screen, and that you wouldn't want seen by default, I decided to encode them by default but allow for a way to turn it off, in case someone used a different charset.

In the long run, supporting something like IRIs is surely the right thing to go for, but I have a feeling that we'll be stuck with the current messy URLs for quite some time to come. So Wget simply needs to adapt to the current circumstances. If the locale includes "UTF-8" in any shape or form, it is perfectly safe to assume that it's valid to create UTF-8 file names. Of course, we don't know if a particular URL path sequence is really meant to be UTF-8, but there should be no harm in allowing valid UTF-8 sequences to pass through. In other words, the default "quote control" policy could simply be smarter about what "control" means.

One consequence would be that Wget creates differently-named files in different locales, but it's probably a reasonable price to pay for not breaking an important expectation. Another consequence would be making users open to IDN homograph attacks, but I don't know if that's a problem in the context of creating file names (IDN spoofing is normally defined as misrepresenting who you communicate with).

For those who want to hack on this, the place to look at is url.c:append_uri_pathel; that strangely-named function takes a path element (a directory name or file name component of the URL) and appends it to the file name. It takes care never to use ".." as a path component and to respect the --restrict-file-names setting as specified by the user. It could be made to recognize UTF-8 character sequences in UTF-8 locales and exempt valid UTF-8 chars from being treated as "control" characters. Invalid UTF-8 chars would still pass all the checks, and non-canonical UTF-8 sequences would be "rejected" (by condemning their byte values to being escaped as %..). This is not much work for someone who understands the basics of UTF-8.
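To make the proposal concrete, here is a rough sketch of what a UTF-8-aware escaping pass might look like. This is not the real url.c:append_uri_pathel; the function names and the simplified validity check are illustrative only:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not in Wget): length (2-4) of a valid UTF-8
   multibyte sequence starting at p, or 0.  Simplified: it checks
   lead/continuation structure but not overlong forms. */
static int
utf8_len (const unsigned char *p, size_t avail)
{
  int len, i;
  if ((p[0] & 0xE0) == 0xC0) len = 2;
  else if ((p[0] & 0xF0) == 0xE0) len = 3;
  else if ((p[0] & 0xF8) == 0xF0) len = 4;
  else return 0;
  if ((size_t) len > avail) return 0;
  for (i = 1; i < len; i++)
    if ((p[i] & 0xC0) != 0x80) return 0;
  return len;
}

/* Sketch of the proposed policy (names are illustrative): copy the
   path element SRC into DEST, escaping "control" bytes as %XX.  In a
   UTF-8 locale, bytes belonging to a valid multibyte sequence are
   exempted from the 0x80-0x9F (latin control) range and pass through. */
static void
append_uri_pathel_sketch (char *dest, size_t destsize,
                          const char *src, int utf8_locale)
{
  const unsigned char *p = (const unsigned char *) src;
  size_t n = strlen (src), i = 0, out = 0;

  while (i < n && out + 4 < destsize)
    {
      unsigned char b = p[i];
      int seq = utf8_locale ? utf8_len (p + i, n - i) : 0;

      if (seq > 0)
        {
          memcpy (dest + out, p + i, seq);   /* valid UTF-8: pass through */
          out += seq;
          i += seq;
        }
      else if (b < 0x20 || b == 0x7F || (b >= 0x80 && b <= 0x9F))
        {
          sprintf (dest + out, "%%%02X", (unsigned) b);  /* control: escape */
          out += 3;
          i++;
        }
      else
        {
          dest[out++] = (char) b;            /* other bytes pass through */
          i++;
        }
    }
  dest[out] = '\0';
}
```

With utf8_locale = 0 this reproduces today's default behavior on the Çatalhöyük URL from this thread (0xC3 passes, 0x87 becomes %87); with utf8_locale = 1 the whole UTF-8 name survives intact.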
Re: bug in escaped filename calculation?
Brian Keck wrote:
> Hello,
>
> I'm wondering if I've found a bug in the excellent wget.
> I'm not asking for help, because it turned out not to be the reason
> one of my scripts was failing.
>
> The possible bug is in the derivation of the filename from a URL which
> contains UTF-8.
>
> The case is:
>
> wget http://en.wikipedia.org/wiki/%C3%87atalh%C3%B6y%C3%BCk
>
> Of course these are all ascii characters, but underlying it are
> 3 nonascii characters, whose UTF-8 encoding is:
>
> hex    octal      name
> ---    -----      ----
> C3 87  303 207    C-cedilla
> C3 B6  303 266    o-umlaut
> C3 BC  303 274    u-umlaut
>
> The file created has a name that's almost, but not quite, a valid UTF-8
> bytestring ...
>
> ls *y*k | od -tc
> 0000000 303 % 8 7 a t a l h 303 266 y 303 274 k \n
>
> I.e., the o-umlaut & u-umlaut UTF-8 encodings occur in the bytestring,
> but the UTF-8 encoding of C-cedilla has its 2nd byte replaced by the
> 3-byte string "%87".

Using --restrict-file-names=nocontrol will do what you want it to, in this instance.

> I'm guessing this is not intended.

Actually, it is (more or less). Realize that Wget really has no idea how to tell whether you're trying to give it UTF-8, or one of the ISO latin charsets. It tends to assume the latter. It also, by default, will not create filenames with control characters in them. In ISO latin, characters in the range 0x80-0x9F are control characters, which is why Wget left %87 escaped (it falls into that range) but not the others, which don't.

It is actually illegal to specify byte values outside the range of ASCII characters in a URL, but it has long been historical practice to do so anyway. In most cases, the intended meaning was one of the latin character sets (usually latin1), so Wget was right to do as it does, at that time.

There is now a standard for representing Unicode values in URLs, whose result is then called IRIs (Internationalized Resource Identifiers). Conforming correctly to this standard would require that Wget be sensitive to the context and encoding of documents in which it finds URLs; in the case of filenames and command arguments, it would probably also require sensitivity to the current locale as determined by environment variables. Wget is simply not equipped to handle IRIs or encoding issues at the moment, so until it is, a proper fix will not be in place. Addressing these is considered a "Wget 2.0" (next-generation Wget functionality) priority, and probably won't be done for a year or two, given that the number of developers involved with Wget, if you add up all the part-time helpers (including me), is probably still less than one full-time dev. :)

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: bug in escaped filename calculation?
Josh Williams wrote:
> On 10/4/07, Brian Keck <[EMAIL PROTECTED]> wrote:
>> I would have sent a fix too, but after finding my way through http.c &
>> retr.c I got lost in url.c.
>
> You and me both. A lot of the code needs to be rewritten; there's a
> lot of spaghetti code in there. I hope Micah chooses to do a complete
> rewrite for version 2 so I can get my hands dirty and understand the
> code better.

Currently, I'm planning on refactoring what exists, as needed, rather than going for a complete rewrite. This will be driven by unit tests, to try to ensure that we do not lose functionality along the way. This involves more work overall, but IMO has these key advantages:

  * as mentioned, it's easier to prevent functionality loss;
  * we will be able to use the work as it's written, instead of waiting
    many months for everything to be finished (especially with the
    current number of developers); and
  * AIUI, the wording of employer copyright assignment releases may not
    apply to new works that are not _preexisting_ as GPL works. This
    means that, if a rewrite ended up using no code whatsoever from the
    original work (not likely, but...), there could be legal issues.

After 1.11 is released (or possibly before), one of my top priorities is to clean up the gethttp and http_loop functions to a degree where they can be much more readily read and understood (and modified!). This is important to me because, so far (in my probably-not-statistically-significant 3 months as maintainer), a majority of the trickier fixes have been in those two functions. Some of these fixes seem to frequently introduce bugs of their own, and I spend more time than seems right trying to understand the code there, which is why these particular functions are prime targets for refactoring. :)

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: bug in escaped filename calculation?
On 10/4/07, Brian Keck <[EMAIL PROTECTED]> wrote:
> I would have sent a fix too, but after finding my way through http.c &
> retr.c I got lost in url.c.

You and me both. A lot of the code needs to be rewritten; there's a lot of spaghetti code in there. I hope Micah chooses to do a complete rewrite for version 2 so I can get my hands dirty and understand the code better.
bug in escaped filename calculation?
Hello,

I'm wondering if I've found a bug in the excellent wget. I'm not asking for help, because it turned out not to be the reason one of my scripts was failing.

The possible bug is in the derivation of the filename from a URL which contains UTF-8. The case is:

wget http://en.wikipedia.org/wiki/%C3%87atalh%C3%B6y%C3%BCk

Of course these are all ascii characters, but underlying them are 3 nonascii characters, whose UTF-8 encodings are:

hex    octal      name
---    -----      ----
C3 87  303 207    C-cedilla
C3 B6  303 266    o-umlaut
C3 BC  303 274    u-umlaut

The file created has a name that's almost, but not quite, a valid UTF-8 bytestring ...

ls *y*k | od -tc
0000000 303 % 8 7 a t a l h 303 266 y 303 274 k \n

I.e., the o-umlaut & u-umlaut UTF-8 encodings occur in the bytestring, but the UTF-8 encoding of C-cedilla has its 2nd byte replaced by the 3-byte string "%87". I'm guessing this is not intended.

I would have sent a fix too, but after finding my way through http.c & retr.c I got lost in url.c.

Brian Keck
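[Editor's note] The behavior reported above can be reproduced from first principles: percent-decode the URL path element, then check which decoded bytes fall in the latin control range 0x80-0x9F, since those are the ones Wget re-escapes by default. A small sketch with hypothetical helpers (not Wget code):

```c
#include <stddef.h>

/* Hypothetical helpers for illustration -- not Wget source. */

static int
hexval (char c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  return -1;
}

/* Decode %XX escapes from src into dest; returns the decoded length.
   dest must have room for strlen(src)+1 bytes. */
static size_t
percent_decode (unsigned char *dest, const char *src)
{
  size_t out = 0;
  while (*src)
    {
      if (src[0] == '%' && hexval (src[1]) >= 0 && hexval (src[2]) >= 0)
        {
          dest[out++] = (unsigned char) (hexval (src[1]) * 16
                                         + hexval (src[2]));
          src += 3;
        }
      else
        dest[out++] = (unsigned char) *src++;
    }
  dest[out] = '\0';
  return out;
}

/* Bytes 0x00-0x1F, 0x7F, and (in the ISO latin charsets) 0x80-0x9F
   are control characters; the last range is why %87 stays escaped
   in the filename while %B6 and %BC do not. */
static int
is_latin_control (unsigned char b)
{
  return b < 0x20 || b == 0x7F || (b >= 0x80 && b <= 0x9F);
}
```

Decoding "%C3%87atalh%C3%B6y%C3%BCk" yields thirteen bytes, of which only 0x87 (the second byte of C-cedilla) is in a control range, matching the od output above.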