More flexible URL file name generation

2003-09-14 Thread Hrvoje Niksic
This patch makes URL file name generation a bit more flexible and,
hopefully, better for the end-user.  It does two things:

* Decouples file name quoting from URL quoting.  The conflation of the
  two has been an endless source of annoyance for users.  For example,
  space *has* to be quoted in URLs, but you don't really want to quote
  it in file names.

* Gives the user more control over the quoting mechanism.  There are
  now several quoting levels:

  --restrict-file-names=none    - no restriction, only quote / and \0

  --restrict-file-names=unix    - quote the above, plus chars in the
                                  0-31 and in the 128-159 range, which
                                  are not printable in the shell.

  --restrict-file-names=windows - quote the above, plus chars
                                  disallowed on Windows: \, |, <, >,
                                  ?, :, *, and ".

  The default is windows under Windows and Cygwin, and unix elsewhere.
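
The three levels can be read as nested sets of bytes to quote.  The
following is a minimal sketch of that reading, not the code from the
patch: the enum and the function name needs_quoting are hypothetical,
and the Windows set includes <, > and ", which Windows also forbids
in file names.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; the patch itself uses
   opt.restrict_file_names and file_unsafe_char. */
enum restrict_mode { RESTRICT_NONE, RESTRICT_UNIX, RESTRICT_WINDOWS };

/* Return true if byte c should be quoted in a local file name
   under the given restriction level. */
bool needs_quoting(unsigned char c, enum restrict_mode mode)
{
    /* Every level quotes '/' and NUL. */
    if (c == '/' || c == '\0')
        return true;
    if (mode == RESTRICT_NONE)
        return false;

    /* "unix" and "windows" also quote the 0-31 and 128-159 ranges. */
    if (c <= 31 || (c >= 128 && c <= 159))
        return true;
    if (mode == RESTRICT_UNIX)
        return false;

    /* "windows" additionally quotes characters Windows disallows. */
    switch (c) {
    case '\\': case '|': case '<': case '>':
    case '?':  case ':': case '*': case '"':
        return true;
    default:
        return false;
    }
}
```

Under this sketch a space is never quoted at any level, while ':' is
quoted only at the windows level.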

This patch should supersede the various patches that have been
floating around that fix the problem in a limited fashion.  Please
test this patch and let me know if it works for you, and if something
else is needed.


2003-09-14  Hrvoje Niksic  [EMAIL PROTECTED]

* url.c (append_uri_pathel): Use opt.restrict_file_names when
calling file_unsafe_char.

* init.c: New command restrict_file_names.

* main.c (main): New option --restrict-file-names[=windows,unix].

* url.c (url_file_name): Renamed from url_filename.
(url_file_name): Add directory and hostdir prefix here, not in
mkstruct.
(append_dir_structure): New function, does part of the work that
used to be in mkstruct.  Iterates over path elements in u->path,
calling append_uri_pathel on each one to append it to the file
name.
(append_uri_pathel): URL-unescape a path element and reencode it
with a different set of rules, more appropriate for handling of
files.
(file_unsafe_char): New function, uses a lookup table to decide
whether a character should be escaped for use in file name.
(append_string): New utility function.
(append_char): Ditto.
(file_unsafe_char): New argument restrict_for_windows, decide
whether Windows file names should be escaped in run-time.

* connect.c: Include stdlib.h to get prototype for abort().
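
The append_uri_pathel entry above describes a two-step transformation:
URL-unescape a path element, then re-encode it under file name rules.
A rough sketch of that idea follows, assuming the "unix" level only.
The names hex_value, unsafe_for_file and requote_path_element are
hypothetical and do not appear in the patch, which uses a lookup table
rather than the inline test shown here.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Value of a hex digit; the caller must have checked isxdigit(c). */
int hex_value(unsigned char c)
{
    return isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
}

/* Stand-in for file_unsafe_char, hard-wired to the "unix" level. */
int unsafe_for_file(unsigned char c)
{
    return c == '/' || c == '\0' || c <= 31 || (c >= 128 && c <= 159);
}

/* Decode %XX sequences in one URL path element, then re-encode only
   the bytes that are unsafe in a file name.  Writes a NUL-terminated
   result into out.  Returns 0 on success, -1 if out is too small. */
int requote_path_element(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    while (*in) {
        unsigned char c = (unsigned char) *in++;

        /* Step 1: undo URL quoting. */
        if (c == '%' && isxdigit((unsigned char) in[0])
            && isxdigit((unsigned char) in[1])) {
            c = (unsigned char) (hex_value((unsigned char) in[0]) * 16
                                 + hex_value((unsigned char) in[1]));
            in += 2;
        }

        /* Step 2: apply file name quoting. */
        if (unsafe_for_file(c)) {
            if (n + 4 > outsize)
                return -1;
            snprintf(out + n, 4, "%%%02X", (unsigned) c);
            n += 3;
        } else {
            if (n + 2 > outsize)
                return -1;
            out[n++] = (char) c;
        }
    }

    if (n >= outsize)
        return -1;
    out[n] = '\0';
    return 0;
}
```

With this sketch, "a%20b" becomes "a b" because a space is safe in a
file name, while "a%2Fb" stays "a%2Fb" because '/' cannot appear
inside a single path element on disk.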

Index: NEWS
===
RCS file: /pack/anoncvs/wget/NEWS,v
retrieving revision 1.38
diff -u -r1.38 NEWS
--- NEWS    2003/09/10 20:21:13 1.38
+++ NEWS    2003/09/14 21:45:48
@@ -7,8 +7,6 @@
 
 * Changes in Wget 1.9.
 
-** The build process now requires Autoconf 2.5x.
-
 ** It is now possible to specify that POST method be used for HTTP
 requests.  For example, `wget --post-data="id=foo&data=bar" URL' will
 send a POST request with the specified contents.
@@ -32,6 +30,15 @@
 
 ** The new option `--dns-cache=off' may be used to prevent Wget from
 caching DNS lookups.
+
+** The build process now requires Autoconf 2.5x.
+
+** Wget no longer quotes characters in local file names that would be
+considered unsafe as part of URL.  Quoting can still occur for
+control characters or for '/', but no longer for frequent characters
+such as space.  You can use the new option --restrict-file-names to
+enforce even stricter rules, which is useful when downloading to
+Windows partitions.
 
 * Wget 1.8.1 is a bugfix release with no user-visible changes.
 
Index: doc/wget.texi
===
RCS file: /pack/anoncvs/wget/doc/wget.texi,v
retrieving revision 1.68
diff -u -r1.68 wget.texi
--- doc/wget.texi   2003/09/10 19:41:50 1.68
+++ doc/wget.texi   2003/09/14 21:46:10
@@ -800,6 +800,39 @@
 
 If you don't understand the above description, you probably won't need
 this option.
+
+@cindex file names, restrict
+@cindex Windows file names
+@item --restrict-file-names=none|unix|windows
+Restrict characters that may occur in local file names created by Wget
+from remote URLs.  Characters that are considered @dfn{unsafe} under a
+set of restrictions are escaped, i.e. replaced with @samp{%XX}, where
+@samp{XX} is the hexadecimal code of the character.
+
+The default for this option depends on the operating system: on Unix and
+Unix-like OS'es, it defaults to ``unix''.  Under Windows and Cygwin, it
+defaults to ``windows''.  Changing the default is useful when you are
+using a non-native partition, e.g. when downloading files to a Windows
+partition mounted from Linux, or when using NFS-mounted or SMB-mounted
+Windows drives.
+
+When set to ``none'', the only characters that are quoted are those that
+are impossible to get into a file name---the NUL character and @samp{/}.
+The control characters, newline, etc. are all placed into file names.
+
+When set to ``unix'', 

Re: wget proxy support

2003-09-14 Thread Hrvoje Niksic
Nicolas, thanks for the patch; I'm about to apply it to Wget CVS.



possible bug in exit status codes

2003-09-14 Thread Dawid Michalczyk
Hello,

I'm having problems getting the exit status code to work correctly in the
following scenario. The exit code should be 1, yet it is 0.


[EMAIL PROTECTED]:~$ wget -d -t2 -r -l1 -T120 -nd -nH -R gif,zip,txt,exe,wmv,htmll,*[1-99] www.cnn.com/foo.html
DEBUG output created by Wget 1.8.2 on linux-gnu.

Enqueuing http://www.cnn.com/foo.html at depth 0
Queue count 1, maxcount 1.
Dequeuing http://www.cnn.com/foo.html at depth 0
Queue count 0, maxcount 1.
--01:00:11--  http://www.cnn.com/foo.html
           => `foo.html'
Resolving www.cnn.com... done.
Caching www.cnn.com => 64.236.16.52 64.236.16.84 64.236.16.116 64.236.24.4 64.236.24.12 64.236.24.20 64.236.24.28 64.236.16.20
Connecting to www.cnn.com[64.236.16.52]:80... connected.
Created socket 3.
Releasing 0x80809f8 (new refcount 1).
---request begin---
GET /foo.html HTTP/1.0
User-Agent: Wget/1.8.2
Host: www.cnn.com
Accept: */*
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... HTTP/1.1 404 Not found
Server: Netscape-Enterprise/6.1 AOL
Date: Mon, 15 Sep 2003 04:59:58 GMT
Content-type: text/html
Connection: close


Closing fd 3
01:00:11 ERROR 404: Not found.


FINISHED --01:00:11--
Downloaded: 0 bytes in 0 files
[EMAIL PROTECTED]:~$ echo $?
0
[EMAIL PROTECTED]:~$

Dawid Michalczyk