Re: Missing asprintf()

2008-09-09 Thread Hrvoje Niksic
Gisle Vanem [EMAIL PROTECTED] writes:

 Why the need for asprintf() in url.c:903? This function is missing
 on DOS/Win32 and nowhere to be found in ./lib.

Wget is supposed to use aprintf, which is defined in utils.c, and is
not specific to Unix.

It's preferable to use an asprintf-like function rather than a static
buffer because it is reentrant (unlike a static buffer) and imposes no
arbitrary limits on the length of the error output.
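
For illustration, a minimal sketch of the idea (this is not the actual
utils.c implementation): an asprintf-style helper measures, allocates,
and formats in one step, so there is no shared static state and no
fixed size limit.

  #include <stdio.h>
  #include <stdarg.h>
  #include <stdlib.h>

  /* Sketch only: allocate exactly enough room for the formatted
     string; the caller is responsible for freeing the result. */
  static char *
  sketch_aprintf (const char *fmt, ...)
  {
    va_list args;
    char *buf;
    int len;

    va_start (args, fmt);
    len = vsnprintf (NULL, 0, fmt, args);   /* measure */
    va_end (args);
    if (len < 0 || (buf = malloc (len + 1)) == NULL)
      return NULL;

    va_start (args, fmt);
    vsnprintf (buf, len + 1, fmt, args);    /* format */
    va_end (args);
    return buf;
  }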


Re: AW: AW: Problem mirroring a site using ftp over proxy

2008-08-12 Thread Hrvoje Niksic
Juon, Stefan [EMAIL PROTECTED] writes:

 I just noticed these debug messages:

 **
 DEBUG output created by Wget 1.10.2 on cygwin.

You are of course aware that this is not the latest Wget (1.11.4)?
As mentioned before, recursive download over FTP proxy was broken
prior to Wget 1.11.

 The point is that wget sends rather a http request than a pure ftp
 command

That's how proxying FTP normally works.


Re: About Automated Unit Test for Wget

2008-04-06 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I don't see what you see wrt making the code harder to follow and reason
 about (true abstraction rarely does, AFAICT,

I was referring to the fact that adding an abstraction layer requires
learning about the abstraction layer, both its concepts and its
implementation, including its quirks and limitations.  Overly general
abstractions added to application software tend to be underspecified
(for the domain they attempt to cover) and incomplete.
Programmers tend to ignore the hidden cost of adding an abstraction
layer until the cost becomes apparent, by which time it is too late.

Application-specific abstractions are usually worth it because they
are well-justified: they directly benefit the application by making
the code base simpler and removing duplication.  Some general
abstractions are worth it because the alternative is worse; you
wouldn't want to have two versions of SSL-using code, one for regular
sockets, and one for SSL, since the whole point of SSL is that you're
supposed to use it as if it were sockets behind the scenes.  But
adding a whole new abstraction layer over something as general as
Berkeley sockets to facilitate an automated test suite definitely
sounds like ignoring the costs of such an abstraction layer.

 I _am_ thinking that it'd probably be best to forgo the idea of
 one-to-one correspondence of Berkeley sockets, and pass around a struct
 net_connector * (and struct net_listener *), so we're not forced to
 deal with file descriptor silliness (where obviously we'd have wanted to
 avoid the values 0 through 2, and I was even thinking it might
 _possibly_ be worthwhile to allocate real file descriptors to get the
 numbers, just to avoid clashes).

I have no idea what file descriptor silliness with values 0-2 you're
referring to.  :-)  I do agree that an application-specific struct is
better than a more general abstraction because it is easier to design
and more useful to Wget in the long run.

 This would mean we'd need to separate uses of read() and write() on
 normal files (which should continue to use the real calls, until we
 replace them with the file I/O abstractions), from uses of read(),
 write(), etc on sockets, which would be using our emulated versions.
 
 Unless you're willing to spend a lot of time in careful design of
 these abstractions, I think this is a mistake.

 Why?

Because implementing a file I/O abstraction is much harder and more
time-consuming than it sounds.  To paraphrase Greenspun, it would
appear that every sufficiently large code base contains an ad-hoc,
informally-specified, bug-ridden implementation of a streaming layer.
There are streaming libraries out there; maybe we should consider
using some of them.


Re: About Automated Unit Test for Wget

2008-04-05 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Or did you mean to write wget version of socket interface?  i.e. to
 write our version of socket, connect,write,read,close,bind,
 listen,accept,,,? sorry I'm confused.

 Yes! That's what I meant. (Except, we don't need listen, accept; and
 we only need bind to support --bind-address. We're a client, not a
 server. ;) )

 It would be enough to write function-pointers for (say), wg_socket,
 wg_connect, wg_sock_write, wg_sock_read, etc, etc, and point them at
 system socket, connect, etc for real Wget, but at wg_test_socket,
 wg_test_connect, etc for our emulated servers.

This seems like a neat idea, but it should be carefully weighed
against the drawbacks.  Adding an ad-hoc abstraction layer is harder
than it sounds, and has more repercussions than is immediately
obvious.  An underspecified, unfinished abstraction layer over sockets
makes the code harder, not easier, to follow and reason about.  You no
longer deal with BSD sockets, you deal with an abstraction over them.
Is it okay to call getsockname on such a socket?  How about
setsockopt?  What about the listen/bind mechanism (which we do need,
as Daniel points out)?
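
For concreteness, here is a hypothetical sketch of the kind of
function-pointer indirection being discussed; none of these names
exist in Wget, and real code would have to answer exactly the
questions above (getsockname, setsockopt, bind/listen) before it could
replace the BSD calls.

  #include <unistd.h>

  struct net_ops {
    ssize_t (*sock_read)  (int fd, void *buf, size_t len);
    ssize_t (*sock_write) (int fd, const void *buf, size_t len);
  };

  /* Production table: the real system calls. */
  static struct net_ops real_net_ops = { read, write };

  /* A test build would install a table whose functions talk to an
     in-process fake server instead of the network. */
  static struct net_ops *net = &real_net_ops;

  #define SOCK_READ(fd, buf, len)   net->sock_read ((fd), (buf), (len))
  #define SOCK_WRITE(fd, buf, len)  net->sock_write ((fd), (buf), (len))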

 This would mean we'd need to separate uses of read() and write() on
 normal files (which should continue to use the real calls, until we
 replace them with the file I/O abstractions), from uses of read(),
 write(), etc on sockets, which would be using our emulated versions.

Unless you're willing to spend a lot of time in careful design of
these abstractions, I think this is a mistake.


Re: wget 1.11.1 make test fails

2008-04-04 Thread Hrvoje Niksic
Alain Guibert [EMAIL PROTECTED] writes:

  On Wednesday, April 2, 2008 at 23:09:52 +0200, Hrvoje Niksic wrote:

 Micah Cowan [EMAIL PROTECTED] writes:
 It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME

 The libc 5.4.33 fnmatch() supports FNM_PATHNAME, and there is code
 apparently intending to return FNM_NOMATCH on a slash. But this code
 seems to be rather broken.

Or it could be that you're picking up a different fnmatch.h that sets
up a different value for FNM_PATHNAME.  Do you have more than one
fnmatch.h installed on your system?


Re: wget 1.11.1 make test fails

2008-04-04 Thread Hrvoje Niksic
Alain Guibert [EMAIL PROTECTED] writes:

 Maybe you could put a breakpoint in fnmatch and see what goes wrong?

 The for loop intended to eat several characters from the string also
 advances the pattern pointer. This one reaches the end of the pattern,
 and points to a NUL. It is not a '*' anymore, so the loop exits
 prematurely. Just below, a test for NUL returns 0.

Thanks for the analysis.  Looking at the current fnmatch code in
gnulib, it seems that the fix is to change that NUL test to something
like:

  if (c == '\0')
    {
      /* The wildcard(s) is/are the last element of the pattern.
         If the name is a file name and contains another slash
         this means it cannot match. */
      int result = (flags & FNM_PATHNAME) == 0 ? 0 : FNM_NOMATCH;
      if (flags & FNM_PATHNAME)
        {
          if (!strchr (n, '/'))
            result = 0;
        }
      return result;
    }

But I'm not at all sure that it covers all the needed cases.  Maybe we
should simply switch to gnulib-provided fnmatch?  Unfortunately that
one is quite complex and would be quite hard to adapt for the '**'
extension Micah envisions.  There might be other fnmatch implementations out there in
GNU which are debugged but still simpler than the gnulib/glibc one.


It's kind of ironic that while the various system fnmatches were
considered broken, the one Wget was using (for many years
unconditionally!) was also broken.


Re: Stop the title from changing

2008-04-04 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 My Name? wrote:
 Hello,
 
 I was wondering if there was a way to prevent the title changing...
 wget is currently nested in another script, and would probally confuse
 the user as to why the title says wget file location is it possible
 to retain its former title? (at the top of the script is title my
 title and i would like that to remain.)

 Is this on Windows?

 Here's a copy-paste from an answer Christopher Lewis gave to someone
 asking a similar question:
[...]

Christopher describes how to hide the console window opened by Wget,
while the poster would like to prevent Wget from changing the title of
the existing console.

Maybe we should make the title-changing behavior optional.  After all,
we don't do anything of the sort on Unix, nor (IMHO) should we.


Re: wget 1.11.1 make test fails

2008-04-03 Thread Hrvoje Niksic
Alain Guibert [EMAIL PROTECTED] writes:

 This old system does HAVE_WORKING_FNMATCH_H (and thus
 SYSTEM_FNMATCH).  When #undefining SYSTEM_FNMATCH, the test still
 fails at the very same line. And then it also fails on modern
 systems. I guess this points at the embedded src/cmpt.c:fnmatch()
 replacement?

Well, it would point to a problem with both the fnmatch replacement
and the older system fnmatch.  Our fnmatch (coming from an old
release of Bash, but otherwise very well-tested, both in Bash and
Wget) is careful to special-case '/' only if FNM_PATHNAME is
specified.

Maybe you could put a breakpoint in fnmatch and see what goes wrong?


Re: wget 1.11.1 make test fails

2008-04-02 Thread Hrvoje Niksic
Alain Guibert [EMAIL PROTECTED] writes:

 Hello Micah,

  On Monday, March 31, 2008 at 11:39:43 -0700, Micah Cowan wrote:

 could you try to isolate which part of test_dir_matches_p is failing?

 The only failing src/utils.c test_array[] line is:

 | { { "*COMPLETE", NULL, NULL }, "foo/!COMPLETE", false },

 I don't understand enough of dir_matches_p() and fnmatch() to guess
 what is supposed to happen. But with false replaced by true, this
 test and following succeed.

'*' is not supposed to match '/' in regular fnmatch.

It sounds like a libc problem rather than a gcc problem.  Try
#undefing SYSTEM_FNMATCH in sysdep.h and see if it works then.


Re: wget 1.11.1 make test fails

2008-04-02 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 It sounds like a libc problem rather than a gcc problem.  Try
 #undefing SYSTEM_FNMATCH in sysdep.h and see if it works then.

 It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME: I
 mean, don't most shells rely on this to handle file globbing and
 whatnot?

The conventional wisdom among free software of the 90s was that
fnmatch() was too buggy to be useful.  For that reason all free shells
rolled their own fnmatch, as did other programs that needed it,
including Wget.  Maybe the conventional wisdom was right for the
reporter's system.

Another possibility is that something else is installing fnmatch.h in
a directory on the compiler's search path and breaking the system
fnmatch.  IIRC Apache was a known culprit that installed fnmatch.h in
/usr/local/include.  That was another reason why Wget used to
completely ignore system-provided fnmatch.

In any case, it should be easy enough to isolate the problem:

#include <stdio.h>
#include <fnmatch.h>
int main()
{
  printf("%d\n", fnmatch("foo*", "foo/bar", FNM_PATHNAME));
  return 0;
}

It should print a non-zero value.


Re: wget 1.11.1 make test fails

2008-04-02 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I'm wondering whether it might make sense to go back to completely
 ignoring the system-provided fnmatch?

One argument against that approach is that it increases code size on
systems that do correctly implement fnmatch, i.e. on most modern
Unixes that we are targeting.  Supporting I18N file names would
require modifications to our fnmatch; but on the other hand, we still
need it for Windows, so we'd have to make those changes anyway.

Providing added value in our fnmatch implementation should go a long
way towards preventing complaints of code bloat.

 In particular, it would probably resolve the remaining issue with
 that one bug you reported about fnmatch() failing on strings whose
 encoding didn't match the locale.

It would.

 Additionally, I've been toying with the idea of adding something
 like a ** to match all characters, including slashes.

That would be great.  That kind of thing is known to zsh users anyway,
and it's a useful feature.


Re: building on 32 extend 64 arch nix*

2008-03-17 Thread Hrvoje Niksic
mm w [EMAIL PROTECTED] writes:

 #if SIZEOF_VOID_P > 4
   key += (key << 44);
   key ^= (key >> 54);
   key += (key << 36);
   key ^= (key >> 41);
   key += (key << 42);
   key ^= (key >> 34);
   key += (key << 39);
   key ^= (key >> 44);
 #endif

 this one is minor, the shift count is greater than or equal to the
 uintptr_t size, /* quad needed */

What is the size of uintptr_t on your platform?  If it is 4, the code
should not be compiled on that platform.  If it is 8, the shift count
should be correct.  If it is anything else, you have some work ahead
of you.  :-)
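
A throwaway test program (not part of Wget) makes the check easy:

  #include <stdio.h>
  #include <stdint.h>

  int main (void)
  {
    /* SIZEOF_VOID_P in Wget's config.h should agree with these values. */
    printf ("sizeof (uintptr_t) = %u\n", (unsigned) sizeof (uintptr_t));
    printf ("sizeof (void *)    = %u\n", (unsigned) sizeof (void *));
    return 0;
  }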

 the second one is in src/utils.c:1490
 and I think is more problematic, integer overflow in expression

There should be no integer overflow; I suspect SIZEOF_WGINT is
incorrectly defined for you.


Re: wget aborts when file exists

2008-03-13 Thread Hrvoje Niksic
Charles [EMAIL PROTECTED] writes:

 On Thu, Mar 13, 2008 at 1:17 AM, Hrvoje Niksic [EMAIL PROTECTED] wrote:
   It assumes, though, that the preexisting index.html corresponds to
   the one that you were trying to download; it's unclear to me how
   wise that is.

  That's what -nc does.  But the question is why it assumes that
  dependent files are also present.

 Because I repeated the command, and the files have all been downloaded
 before.

We know that, but Wget 1.11 doesn't seem to check it.  It only checks
index.html, but not the other dependent files.


Re: wget aborts when file exists

2008-03-12 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 When I tried this in my wget, I got different behavior with wget 1.11
 alpha and wget 1.10.2
 
 D:\wget --proxy=off -r -l 1 -nc -np http://localhost/test/
 File `localhost/test/index.html' already there; not retrieving.
 
 
 D:\wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/
 File `localhost/test/index.html' already there; not retrieving.
 
 File `localhost/test/a.gif' already there; not retrieving.
 
 File `localhost/test/b.gif' already there; not retrieving.
 
 File `localhost/test/c.jpg' already there; not retrieving.
 
 FINISHED --20:31:41--
 Downloaded: 0 bytes in 0 files
 
 I think wget 1.10.2 behavior is more correct. Anyway it did not abort
 in my case.

 I think I like the 1.11 behavior (I'm assuming it's intentional).

Let me recap to see if I understand the difference.  From the above
output, it seems that 1.10's -r descended into an HTML even if it was
downloaded.  1.11's -r assumes that if an HTML file is already there,
then so are all the other files it references.

If this analysis is correct, I don't see the benefit of the new
behavior.  If index.html happens to be present, it doesn't mean that
the files it references are also present.  I don't know if the change
was intentional, but it looks incorrect to me.

 It assumes, though, that the preexisting index.html corresponds to
 the one that you were trying to download; it's unclear to me how
 wise that is.

That's what -nc does.  But the question is why it assumes that
dependent files are also present.


Re: Wget 1.11 on a IBM iSeries platform problems No Virus

2008-02-16 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Hrvoje Niksic wrote:
 I agree that clock_getres itself isn't important.  Still, Wget needs
 to choose a clock that actually works out of several possible clocks
 allowed by POSIX (and common extensions), so it's advisable to at
 least attempt to use the clock in some way.  If clock_getres is known
 to fail on some platforms, then we should use clock_gettime instead.

 Instead? The only time we ever use clock_getres, AFAICT, is when
 clock_gettime

I referred to the use of clock_getres in posix_init, where it's used
to figure out which clock id to use as posix_clock_id.  We could use
clock_gettime there and completely remove the usage of clock_getres.
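
A minimal sketch of that alternative (illustrative only, not the actual
posix_init code): try each candidate clock with clock_gettime and keep
the first one that works.

  #include <time.h>

  static clockid_t posix_clock_id;

  static void
  choose_clock (void)
  {
    static const clockid_t candidates[] = {
  #ifdef CLOCK_MONOTONIC
      CLOCK_MONOTONIC,
  #endif
      CLOCK_REALTIME
    };
    struct timespec ts;
    size_t i;

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
      if (clock_gettime (candidates[i], &ts) == 0)
        {
          posix_clock_id = candidates[i];   /* first clock that works */
          return;
        }
    posix_clock_id = CLOCK_REALTIME;        /* last resort */
  }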


Re: Wget 1.11 on a IBM iSeries platform problems No Virus

2008-02-14 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 2) When I download files from a URL I get the following error:
 
 Cannot get REALTIME clock frequency: Invalid argument

 I can't tell you why that'd happen; Wget falls back to a clock id that
 should be guaranteed to exist. An erroneous time.h header would perhaps
 explain it.

 The error isn't serious, though, and may safely be ignored.

 In fact: Hrvoje? What do you think about removing that warning
 altogether (or, perhaps, increasing the verbosity level required to
 issue it)? AFAICT, the clock's resolution is used in only one place,

I agree that clock_getres itself isn't important.  Still, Wget needs
to choose a clock that actually works out of several possible clocks
allowed by POSIX (and common extensions), so it's advisable to at
least attempt to use the clock in some way.  If clock_getres is known
to fail on some platforms, then we should use clock_gettime instead.

I wonder if clock_gettime works for a clock for which clock_getres
fails.


Re: CS translation fix, p - bp->buffer <= bp->width assert

2008-02-09 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 The prerelease still has a potential for crashes: in the Czech locales
 it will tend to crash if the download is large (or slow) enough to push
 minutes into the three-digit zone (that is, if it would take > 1 hour
 and 40 minutes).

How can minutes get in the three-digit zone?  Anything longer than an
hour should be printed using hours and minutes.  Anything longer than
two days should be printed using days and hours.  Anything longer than
100 days should be printed with days only.
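
Those rules amount to something like the following sketch (illustrative
only, not the eta_to_human_short code; the static buffer is just for
brevity):

  #include <stdio.h>

  static const char *
  eta_short (long secs)
  {
    static char buf[32];
    if (secs < 3600)                /* under an hour: minutes, seconds */
      snprintf (buf, sizeof buf, "%ldm %lds", secs / 60, secs % 60);
    else if (secs < 2 * 86400L)     /* under two days: hours, minutes */
      snprintf (buf, sizeof buf, "%ldh %ldm", secs / 3600, secs / 60 % 60);
    else if (secs < 100 * 86400L)   /* under 100 days: days, hours */
      snprintf (buf, sizeof buf, "%ldd %ldh", secs / 86400, secs / 3600 % 24);
    else                            /* 100 days or more: days only */
      snprintf (buf, sizeof buf, "%ldd", secs / 86400);
    return buf;
  }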

 I've removed even this potential in the current
 development sources (replacing it instead with an ugly, too-long
 progress bar that can either scroll the screen continually, or else
 leave an unerased character at the end of the line - but those are
 preferable to the crash, methinks).

Of course, but such a fix shouldn't be necessary in the first place.

The assert reveals sloppy coding on my part.  It would have been much
better to use snprintf-like code that refuses to write more than the
buffer size in the first place.

 Ideally, Wget wouldn't even scroll the screen, etc, in the face of
 two-long ETA strings; it should properly count how much space it
 has, rather than guesstimate.

It wasn't supposed to be guesstimation; eta_to_human_short was
carefully coded not to exceed the size constraint expected by
create_image.


Re: wget and sunsolve.sun.com

2008-02-08 Thread Hrvoje Niksic
Martin Paul [EMAIL PROTECTED] writes:

 Micah Cowan wrote:
 Then, how was --http-user, --http-passwd working in the past? Those only
 work with the underlying HTTP authentication protocol (the brower's
 unattractive popup dialog), which AFAIK can't be affected by CGI forms
 or JavaScript, etc.

 I must admit that I don't understand how it works - only Sun knows, I
 guess. Fact is that it accepts basic auth when it's being pushed at it
 by wget.

Very interesting.  I like the idea of a web interface supporting
*both* HTTP and cookie-based authentication.  There really should be a
way to force Wget to simply send basic authentication.  Maybe we
should differentiate between --http-user and --http-password and the
username/password being sent directly in the URL?  Whatever we do, it
might be impossible to both satisfy all use cases and avoid an
explicit option.


Re: Prerelease: Wget 1.11.1-b2080

2008-02-04 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Also: the fix to the locale/progress-bar issues resulted in the
 added use of a couple wide-character/multibyte-related functions,
 mbtowc and wcwidth.

So far Wget has avoided explicit use of wc/mb functions on the account
of portability.  Fortunately in most cases we don't really care about
character representation of strings we use, progress.c being a notable
exception.  I suggest that we check for availability of multibyte
functions and, where missing, disable NLS and define stubs such as:

#ifndef HAVE_WCHAR_T
typedef char wchar_t;
#endif
int
mbtowc(wchar_t *pwc, const char *s, size_t n)
{
  if (!s)
return 0;
  if (pwc)
*pwc = *s;
  return *s != '\0';
}

int
wcwidth(wchar_t c)
{
  return 1;
}

Disabling NLS should automatically disable the use of multibyte
functions, as well as (probably) the other way around.  That should
ensure that Wget remain usable on platforms where NLS or wide-char
support or both are broken.  After all, for most users (even those who
don't speak English) Wget remains a simple command-line network
utility, and translations are not of utmost importance to that target
audience.

 [wcwidth] is defined in SUS, but not POSIX,

FWIW, the Linux man page lists it as conforming to POSIX.1-2001.


Re: [PATCH] Reduce COW sections data by marking data constant

2008-02-01 Thread Hrvoje Niksic
Diego 'Flameeyes' Pettenò [EMAIL PROTECTED] writes:

 It is a micro-optimisation, I admit that, but it's not just the
 indirection the problem.

 Pointers, and structures containing pointers, need to be
 runtime-relocated for shared libraries and PIC code (let's assume
 that shared libraries are always PIC, for the sake of argument).

Even ignoring the fact that Wget is not a shared library, there are
ways to solve this problem other than turning all char *foo[] into
char foo[][MAXSIZE], which is, sorry, just lame and wasteful in all
but the most trivial examples.  The method described by Mart is one.
For readers of this list, it boils down to turning:

static const char *foo[] = { "one", "two", "three" };

into:

static const char foo_data[] = "one\0two\0three";
static const int foo_ind[] = {0, 4, 8};

static const char *foo(int ind) {
  return foo_data + foo_ind[ind];
}

(This technique was made popular by Ulrich Drepper.)

Maintaining the array of indices manually is cumbersome, but you could
write a tool that created the above three things from the original
const char *foo[] array without any user intervention.  But again,
that is total overkill for a non-PIC like Wget, and most likely for
smaller PICs as well.


Re: [PATCH] Reduce COW sections data by marking data constant

2008-02-01 Thread Hrvoje Niksic
Diego 'Flameeyes' Pettenò [EMAIL PROTECTED] writes:

 On 01/feb/08, at 09:12, Hrvoje Niksic wrote:

 Even ignoring the fact that Wget is not a shared library, there are
 ways to solve this problem other than turning all char *foo[] into
 char foo[][MAXSIZE], which is, sorry, just lame and wasteful in all
 but the most trivial examples.

 That's why I didn't turn _all_ of them, but just where the waste of
 space for the strings was very limited, or none at all.

I appreciate that, but from a maintainer's perspective that kind of
change adds a small maintenance burden for literally *no* gain.
Although the individual burden in each case is quite small, they tend
to accumulate.  On the other hand, the gain by this kind of change is
virtually zero and doesn't accumulate into a measurable performance
boost.  This kind of effort should be redirected toward shared
libraries where it might actually make a difference, especially for
those that are used by many programs and that contain many pointers.

Of course, the changes that introduce const without compromising
maintainability, such as the constification of the Wp declaration in
ftp-opie.c, are more than welcome.


Re: [PATCH] Reduce COW sections data by marking data constant

2008-02-01 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Note that you could also do all the pointer maths up-front, leaving
 existing usage code the same, with something like:

   static const char foo_data[] = "one\0two\0three";
   static const char *const foo[] = {foo_data + 0, foo_data + 4,
 foo_data + 8};

I believe that doesn't help because foo[] remains an array of
pointers, each of which needs to be relocated.


Re: [PATCH] Reduce COW sections data by marking data constant

2008-02-01 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Right. What I was meaning to prevent, though, is the need to do:

   foo[foo_data + foo_idx[i]]

 and instead do:

   foo[i]

That is why my example had a foo function, which turns foo[i] to
foo(i), but otherwise works the same.  Using just foo[i] is
unfortunately not possible because it requires either keeping all the
relocations (in which case the foo_data+foo_idx exercise doesn't make
sense) or resorting to the 2D array-of-char trick.


Re: wget running in windows Vista

2008-01-31 Thread Hrvoje Niksic
Christopher G. Lewis [EMAIL PROTECTED] writes:

 On Vista, you probably have to run in an administrative command
 prompt.

You mean that you need to be the administrator to run Wget?  If so,
why?  Surely other programs managed to access the network without
administrator privileges.


Re: Error with wget on AIX5.3

2008-01-23 Thread Hrvoje Niksic
Hopkins, Scott [EMAIL PROTECTED] writes:

   Worked perfect.  Thanks for the help.

Actually, I find it surprising that AIX's strdup would have such a
bug, and that it would go undetected.  It is possible that the problem
lies elsewhere and that the change is just masking the real bug.
strdup can be easily tested with a program such as:

#include <stdio.h>
#include <string.h>

int main()
{
  const char *empty = "";
  printf("%p\n", strdup(empty));
  return 0;
}

Please compile the program with the compiler and compilation flags
that Wget uses.  If it prints zero, it's an AIX strdup problem;
otherwise, the problem is probably somewhere else.


Re: Error with wget on AIX5.3

2008-01-23 Thread Hrvoje Niksic
Hopkins, Scott [EMAIL PROTECTED] writes:

 Interesting.  Compiled that code and I get the following when running
 the resulting binary. 

   /var/opt/prj/wget$ strdup_test
   20001448

As I suspected.  Such an obvious strdup bug would likely have been
detected sooner.

 I appear to have a functioning wget binary with the strdup change to
 config.h, but I'm curious what you think the other causes of this
 problem could be.

Hard to tell.  Some crashes, especially those resulting from memory
corruption bugs, can disappear when you change *anything* about the
build.  Of course, they tend to reappear later as well.  If you're
curious about debugging this, you can compile Wget with DEBUG_MALLOC
defined, which will at least catch some obvious errors, such as double
free.  Even better would be to run Wget under a real memory debugger
such as valgrind or purify, but I don't know if you have access to one
under AIX.


Re: Percentage in password

2007-12-15 Thread Hrvoje Niksic
Marcus [EMAIL PROTECTED] writes:

 Is there some way I can WGET to work with a percentage sign in the password?

 I.e. WGET ftp://login:[EMAIL PROTECTED]/file.txt

Yes, escape the percentage as %25:

wget ftp://login:[EMAIL PROTECTED]/file.txt

(This is not specific to Wget; '%' is the hex escape character in
URLs.)


Re: wget -Y0

2007-12-12 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 What's up with the -Y option?

IIRC it used to be the option to turn on the use of proxies.  I
retained it for compatibility because many people were using `-Y on'
in their scripts.  It might be the time to retire that option and only
leave the --no-proxy variant documented (since the default is true).

   -Y,  --proxy   explicitly turn on proxy.
--no-proxyexplicitly turn off proxy.

 The problem with that message is that it doesn't indicate the fact
 that -Y requires a boolean argument.
[...]
 If this is the case, shouldn't it also be removed from the output of
 --help as well? Otherwise, what would be the best way to indicate
 that it requires an argument?

To be consistent with how the help messages work, it should probably
say:

-Y,  --proxy=on/off  control use of proxy (normally enabled)

and rely that everyone has understood the Mandatory arguments to long
options are mandatory for short options too sentence.  But it's still
inconsistent with other boolean options, where the short form simply
enables the option (or disables it if enabled by default) and takes no
arguments.

-Y should probably just be removed.


Re: Content disposition question

2007-12-10 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Actually, the reason it is not enabled by default is that (1) it is
 broken in some respects that need addressing, and (2) as it is currently
 implemented, it involves a significant amount of extra traffic,
 regardless of whether the remote end actually ends up using
 Content-Disposition somewhere.

I'm curious, why is this the case?  I thought the code was refactored
to determine the file name after the headers arrive.  It certainly
looks that way by the output it prints:

{mulj}[~]$ wget www.cnn.com
[...]
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html'   # note: "Saving to" appears only after the HTTP response

Where does the extra traffic come from?

 Note that it is not available at all in any release version of Wget;
 only in the current development versions. We will be releasing Wget 1.11
 very shortly, which will include the --content-disposition
 functionality; however, this functionality is EXPERIMENTAL only. It
 doesn't quite behave properly, and needs some severe adjustments before
 it is appropriate to leave as default.

If it is not ready for general use, we should consider removing it
from NEWS.  If not, it should be properly documented in the manual.  I
am aware that the NEWS entry claims that the feature is experimental,
but why even mention it if it's not ready for general consumption?
Announcing experimental features in NEWS is a good way to make testers
aware of them during the alpha/beta release cycle, but it should be
avoided in production releases of mature software.

 As to breaking old scripts, I'm not really concerned about that (and
 people who read the NEWS file, as anyone relying on previous
 behaviors for Wget should do, would just need to set
 --no-content-disposition, when the time comes that we enable it by
 default).

Agreed.


NEWS file

2007-12-10 Thread Hrvoje Niksic
I've noticed that the NEWS file now includes contents that would
previously not have been included.  NEWS was conceived as a resource
for end users, not for developers or distribution maintainers.  (Other
GNU software seems to follow a similar policy.)  I tried hard to keep
it readable by only including important or at least relevant entries,
sorted roughly by descending importance.  Developer information can be
obtained through other means: the web page, the version control logs,
and the detailed ChangeLogs we keep.

The recent entries were added to the front, without regard for
relative importance.  For example, NEWS now begins with announcement
of the move to Mercurial, the new Autoconf 2.61 requirement, and the
removal of PATCH and TODO files (!).  These entries are relevant
to developers, but almost completely meaningless to end users.

If there is a need to include developer information in NEWS, I suggest
that it be pushed to the bottom of the list, perhaps under a
Development information section.


GnuTLS

2007-12-10 Thread Hrvoje Niksic
If GnuTLS support will not be ready for the 1.11 release, may I
suggest that we not advertise it in NEWS?  After all, it's badly
broken in that it doesn't support certificate validation, which is one
of the most important features of an SSL client.  It also doesn't
support many of our SSL command-line options, which makes Wget almost
broken, https-wise, under GnuTLS.  IMO announcing such unfinished work
brings more harm than good in a stable release.


Re: Content disposition question

2007-12-10 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I thought the code was refactored to determine the file name after
 the headers arrive.  It certainly looks that way by the output it
 prints:
 
 {mulj}[~]$ wget www.cnn.com
 [...]
 HTTP request sent, awaiting response... 200 OK
 Length: unspecified [text/html]
  Saving to: `index.html'   # note: "Saving to" appears only after the HTTP response
 
 Where does the extra traffic come from?

 Your example above doesn't set --content-disposition;

I'm aware of that, but the above example was supposed to point out the
refactoring that has already taken place, regardless of whether
--content-disposition is specified.  As shown above, Wget always waits
for the headers before determining the file name.  If that is the
case, it would appear that no additional traffic is needed to get
Content-Disposition, Wget simply needs to use the information already
received.

 As to why this is the case, I believe it was so that we could
 properly handle accepts/rejects,

Issuing another request seems to be the wrong way to go about it, but
I haven't thought about it hard enough, so I could be missing a lot of
subtleties.

 I am aware that the NEWS entry claims that the feature is experimental,
 but why even mention it if it's not ready for general consumption?
 Announcing experimental features in NEWS is a good way to make testers
 aware of them during the alpha/beta release cycle, but it should be
  avoided in production releases of mature software.

 It's pretty much good enough; it's not where I want it, but it
 _is_ usable. The extra traffic is really the main reason I don't
 want it on-by-default.

It should IMHO be documented, then.  Even if it's documented as
experimental.


Re: Wget exit codes

2007-12-09 Thread Hrvoje Niksic
Gerard [EMAIL PROTECTED] writes:

 In particular, if Wget chooses not to download a file because the
 local timestamp is still current, or because its size corresponds
 to that of the remote file, these should result in an exit status
 of zero.

 I disagree. If wget has not downloaded a file, exiting with zero
 could lead the end user to believe that it had.

Specifying `-N' means download if needed.  There is no reason to
report a non-zero exit status if there was no need to perform the
download.  It is simply not an error condition, it is one of the two
success conditions (the other being download of the new contents).

 I disagree again. If wget did not download a file, no matter what
 the reason, then it should not exit with zero.  I have written
 several scripts that utilize wget to download files. Because wget
 fails to issue a useful code upon completion, I am forced to use
 hacks to find out what actually transpired.  Curl utilizes certain
 error codes, # 73 for instance, that are quite useful.

I agree that Wget should allow the caller to find out what happened,
but I don't think exit codes can be of much use there.  For one, they
don't allow distinction between different successful conditions,
which is a problem in many cases.  Also, their meaning is much harder
to define in presence of multiple downloads (wget URL1 URL2...).


Re: Wget exit codes

2007-12-09 Thread Hrvoje Niksic
R Kimber [EMAIL PROTECTED] writes:

 I agree that Wget should allow the caller to find out what
 happened, but I don't think exit codes can be of much use there.
 For one, they don't allow distinction between different
 successful conditions, which is a problem in many cases.

 I'm not sure I understand this. Why is it that there cannot be
 different exit codes for different 'successful' conditions?

Because by Unix convention success is indicated by exit status 0.
When a process exits with 0, scripts started with `sh -e' or tests
such as `wget URL || exit $?' won't fail.  Exiting with any non-zero
exit status on success would cause spurious failures to be reported.
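
A caller in C runs into the same convention (hypothetical URL; sketch
only): any non-zero exit status is treated as a failure.

  #include <stdlib.h>
  #include <stdio.h>
  #include <sys/wait.h>

  int main (void)
  {
    /* If wget exited non-zero on a successful-but-skipped download,
       this caller would wrongly report a failure. */
    int status = system ("wget -q http://example.com/");
    if (status != -1 && WIFEXITED (status) && WEXITSTATUS (status) == 0)
      puts ("success");
    else
      puts ("failure");
    return 0;
  }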


Re: fnmatch and non-ASCII characters in .listing

2007-12-01 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Hrvoje Niksic wrote:
 A Wget user showed me an example of Wget misbehaving.

 Hrvoje, do you know if this is a regression over 1.10.2?

I don't think so, but it's probably a regression over 1.9.x.  In 1.10
Wget started to set up the locale by calling setlocale(LC_ALL...), and
it's biting us in places like this.

 I'm guessing the use of fnmatch() being locale-dependant isn't; what
 about the rest?

The error message could be a regression, but I didn't actually check.
The faulty logic that uses fnmatch even when strcmp would suffice
looks like it's been inherited since very early versions.

 And; do you think this is important enough to put in for 1.11,
 possibly delaying its release and risking further bugs; or should it
 wait a couple weeks for 1.11.x or 1.12?

If 1.11 is frozen, this is not important enough to break the freeze,
because it only happens in very specific circumstances.  It looks like
material for a 1.11.1.


Re: wget2

2007-11-30 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

 I vote we stick with C. Java is slower and more prone to environmental
 problems.

 not really. because of its JIT compiler, Java is often as fast as
 C/C++, and sometimes even significantly faster.

Not if you count startup time, which is crucial for a program like
Wget.  Memory use is also incomparable.


Re: bug on wget

2007-11-21 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 The new Wget flags empty Set-Cookie as a syntax error (but only
 displays it in -d mode; possibly a bug).

 I'm not clear on exactly what's possibly a bug: do you mean the fact
 that Wget only calls attention to it in -d mode?

That's what I meant.

 I probably agree with that behavior... most people probably aren't
 interested in being informed that a server breaks RFC 2616 mildly;

Generally, if Wget considers a header to be in error (and hence
ignores it), the user probably needs to know about that.  After all,
it could be the symptom of a Wget bug, or of an unimplemented
extension the server generates.  In both cases I as a user would want
to know.  Of course, Wget should continue to be lenient towards syntax
violations widely recognized by popular browsers.

Note that I'm not arguing that Wget should warn in this particular
case.  It is perfectly fine to not consider an empty `Set-Cookie' to
be a syntax error and to simply ignore it (and maybe only print a
warning in debug mode).


Re: bug on wget

2007-11-20 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I was able to reproduce the problem above in the release version of
 Wget; however, it appears to be working fine in the current
 development version of Wget, which is expected to release soon as
 version 1.11.*

I think the old Wget crashed on empty Set-Cookie headers.  That got
fixed when I converted the Set-Cookie parser to use extract_param.
The new Wget flags empty Set-Cookie as a syntax error (but only
displays it in -d mode; possibly a bug).


Re: .1, .2 before suffix rather than after

2007-11-16 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 Hrvoje Niksic wrote:
  And how is .tar.gz renamed?  .tar-1.gz?
 Ouch.

 OK. I'm responding to the chain and not Hrvoje's expression of pain. :-)

 What if we changed the semantics of --no-clobber so the user could specify
 the behavior? I'm thinking it could accept the following strings:
 - after: append a number after the file name (current behavior)
 - before: insert a number before the suffix

But see Andreas's post quoted above: the term suffix is ambiguous.
In foo.tar.gz, what is the suffix?  How about .emacs.el?  And
Heroes.S203.DivX.avi?

Currently implemented name mangling is far from perfect, but it's easy
to understand, to recognize, and to reverse.  One other possibility
that offers the same features would be to put the number before the
file, such as 1.foo.html instead of foo.html.1; but that seems
hardly an improvement.

 - new: change name of new file (current behavior)
 - old: change name of old file

It would be nice to be able to change the name of the old file, but
when you start to consider the consequences, it gets trickier.  What
do you do when you have many files left over from previous runs, such
as foo, foo.1, foo.2, etc.?  Handling it correctly would trigger a
flurry of renames, which would need to be carried out in the correct
order, be prepared to handle a rename failing, and to detect changed
conditions in mid-run.  In general it seems like bad design to need to
touch many files in order to simply download one.  Maybe the improved
end user experience makes it worth it, but at this point I'm not
convinced of it.

 Back to the painful point at the start of this note, I think we
 treat .tar.gz as a suffix and if --no-clobber=before is specified,
 the file name becomes .1.tar.gz.

But see my other examples above.


Re: .1, .2 before suffix rather than after

2007-11-06 Thread Hrvoje Niksic
Andreas Pettersson [EMAIL PROTECTED] writes:

 And how is .tar.gz renamed?  .tar-1.gz?

Ouch.


Re: .1, .2 before suffix rather than after

2007-11-05 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 It just occurred to me that this change breaks backward compatibility.
 It will break scripts that try to clean up after Wget or that in any
 way depend on the current naming scheme.

 It may. I am not going to commit to never ever changing the current
 naming scheme.

Agreed, but there should be a very good reason for changing it, and
the change should be a clear improvement.  In my view, neither is the
case here.  For example, the change to respect the Content-Disposition
header constitutes a good reason[1].


Re: .1, .2 before suffix rather than after

2007-11-04 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Christian Roche has submitted a revised version of a patch to modify
 the unique-name-finding algorithm to generate names in the pattern
 foo-n.html rather than foo.html.n. The patch looks good, and
 will likely go in very soon.

foo.html.n has the advantage of simplicity: you can tell at a glance
that foo.n is a duplicate of foo.  Also, it is trivial to remove
the unwanted files by removing foo.*.  Why change what worked so
well in the past?

 A couple of minor detail questions: what do you guys think about using
 foo.n.html instead of foo-n.html?

Better, but IMHO not as good as foo.html.n.  But I'm obviously biased.
:-)


Re: .1, .2 before suffix rather than after

2007-11-04 Thread Hrvoje Niksic
Hrvoje Niksic [EMAIL PROTECTED] writes:

 Micah Cowan [EMAIL PROTECTED] writes:

 Christian Roche has submitted a revised version of a patch to modify
 the unique-name-finding algorithm to generate names in the pattern
 foo-n.html rather than foo.html.n. The patch looks good, and
 will likely go in very soon.

 foo.html.n has the advantage of simplicity: you can tell at a glance
 that foo.n is a duplicate of foo.  Also, it is trivial to remove
 the unwanted files by removing foo.*.

It just occurred to me that this change breaks backward compatibility.
It will break scripts that try to clean up after Wget or that in any
way depend on the current naming scheme.


Re: More portability stuff [Re: gettext configuration]

2007-10-27 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Or getting the definition requires defining a magic preprocessor
 symbol such as _XOPEN_SOURCE.  The man page I found claims that the
 function is defined by XPG4 and links to standards(5), which
 explicitly documents _XOPEN_SOURCE.

 Right. But we set that unconditionally in sysdep.h,

Only if you made it so.  The config-post.h code only set it on systems
where that's known to be safe, currently Linux and Solaris.  (The
reason was that some systems, possibly even Tru64, failed to compile
with _XOPEN_SOURCE set.)
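
For illustration, the guard looked roughly like this (hypothetical
sketch, not the literal config-post.h contents):

  /* Request X/Open definitions only where doing so is known to be
     safe; some platforms' headers broke when the macro was defined. */
  #if defined __linux__ || defined __sun
  # ifndef _XOPEN_SOURCE
  #  define _XOPEN_SOURCE 500
  # endif
  #endif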

Also note that Autoconf tests don't include sysdep.h, so the test
could still be failing.  It would be worth investigating why curl's
Autoconf test passes and ours (probably) doesn't.


Re: More portability stuff [Re: gettext configuration]

2007-10-27 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I can't even begin to fathom why some system would fail to compile
 in such an event: _XOPEN_SOURCE is a feature request, not a
 guarantee that you'll get some level of POSIX.

Yes, but sometimes the system headers are buggy.  Or sometimes they
work just fine with the system compiler, but not so well with GCC.  I
don't know which was the case at the time, but I remember that
compilation failed with _XOPEN_SOURCE and worked without it.

 Do you happen to remember the system?

If I remember correctly, the system was a (by current standards) old
version of Tru64.  The irony.  :-)

 I'd rather always define it, except for the systems where we know it
 fails, rather than just define it where it's safe.

I agree that that would be a better default now that many other
programs unconditionally define _XOPEN_SOURCE.  At the time I only
defined _XOPEN_SOURCE to get rid of compilation warnings under Linux
and Solaris.  After encountering the errors mentioned above, it seemed
safer to only define it where doing so was known not to cause
problems.


Re: More portability stuff [Re: gettext configuration]

2007-10-26 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Okay... but I don't see the logic of:

   1. If the system has POSIX's sigsetjmp, use that.
   2. Otherwise, just assume it has the completely unportable, and not
 even BSDish, siggetmask.

Are you sure siggetmask isn't BSD-ish?  When I tested that code on
various Unix systems, the only one without sigsetjmp was Ultrix, and
it had siggetmask.  Linux man page claims siggetmask to belong to the
BSD signal API and the headers expose it when _BSD_SOURCE is
defined.
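
For context, a rough sketch of the fallback in question (not the
literal utils.c code): with sigsetjmp the signal mask is restored
automatically on siglongjmp; without it, the SIGALRM handler has to
unblock the signal by hand, which is where siggetmask/sigsetmask come
in.

  #include <setjmp.h>
  #include <signal.h>

  #ifdef HAVE_SIGSETJMP
  static sigjmp_buf timeout_env;
  static void
  handle_alarm (int sig)
  {
    siglongjmp (timeout_env, 1);    /* mask restored automatically */
  }
  #else
  static jmp_buf timeout_env;
  static void
  handle_alarm (int sig)
  {
    /* Plain longjmp would leave SIGALRM blocked; unblock it first
       using the BSD-style calls. */
    sigsetmask (siggetmask () & ~sigmask (SIGALRM));
    longjmp (timeout_env, 1);
  }
  #endif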

 AFAIK, _no_ system supports POSIX 100%,

In case it's not obvious, I was trying to make the code portable to
real Unix and Unix-like systems.  So, the logic you don't see happened
to cover both POSIX and all non-POSIX systems I laid my hands on, or
heard of.  Wget was ported to *very* strange systems, and I don't
remember problems with run_with_timeout.

 At least sigblock(0) is more portable,

What makes you say that?

 And saying that VMS should implement its own completely separate
 run_with_timeout just

I know nothing of VMS.  If it's sufficiently different from Unix that
it has wildly different alarm/signal facilities, or no alarm/signal at
all (as is the case with Windows), then it certainly makes sense for
Wget to provide a VMS-specific run_with_timeout and use it on VMS.
Exactly as it's now done with Windows.

 because it lacks an unportable facility doesn't make sense--besides
 which, we're talking about a Unix here (Tru64), not VMS (yet).

Do you say that Tru64 lacks both sigsetjmp and siggetmask?  Are you
sure about that?


Re: More portability stuff [Re: gettext configuration]

2007-10-26 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I know nothing of VMS.  If it's sufficiently different from Unix that
 it has wildly different alarm/signal facilities, or no alarm/signal at
 all (as is the case with Windows), then it certainly makes sense for
 Wget to provide a VMS-specific run_with_timeout and use it on VMS.
 Exactly as it's now done with Windows.

 Not when we can use a more portabile facility to make both systems
 happy.

That's why I said *if* it's sufficiently different from Unix ...
It obviously isn't if it only differs in the way that signal masks
need to be restored after longjmping from a signal handler.

 Doesn't have siggetmask() nor sigsetjmp() != wildly different
 alarm/signal facilities.

Of course.  I simply wasn't aware of such a case when I was writing
the code.  I'm not claiming the current code is perfect, I'm just
trying to explain the logic behind it.

 because it lacks an unportable facility doesn't make sense--besides
 which, we're talking about a Unix here (Tru64), not VMS (yet).
 
 Do you say that Tru64 lacks both sigsetjmp and siggetmask?  Are you
 sure about that?

 That is the only system we are currently talking about.

I find it hard to believe that Tru64 lacks both of those functions;
for example, see
http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51_HTML/MAN/MAN3/0707.HTM

It is quite possible that the Autoconf test for sigsetjmp yields a
false negative.


Re: More portability stuff [Re: gettext configuration]

2007-10-26 Thread Hrvoje Niksic
Daniel Stenberg [EMAIL PROTECTED] writes:

 It is quite possible that the Autoconf test for sigsetjmp yields a
 false negative.

 I very much doubt it does, since we check for it in the curl
 configure script,

Note that I didn't mean in general.  Such bugs can sometimes show in
one program or test system, but not in another, depending on
previously run tests (which influence headers included by test
programs), version of Autoconf, or issues with the tester's
installation.

 and I can see the output from it running on Tru64 clearly state:

 checking for sigsetjmp... yes

It is my understanding that Steven got an error stating that
siggetmask is nonexistent, and siggetmask is used only if not
HAVE_SIGSETJMP.  Since, according to your test, Tru64 indeed does have
sigsetjmp, it only confirms my suspicion that Autoconf gets it wrong,
at least for that particular combination of Wget and the tester's
Tru64 installation.


Re: More portability stuff [Re: gettext configuration]

2007-10-26 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Note that curl provides the additional check for a macro version in
 the configure script, rather than in the source; we should probably
 do it that way as well. I'm not sure how that helps for this,
 though: if the above test is failing, then either it's a function
 (no macro) and configure isn't picking it up; or else it's not
 defined in setjmp.h.

Or getting the definition requires defining a magic preprocessor
symbol such as _XOPEN_SOURCE.  The man page I found claims that the
function is defined by XPG4 and links to standards(5), which
explicitly documents _XOPEN_SOURCE.


Re: config-post.h + gnulib breaks separate build dirs

2007-10-20 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Steven Schweda has started some testing on Tru64, and uncovered some
 interesting quirks; some of them look like flaws I've introduced,
 and others are bugginess in the Tru64 environment itself. It's
 proving very helpful. :)

Is the exchange off-list or on a list I'm not following?  I'm somewhat
interested in portability myself, so I'd like to follow it.


Re: config-post.h + gnulib breaks separate build dirs

2007-10-19 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Is there any reason we can't move the contents of config-post.h into
 sysdep.h, and have the .c files #include wget.h at the top, before any
 system headers?

wget.h *needs* stuff from the system headers, such as various system
types.  If you take into account that it includes sysdep.h, it needs
much more.

I don't see the problem with config-post.h, other than gnulib
brokenness.  It does exactly what it was designed to do.


Re: Port range option in bind-address implemented?

2007-10-19 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Yes, that appears to work quite well, as long as we seed it right;
 starting with a consistent X₀ would be just as bad as trying them
 sequentially, and choosing something that does not change several times
 a second (such as time()) still makes it likely that multiple
 invocations will choose the same first port. Probably, /dev/random as
 first choice, falling back to by gettimeofday() where that's available.
 I don't know what Windows would use.

Wget already contains high-resolution timer code for Windows; see
src/ptimer.c.


Re: config-post.h + gnulib breaks separate build dirs

2007-10-19 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Could you be more specific? AFAICT, wget.h #includes the system headers
 it needs. Considering the config-post.h stuff went at the top of the
 sysdep.h, sysdep.h is already at the top of wget.h,

OK, it should work then.  The reasoning behind my worrying is the
following: in some (rare) cases, you need to make decisions and define
preprocessor *before* including anything.  In other cases, you need to
base the decisions on the contents of header files, *after* having
included everything.  Case #1 used to be handled by config-post.h (and
in some cases config.h), and case #2 by sysdep.h.  You have now merged
them, which I don't necessarily see as a good thing.
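
An illustrative example of the two cases (hypothetical macros, not
Wget's actual ones):

  /* Case #1: must be decided before any system header is included,
     because it changes what the headers declare. */
  #define _GNU_SOURCE 1
  #include <limits.h>
  #include <stdio.h>

  /* Case #2: can only be decided after the headers have been seen. */
  #ifndef PATH_MAX
  # define PATH_MAX 4096   /* fallback for headers that omit it */
  #endif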

 working fine on my system (passes make distcheck, which is _quite_
 rigorous)

That rigor has nothing to do with portability, though.  It only
demonstrates that Wget correctly builds on *your* system.


Re: version.c take two

2007-10-16 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 version.c:  $(wget_SOURCES) $(LDADD)
printf '%s' 'const char *version_string = "@VERSION@' > $@
-hg log -r tip --template=' ({node|short})' >> $@
printf '%s\n' '";' >> $@

printf is not portable to older systems, but that may not be a
problem anymore.  What are the current goals regarding portability?


Re: version.c take two

2007-10-16 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I may take liberties with the Make environment, and assume the
 presence of a GNU toolset, though I'll try to avoid that where it's
 possible.

Requiring the GNU toolset puts a large burden on the users of non-GNU
systems (both free and non-free ones).  Please remember that for many
Unix users and sysadmins Wget is one of the core utilities, to be
compiled very soon after a system is set up.  Each added build
dependency makes Wget that much harder to compile on a barebones
system.

 In cases like this, printf is much more portable (in behavior) than
 echo, but not as dependable (on fairly old systems) for presence;
 however, it's not a difficult tool to obtain, and I wouldn't mind
 making it a prerequisite for Wget (on Unix systems, at any rate). In
 a pinch, one could write an included tool (such as an echo command
 that does precisely what we expect) to help with building. But
 basically, if it's been in POSIX a good while, I'll probably expect
 it to be available for the Unix build.

Such well-intended reasoning tends to result in a bunch of reports
about command/feature X not being present on the reporter's system, or
about a bogus version that doesn't work being picked up, etc.  But
maybe the times have changed -- we'll see.


Re: version.c take two

2007-10-16 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Alright; I'll make an extra effort to avoid non-portable Make
 assumptions then. It's just... portable Make _sucks_ (not that
 non-portable Make doesn't).

It might be fine to require GNU make if there is a good reason for it
-- many projects do.  But requiring random bits and pieces of the GNU
toolchain, such as one or more of GNU Bash, GNU grep, GNU tar, or,
well, printf :-), in most cases simply causes annoyance for very
little added value.  Junior developers, or those only exposed to
Linux, frequently simply assume that everyone has access to the tools
they use on their development system, and fail to document that
assumption.  I'm sure we can do better than that.


Re: [Patch] Plug some memleaks

2007-10-16 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Note that, technically, those are not leaks in real need of
 plugging because they get called only once, i.e. they do not
 accumulate (leak) unused memory.  Of course, it's still a good
 idea to remove them, if nothing else, then to remove false
 positives from DEBUG_MALLOC builds.

 I love that valgrind distinguishes these from  unreachable unfreed
 memory.

By the way, now that valgrind exists, it may be time to consider
retiring DEBUG_MALLOC, a venerable hack from pre-valgrind days.  I'm
kind of surprised that anyone even uses it.  :-)


Re: Version tracking in Wget binaries

2007-10-15 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Make my src changes, create a changeset... And then I'm lost...

 Alright, so you can make your changes, and issue an hg diff, and
 you've basically got what you used to do with svn.

That is not quite true, because with svn you could also do svn
commit to upload your changes to the global repository seen by
everyone.  It is my understanding that with the distributed VC's,
the moral equivalent of svn commit is only to be done by the
maintainer, by pulling (cherry-picking) the patches of various
contributors.  To me that sounds: a) horribly error-prone if the
maintainer doesn't have access to firewalled checkouts of various
contributors (patches can and do misapply), and b) actually *more*
centralized than CVS/svn!

It is most likely the case that I simply didn't (yet) get the DVCS
way of doing things.


Re: wget default behavior

2007-10-14 Thread Hrvoje Niksic
Tony Godshall [EMAIL PROTECTED] writes:

 OK, so let's go back to basics for a moment.

 wget's default behavior is to use all available bandwidth.

And so is the default behavior of curl, Firefox, Opera, and so on.
The expected behavior of a program that receives data over a TCP
stream is to consume data as fast as it arrives.


Re: PATCHES file removed

2007-10-13 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 FYI, I've removed the PATCHES file. Not because I don't think it's
 useful, but because the information needed updating (now that we're
 using Mercurial rather than Subversion), I expect it to be updated
 again from time to time, and the Wgiki seems to be the right place
 to keep changing documentation
 (http://wget.addictivecode.org/PatchGuidelines).

 It's still obviously useful to have patch-submission information
 included as part of the Wget distribution itself;

It would be nice for the distribution to contain that URL on a
prominent place, such as in the README, or even a stub PATCHES file.

 Speaking of which, I've replaced the MAILING-LISTS file,
 regenerating it from the Mailing Lists section of the Texinfo
 manual. I suspect it had previously been generated from source, but
 it's not clear to me from what (perhaps the web page?), or what tool
 was used.

It was simply hand-written.  :-)


Re: Version tracking in Wget binaries

2007-10-12 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Among other things, version.c is now generated rather than
 parsed; it is regenerated every time make all is run, which also means
 that make all will always relink the wget binary, even if there haven't
 been any changes.

I personally find that quite annoying.  :-(  I hope there's a very
good reason for introducing that particular behavior.

BTW does that mean that, for example, running `make install', also
attempts to relink Wget?


Re: working on patch to limit to percent of bandwidth

2007-10-12 Thread Hrvoje Niksic
Tony Godshall [EMAIL PROTECTED] writes:

  available bandwidth and adjusts to that.  The usefulness is in
  trying to be unobtrusive to other users.

 The problem is that Wget simply doesn't have enough information to be
 unobtrusive.  Currently available bandwidth can and does change as new
 downloads are initiated and old ones are turned off.  Measuring
 initial bandwidth is simply insufficient to decide what bandwidth is
 really appropriate for Wget; only the user can know that, and that's
 what --limit-rate does.

 My patch (and the doc change in my patch) don't claim to be totally
 unobtrusive [...] Obviously people who need the level of unobtrusiveness
 you define shouldn't be using it.

It was never my intention to define a particular level of
unobtrusiveness; the concept of being unobtrusive to other users was
brought up by Jim and I was responding to that.  My point remains that
the maximum initial rate (however you define initial in a protocol
as unreliable as TCP/IP) can and will be wrong in a large number of
cases, especially on shared connections.  Not only is it impossible to
be totally unobtrusive, but any *automated* attempts at being nice
to other users are doomed to failure, either by taking too much (if
the download starts when you're alone) or too little (if the download
starts with shared connection).


Re: working on patch to limit to percent of bandwidth

2007-10-12 Thread Hrvoje Niksic
Tony Godshall [EMAIL PROTECTED] writes:

 My point remains that the maximum initial rate (however you define
 initial in a protocol as unreliable as TCP/IP) can and will be
 wrong in a large number of cases, especially on shared connections.

 Again, would an algorithm where the rate is re-measured periodically
 and the initial-rate-error criticism were therefore addressed reduce
 your objection to the patch?

Personally I don't see the value in attempting to find out the
available bandwidth automatically.  It seems too error prone, no
matter how much heuristics you add into it.  --limit-rate works
because reading the data more slowly causes it to (eventually) also be
sent more slowly.  --limit-percentage is impossible to define in
precise terms, there's just too much guessing.
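
To illustrate, the existing knob is explicit about what the user wants,
e.g.:

    wget --limit-rate=200k URL

where 200k is simply whatever share of the link the user has decided to
leave to Wget -- no guessing involved.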


Re: working on patch to limit to percent of bandwidth

2007-10-10 Thread Hrvoje Niksic
Jim Wright [EMAIL PROTECTED] writes:

 - --limit-rate will find your version handy, but I want to hear from
 them. :)

 I would appreciate and have use for such an option.  We often access
 instruments in remote locations (think a tiny island in the Aleutians)
 where we share bandwidth with other organizations.

A limitation in percentage doesn't make sense if you don't know
exactly how much bandwidth is available.  Trying to determine full
bandwidth and backing off from there is IMHO doomed to failure because
the initial speed Wget gets can be quite different from the actual
link bandwidth, at least in a shared link scenario.  A --limit-percent
implemented as proposed here would only limit the retrieval speed to
the specified fraction of the speed Wget happened to get at the
beginning of the download.  That is not only incorrect, but also quite
non-deterministic.

If there were a way to query the network for the connection speed, I
would support the limit-percent idea.  But since that's not
possible, I think it's better to stick with the current --limit-rate,
where we give the user an option to simply tell Wget how much
bandwidth to consume.


Re: working on patch to limit to percent of bandwidth

2007-10-10 Thread Hrvoje Niksic
Jim Wright [EMAIL PROTECTED] writes:

 I think there is still a case for attempting percent limiting.  I
 agree with your point that we can not discover the full bandwidth of
 the link and adjust to that.  The approach discovers the current
 available bandwidth and adjusts to that.  The usefulness is in
 trying to be unobtrusive to other users.

The problem is that Wget simply doesn't have enough information to be
unobtrusive.  Currently available bandwidth can and does change as new
downloads are initiated and old ones are turned off.  Measuring
initial bandwidth is simply insufficient to decide what bandwidth is
really appropriate for Wget; only the user can know that, and that's
what --limit-rate does.


Re: bug in escaped filename calculation?

2007-10-04 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 It is actually illegal to specify byte values outside the range of
 ASCII characters in a URL, but it has long been historical practice
 to do so anyway. In most cases, the intended meaning was one of the
 latin character sets (usually latin1), so Wget was right to do as it
 does, at that time.

Your explanation is spot-on.  I would only add that Wget's
interpretation of what is a control character is not so much geared
toward Latin 1 as it is geared toward maximum safety.  Originally I
planned to simply encode *all* file name characters outside the 32-127
range, but in practice it was very annoying (not to mention
US-centric) to encode perfectly valid Latin 1/2/3/... as %xx.  Since
the codes 128-159 *are* control characters (in those charsets) that
can mess up your screen and that you wouldn't want seen by default, I
decided to encode them by default, but allow for a way to turn it off,
in case someone used a different charset.
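
(If memory serves, that escape hatch is the --restrict-file-names option;
something like

    wget --restrict-file-names=nocontrol URL

should disable the escaping of the 128-159 range.)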

In the long run, supporting something like IRIs is surely the right
thing to go for, but I have a feeling that we'll be stuck with the
current messy URLs for quite some time to come.  So Wget simply needs
to adapt to the current circumstances.  If the locale includes UTF-8
in any shape or form, it is perfectly safe to assume that it's valid
to create UTF-8 file names.  Of course, we don't know if a particular
URL path sequence is really meant to be UTF-8, but there should be no
harm in allowing valid UTF-8 sequences to pass through.  In other
words, the default "quote control" policy could simply be smarter
about what "control" means.

One consequence would be that Wget creates differently-named files in
different locales, but it's probably a reasonable price to pay for not
breaking an important expectation.  Another consequence would be
making users open to IDN homograph attacks, but I don't know if that's
a problem in the context of creating file names (the IDN homograph
attack is normally defined as a misrepresentation of who you communicate
with).

For those who want to hack on this, the place to look at is
url.c:append_uri_pathel; that strangely-named function takes a path
element (a directory name or file name component of the URL) and
appends it to the file name.  It takes care not to ever use .. as a
path component and to respect the --restrict-file-names setting as
specified by the user.  It could be made to recognize UTF-8 character
sequences in UTF-8 locales and exempt valid UTF-8 chars from being
treated as control characters.  Invalid UTF-8 chars would still pass
all the checks, and non-canonical UTF-8 sequences would be rejected
(by condemning their byte values to being escaped as %..).  This is
not much work for someone who understands the basics of UTF-8.
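
To make that concrete, here is a rough sketch -- not actual Wget code, and
the function name is made up -- of the kind of check append_uri_pathel
could apply to each byte before deciding whether to escape it:

    /* Return the length of a valid, shortest-form multibyte UTF-8
       sequence starting at P (with REMAINING bytes available), or 0 if
       P does not start one.  Bytes covered by a nonzero return could be
       exempted from %-escaping in UTF-8 locales.  */
    static int
    utf8_sequence_length (const unsigned char *p, int remaining)
    {
      unsigned char c = p[0];
      int len, i;
      unsigned int cp;

      if (c < 0xc2)
        return 0;             /* ASCII, continuation byte, or overlong lead */
      else if (c < 0xe0)
        len = 2, cp = c & 0x1f;
      else if (c < 0xf0)
        len = 3, cp = c & 0x0f;
      else if (c < 0xf5)
        len = 4, cp = c & 0x07;
      else
        return 0;             /* 0xf5-0xff never occur in UTF-8 */

      if (remaining < len)
        return 0;
      for (i = 1; i < len; i++)
        {
          if ((p[i] & 0xc0) != 0x80)
            return 0;         /* not a continuation byte */
          cp = (cp << 6) | (p[i] & 0x3f);
        }
      /* Reject non-canonical (overlong) forms, surrogates, and
         out-of-range values.  */
      if ((len == 3 && cp < 0x800)
          || (len == 4 && cp < 0x10000)
          || (cp >= 0xd800 && cp <= 0xdfff)
          || cp > 0x10ffff)
        return 0;
      return len;
    }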


[fwd] Wget Bug: recursive get from ftp with a port in the url fails

2007-09-17 Thread Hrvoje Niksic
---BeginMessage---
Hi, I am using wget 1.10.2 on Windows 2003, and I have the same problem as Cantara.
The file system is NTFS.
Well I find my problem is, I wrote the command in schedule tasks like this:

wget  -N -i D:\virus.update\scripts\kavurl.txt -r -nH -P
d:\virus.update\kaspersky

well, after wget, and before -N, I typed TWO spaces.

After deleting one space, wget works well again.

Hope this can help.

:)

-- 
from:baalchina
---End Message---


Re: --post-data encoding

2007-09-03 Thread Hrvoje Niksic
control H [EMAIL PROTECTED] writes:

 After a few hours of headache I found out my --post-data option
 didn't work as I expected because the data I send has to be
 URL-escaped. This is not mentioned both in the manpage and inline
 help. A remark would be helpful.

Note that, in general, it doesn't.  POST requests are a generic
mechanism for transferring data, and a valid POST request can contain
an (unencoded) XML document or even binary data in its body.  URL
encoding is only necessary when transferring HTML form data from a
client to the server.  Wget doesn't assume that this is the case --
the POST options are designed as a low-level tool which the user is
expected to understand how to use.

I now see that this is not the most useful design for most people; for
one, the manual could at least document the typical usage.  It is also
inconsistent because Wget automatically sends Content-Type:
application/x-www-form-urlencoded when one of the POST options is in
use, which indicates that the primary usage for POST was uploading
form data.
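
For example, the typical form-submission usage the manual could show boils
down to something like this (the URL and field names are invented for the
example):

    wget --post-data 'user=foo&comment=hello%20world' http://server/form.cgi

i.e. the caller provides the key=value pairs already URL-encoded.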

One way to solve this is to introduce higher-level functionality, such
as --form-data and --form-attach (for uploading files) which construct
a POST request suitable for sending form data, so the user doesn't
have to.  In that case --post-data and --post-file would no longer
need to set content-type to application/x-www-form-urlencoded.


Re: wget 1.10.2 (warnings)

2007-08-24 Thread Hrvoje Niksic
Esin Andrey [EMAIL PROTECTED] writes:

 Hi!
 I have downloaded wget-1.10.2 sources and try to compile it.
 I have some warnings:

 init.c: In function ‘cmd_spec_prefer_family’:
 init.c:1193: warning: dereferencing type-punned pointer will break
 strict-aliasing rules

 I have written a patch which corrects this warning (it is attached)

Thank you for the report.  I don't understand the warning, but there
are problems with your patch.  You cannot cast opt.prefer_family to
int * because the value of opt.prefer_family is not a pointer.
Likewise, you cannot change assignment to *place to assignment to
place, since then you're only changing the value of a local variable,
rendering the code a no-op.

I agree that warnings should be fixed, but one must be careful not to
break code in the process.


Re: FTP OS-dependence, and new FTP RFC

2007-08-04 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I have a question: why do we attempt to generate absolute paths and
 such and CWD to those, instead of just doing the portable
 string-of-CWDs to get where we need to be?

I think the original reason was that absolute paths allow crossing
from any directory to any other directory without using ...  This is
needed by the recursive download code, which downloads from multiple
directories.

I agree that string-of-CWDs would be better than the current solution.


Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I agree that it's probably a good idea to move HTML parsing to a model
 that doesn't require slurping everything into memory;

Note that Wget mmaps the file whenever possible, so it's not actually
allocated on the heap (slurped).  You need some memory to store the
URLs found in the file, but that's not really avoidable.  I agree that
it would be better to completely avoid the memory-based model, as it
would allow links to be extracted on-the-fly, without saving the file
at all.  It would be an interesting exercise to write or integrate a
parser that works like that.

Regarding limits to file size, I don't think they are a good idea.
Whichever limit one chooses, someone will find a valid use case broken
by the limit.  Even an arbitrary limit I thought entirely reasonable,
such as the maximum redirection count, recently turned out to be
broken by design.  In this case it might make sense to investigate
exactly where and why the HTML parser spends the memory; perhaps the
parser saw something it thought was valid HTML and tried to extract a
huge link from it?  Maybe the parser simply needs to be taught to
perform sanity checks on URLs it encounters.


Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Yes, but when mmap()ping with MEM_PRIVATE, once you actually start
 _using_ the mapped space, is there much of a difference?

As long as you don't write to the mapped region, there should be no
difference between shared and private mapped space -- that's what copy
on write (explicitly documented for MAP_PRIVATE in both Linux and
Solaris mmap man pages) is supposed to accomplish.  I could have used
MAP_SHARED, but at the time I believe there was still code that relied
on being able to write to the buffer.  That code was subsequently
removed, but MAP_PRIVATE stayed because I saw no point in removing it.
Given the semantics of copy on write, I figured there would be no
difference between MAP_SHARED and unwritten-to MAP_PRIVATE.
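
For illustration, the mapping amounts to roughly the following (a
simplified sketch, not the actual utils.c code, with error handling
trimmed):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map FILENAME read-only; the kernel pages it in on demand, and since
       the region is never written to, MAP_PRIVATE costs no more than
       MAP_SHARED thanks to copy on write.  */
    static void *
    map_file_readonly (const char *filename, size_t *size)
    {
      struct stat st;
      void *map;
      int fd = open (filename, O_RDONLY);
      if (fd < 0)
        return NULL;                    /* caller falls back to read() */
      if (fstat (fd, &st) < 0)
        {
          close (fd);
          return NULL;
        }
      *size = st.st_size;
      map = mmap (NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
      close (fd);                       /* the mapping outlives the fd */
      return map == MAP_FAILED ? NULL : map;
    }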

As for the memory footprint getting large, sure, Wget reads through it
all, but that is no different from what, say, grep --mmap does.  As
long as we don't jump backwards in the file, the OS can swap out the
unused parts.  Another difference between mmap and malloc is that
mmap'ed space can be reliably returned to the system.  Using mmap
pretty much guarantees that Wget's footprint won't increase to 1GB
unless you're actually reading a 1GB file, and even then much less
will be resident.

 mmap() isn't failing; but wget's memory space gets huge through the
 simple use of memchr() (on '<', for instance) on the mapped address
 space.

Wget's virtual memory footprint does get huge, but the resident memory
needn't.  memchr only accesses memory sequentially, so the above swap
out scenario applies.  More importantly, in this case the report
documents failing to allocate -2147483648 bytes, which is a malloc
or realloc error, completely unrelated to mapped files.

 Still, perhaps a better way to approach this would be to use some
 sort of heuristic to determine whether the file looks to be
 HTML. Doing this reliably without breaking real HTML files will be
 something of a challenge, but perhaps requiring that we find
 something that looks like a familiar HTML tag within the first 1k or
 so would be appropriate. We can't expect well-formed HTML, of
 course, so even requiring an HTML tag is not reasonable: but
 finding any tag whatsoever would be something to start with.

I agree in principle, but I'd still like to know exactly what went
wrong in the reported case.  I suspect it's not just a case of
mmapping a huge file, but a case of misparsing it, for example by
attempting to extract a URL hundreds of megabytes long.


Re: text/html assumptions, and slurping huge files

2007-08-01 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Actually, I was wrong though: sometimes mmap() _is_ failing for me
 (did just now), which of course means that everything is in resident
 memory.

I don't understand why mmapping a regular file would fail on Linux.  What
error code are you getting?

(Wget tries to handle mmap failing gracefully because the GNU coding
standards require it, and also because we support mmap-unaware
operating systems anyway.)


Re: Wget and Automake

2007-07-20 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

   - Automated packaging and package-testing

What packaging does this refer to exactly?

   - Automatic support for a wide variety of configuration and build
 scenarios, such as configuring or building from a location other than
 the source directory tree, or the DESTDIR late installation-location
 variable.

This has worked with Autoconf-enabled programs (including Wget) for
ages.

   - Complicated

 I actually don't find this to be true. The arguments to this effect seem
 to refer to the generated Makefiles but you don't _edit_ those.

I think this is the crucial point.  If you really find the generated
Makefiles to be manageable, both in the case when you need to edit them by
hand (for whatever reason) and in the case when you need to understand
them (to tell why something went wrong or to fix a problem), then
Automake is the right choice.  I find Automake-generated Makefiles to
be completely unreadable and immutable.  The only ones that even come
close are the Makefiles generated by imake, and autotools were
supposed to be a step forward.

 In terms of actually writing the Makefile.am documents, though, in
 general it is actually much _easier_ than writing the plain Makefile
 equivalents.

As long as what you want to do is supported by Automake, yes.

 I obviously wouldn't be looking to make the move for our upcoming
 1.11 release in September; but I would desire to make the move soon
 thereafter. Since this was apparently something that some people
 felt strongly about, I thought it'd be wise to broach the subject
 now, so we have plenty of time to discuss it. So, please speak up!

I don't think you will find hard technical arguments one way or the
other; at this point the choice seems a matter of taste more than
anything else.  And as always in such matters, who does the work gets
to make the call.  Either way, I'll certainly support your decision.


Re: Why --exclude-directories, and not --exclude-paths?

2007-07-19 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I don't know.  The reason directories are matched separately from
 files is because files often *don't* match the pattern you've chosen
 for directories.  For example, -X/etc should exclude anything under
 /etc, such as /etc/passwd, but also /etc/foo/bar/baz.  Since '*' in
 shell-like globs doesn't match '/', it is impossible to write a glob
 expression to match all files arbitrarily deep under a directory.

 Perhaps a ** wildcard, that matches /, would be useful; then we could
 combine the meanings of -X and -R (and, -I and -A).

That sounds pretty neat.  I wonder if it would be compatible with
zsh's **, which matches any number of subdirectories.  (For example,
zsh's **/*.c is roughly equivalent to `find . -name *.c` in other
shells.)


Re: ignoring robots.txt

2007-07-18 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I think we should either be a stub, or a fairly complete manual
 (and agree that the latter seems preferable); nothing half-way
 between: what we have now is a fairly incomplete manual.

Converting from Info to man is harder than it may seem.  The script
that does it now is basically a hack that doesn't really work well
even for the small part of the manual that it tries to cover.

What makes it harder is the impedance mismatch between Texinfo and
Unix manual philosophies.  What is appropriate for a GNU manual, for
example tutorial-style nodes, a longish FAQ section, or the inclusion
of the entire software license, would be completely out of place in a
man page.  (This is a consequence of Info being hyperlinked, which
means that it's easier to skip the nodes one is not interested in, at
least in theory.)  On the other hand, information crucial to any man
page, such as clearly delimited sections that include SYNOPSIS,
DESCRIPTION, FILES or SEE ALSO, might not be found in a Texinfo
document at all, at least not in an easily recognizable and
extractable form.

As for the stub man page... Debian for one finds it unacceptable,
and I can kind of understand why.  When I pulled the man page out of
the distribution, Debian's solution was to keep maintaining the old
man page and distributing it with their package.  As a result, any
Debian user who issued `man wget' would read Debian-maintained man
page and was at the mercy of the Debian maintainer to have ensured
that the man page was updated as new features arrived.  Since most
Unix users only read the man page and never bother with Info, this was
suboptimal -- a crucial piece of documentation was not inherited from
the project, but produced by Debian.  (I further didn't like that the
maintainer used my original man page even though I explicitly asked
them not to, but that's another matter.)

When the Debian maintainer stepped down, I agreed with his successor
to a compromise solution: that a man page would be automatically
generated from the Info documentation which would contain at least a
fairly complete list of command-line options.  It was far from
perfect, but it was still better than nothing, and it was deemed Good
Enough.  Note that I'm not saying the current solution is good enough
-- it isn't.  I'm just providing a history of how the current state of
the affairs came to be.


Re: Man pages [Re: ignoring robots.txt]

2007-07-18 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Converting from Info to man is harder than it may seem.  The script
 that does it now is basically a hack that doesn't really work well
 even for the small part of the manual that it tries to cover.

 I'd noticed. :)

 I haven't looked at the script that does this work; I had assumed
 that it was some standard tool for this task, but perhaps it's
 something more custom?

Our `texi2pod' comes from GCC, and it would seem that it was written
for GCC/binutils.  The version in the latest binutils is almost
identical to what Wget ships, plus a bug fix or two.  Given its state
and capabilities, I doubt that it is widely used, so unfortunately
it's pretty far from being a standard tool.  I would have much
preferred to use a standard tool, but as far as I knew none was
available at the time.  In fact, I'm not aware of one now, either.

 As for the stub man page... Debian for one finds it unacceptable,
 and I can kind of understand why.

 Yeah, especially since they're frequently forced to leave out the
 authoritative manual.

This issue predates the GFDL debacle by several years, but yes, if
anything, things have gotten worse in that department.  Much worse.


Re: No more dev change posts to wget-patches?

2007-07-17 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I would like for devs to be able to avoid the hassle of posting
 non-trivial changes they make to the wget-patches list. To my mind,
 there are two ways of accomplishing this:

 1. Make wget-patches a list _only_ for submitting patches for
 consideration by devs, no longer with the additional purpose of
 communicating changes from the devs to the users.

I don't think wget-patches was ever meant for communicating changes
from the developers to the users.  The main wget list was supposed to
be used for that.  As far as I'm aware, wget-patches was always a list
meant for receiving patches (and possibly tracking them).

 If using wget-patches to communicate changes that have been made is
 in fact still a useful thing, then option #2 would be best. However,
 it's not clear to me that a significant number of people are
 actually reading wget-patches for this purpose, in which case any
 that do want to know about such changes are probably better off
 subscribing to wget-notify to see them, and I should employ option
 #1.

I think I agree with your reasoning.


Re: Why --exclude-directories, and not --exclude-paths?

2007-07-17 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Someone just asked on the #wget IRC channel if there was a way to
 exclude files with certain names, and I recommended -X, without
 realizing that that option excludes directories, not files.

 My question is: why do we allow users to exclude directories, but
 not files?

-R allows excluding files.  If you use a wildcard character in -R, it
will treat it as a pattern and match it against the entire file name.
If not, it will treat it as a suffix (not really an extension, it
doesn't care about '.' being there or not).  -X always excludes
directories and allows wildcards.

It was supposed to be a DWIM thing.
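
A quick illustration of the two modes (the URL and names are placeholders):

    wget -r -R jpg,png http://site/      # suffixes: rejects *.jpg and *.png
    wget -r -R 'thumb*' http://site/     # pattern: rejects any file whose
                                         # entire name matches thumb*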


Re: Why --exclude-directories, and not --exclude-paths?

2007-07-17 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Yes, but -R has a lesser degree of control over the sorts of
 pathnames that it can constrain: for instance, if one uses
 -Rmyprefix*, it will match files myprefix-foo.html and
 myprefix-bar.mp3; but it will also match notmyprefix.js, which is
 probably not what the user desired.

Are you sure?  It certainly wasn't designed that way.  Your example
should only exclude files beginning with myprefix.

 It was supposed to be a DWIM thing.

 Where what I mean is what? :)

Suffix when no wildcards are used, pattern when they are.

 My question is: what use cases are there in which one would want to
 exclude directories, but not files that match that pattern?

I don't know.  The reason directories are matched separately from
files is because files often *don't* match the pattern you've chosen
for directories.  For example, -X/etc should exclude anything under
/etc, such as /etc/passwd, but also /etc/foo/bar/baz.  Since '*' in
shell-like globs doesn't match '/', it is impossible to write a glob
expression to match all files arbitrarily deep under a directory.

 I see the utility of a general matcher against file names in
 general; I'm not sure I see much utility of a separate option to
 match against just directories.

I hope the above clears it up.  Choosing a matcher that doesn't
special-case '/' might be sufficient.
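
To illustrate the semantics with the standard matcher (just an
illustration, not Wget's actual code path):

    #include <stdio.h>
    #include <fnmatch.h>

    int
    main (void)
    {
      /* 0 means match; FNM_PATHNAME keeps '*' from crossing '/'.  */
      printf ("%d\n", fnmatch ("/etc/*", "/etc/passwd", FNM_PATHNAME));
      printf ("%d\n", fnmatch ("/etc/*", "/etc/foo/bar/baz", FNM_PATHNAME));
      return 0;
    }

The first call returns 0 (match), the second a nonzero FNM_NOMATCH, which
is exactly why a single glob can't cover arbitrarily deep subdirectories.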

 How much potential harm would it cause to replace the current
 behavior of -X to do the equivalent to Josh's --exclude-files?

It will break usage such as the above -X/etc example.


Re: wget-patches status?

2007-07-06 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 What is the status of the wget-patches list: is it being actively
 used/monitored? Does it still serve its original purpose?

Mauro and I are subscribed to it.  The list served its purpose while
Wget was actively maintained.  It's up to you whether to preserve it
or replace it with a bug tracker patch submission process.

 A brief glance at the archives seems to suggest that, for one reason
 or another, it may be suffering a larger spam problem than the main
 list; is this accurate?

It's true.  The main Wget list allows posting from non-subscribers,
but requires an authentication response; that has worked well to
prevent spam.  The patches list doesn't have such a mechanism
installed, which results in more spam.  (Of course, it still uses the
general antispam filter installed on the site, or the quantity of spam
would be unbearable.)


Re: wget-patches status?

2007-07-06 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Mauro and I are subscribed to it.  The list served its purpose while
 Wget was actively maintained.  It's up to you whether to preserve it
 or replace it with a bug tracker patch submission process.

 Given the low incidence of patch submission, is there any reason why we
 can't accept patch submissions on the main list?

I think the original reasoning was that patches can be large and some
people don't like receiving large attachments in the mail.  Also, it
would (in theory) have been easier for someone only interested in the
patches, such as Linux distribution maintainers, to only follow the
patches list.  But with the current mail capacities and with the
advent of public version control servers, that doesn't seem necessary.

 Would it be useful to implement the same authentication process for
 wget-patches; or was it intended to make things easier for
 drive-by patchers?

I think it would be perfectly fine to implement the same level of
protection there.  In fact, most free software mailing lists are much
more annoying: they require you to *subscribe* (or register into a bug
tracker) merely to send a bug report or a patch.  Compared to that
hassle, asking for a confirmation email is negligible.


Re: bug and patch: blank spaces in filenames causes looping

2007-07-05 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 There is a buffer overflow in the following line of the proposed code:

  sprintf(filecopy, "\"%.2047s\"", file);

Wget has an `aprintf' utility function that allocates the result on
the heap.  Avoids both buffer overruns and arbitrary limits on file
name length.
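
An untested sketch of what that would look like here (aprintf from utils.c,
plus the matching xfree):

    char *filecopy = aprintf ("\"%s\"", file);
    /* ... use filecopy exactly as before ... */
    xfree (filecopy);

No fixed buffer, so no overflow and no 2047-character cap.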


Re: bug and patch: blank spaces in filenames causes looping

2007-07-05 Thread Hrvoje Niksic
Rich Cook [EMAIL PROTECTED] writes:

 Trouble is, it's undocumented as to how to free the resulting
 string.  Do I call free on it?

Yes.  "Freshly allocated with malloc" in the function documentation
was supposed to indicate how to free the string.


Re: bug and patch: blank spaces in filenames causes looping

2007-07-05 Thread Hrvoje Niksic
Virden, Larry W. [EMAIL PROTECTED] writes:

 Tony Lewis [EMAIL PROTECTED] writes:

 Wget has an `aprintf' utility function that allocates the result on
 the heap.  Avoids both buffer overruns and 
 arbitrary limits on file name length.

 If it uses the heap, then doesn't that open a hole where a particularly
 long file name would overflow the heap?

No, aprintf tries to allocate as much memory as necessary.  If the
memory is unavailable, malloc returns NULL and Wget exits.


Re: bug and patch: blank spaces in filenames causes looping

2007-07-05 Thread Hrvoje Niksic
Rich Cook [EMAIL PROTECTED] writes:

 On Jul 5, 2007, at 11:08 AM, Hrvoje Niksic wrote:

 Rich Cook [EMAIL PROTECTED] writes:

 Trouble is, it's undocumented as to how to free the resulting
 string.  Do I call free on it?

 Yes.  Freshly allocated with malloc in the function documentation
 was supposed to indicate how to free the string.

 Oh, I looked in the source and there was this xmalloc thing that
 didn't show up in my man pages, so I punted.  Sorry.

No problem.  Note that xmalloc isn't entirely specific to Wget, it's a
fairly standard GNU name for a malloc-or-die function.

Now I remembered that Wget also has xfree, so the above advice is not
entirely correct -- you should call xfree instead.  However, in the
normal case xfree is a simple wrapper around free, so even if you used
free, it would have worked just as well.  (The point of xfree is that
if you compile with DEBUG_MALLOC, you get a version that checks for
leaks, although it should be removed now that there is valgrind, which
does the same job much better.  There is also the business of barfing
on NULL pointers, which should also be removed.)

I'd have implemented a portable asprintf, but I liked the aprintf
interface better (I first saw it in libcurl).


Re: New wget maintainer

2007-06-27 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 The GNU Project has appointed me as the new maintainer for wget,

Welcome!

If you need assistance regarding the workings of the internals or
design decisions, please let me know and I'll gladly help.  I haven't
had much time to participate lately, but hopefully I'll have more time
in the following months.


Re: Crash

2007-05-29 Thread Hrvoje Niksic
Adrian Sandor [EMAIL PROTECTED] writes:

 Thanks a lot Steven,

 Apparently there's more than a little code in src/cookies.c which is
 not ready for NULL values in the attr and value members of the
 cookie structure.

 Does that mean wget is buggy or does brinkster break the cookie
 specification?

Probably both, but Wget is definitely buggy since it crashes.  Looking
at the code, it would appear the bug has been fixed in the repository.


Re: Loading cookies that were set by Javascript

2007-05-18 Thread Hrvoje Niksic
George Pavlov [EMAIL PROTECTED] writes:

  Permanent cookies are supposed to be present in cookies.txt, and
  Wget will use them.  Session cookies will be missing (regardless
  of how they were set) from the file and therefore will not be
  picked up by Wget.

 This is not entirely true. You can use --keep-session-cookies

What I meant is that the session cookies will not be saved by the
browser to cookies.txt, regardless of whether they were set by the
server or by Javascript.  I was assuming the OP was already using a
browser-created cookies.txt.


Re: Loading cookies that were set by Javascript

2007-05-16 Thread Hrvoje Niksic
Poppa Pump [EMAIL PROTECTED] writes:

 Now I also need to load 2 more cookie values, but these are set
 using Javascript. Does anyone know how to set those cookies. I can't
 seem to find any info on this. Thanks for your help.

Wget doesn't really distinguish the cookies set by Javascript from
those set otherwise.  Permanent cookies are supposed to be present in
cookies.txt, and Wget will use them.  Session cookies will be missing
(regardless of how they were set) from the file and therefore will not
be picked up by Wget.

If you know which cookies the site is adding, and you can check that
using your browser's cookie manager, you can always add them manually
to cookies.txt and rerun Wget.


Re: Requests are always HTTP/1.0 ?!

2007-05-02 Thread Hrvoje Niksic
Greg Lindahl [EMAIL PROTECTED] writes:

 "Host: kpic1" is an HTTP/1.1 feature. So this is non-sensical.

The `Host' header was widely used with HTTP/1.0, which is how it
entered the HTTP/1.1 spec.
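
That is, a perfectly ordinary Wget request today is an HTTP/1.0 request
that nonetheless carries a Host header, something like:

    GET /index.html HTTP/1.0
    Host: www.example.com
    User-Agent: Wget/1.10.2

HTTP/1.1 merely made the header mandatory.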

For other reasons, Wget should really upgrade to using HTTP/1.1.


Re: [PATCH] cross-mingw32 support

2007-04-01 Thread Hrvoje Niksic
Robert Millan [EMAIL PROTECTED] writes:

 -AC_CHECK_FUNCS(strtoll usleep ftello sigblock sigsetjmp memrchr)
 +AC_CHECK_FUNCS(strtoll usleep ftello sigblock sigsetjmp memrchr strcasecmp strncasecmp strdup isatty symlink)
  
 -dnl We expect to have these functions on Unix-like systems configure
 -dnl runs on.  The defines are provided to get them in config.h.in so
 -dnl Wget can still be ported to non-Unix systems (such as Windows)
 -dnl that lack some of these functions.
 -AC_DEFINE([HAVE_STRCASECMP], 1, [Define to 1 if you have the `strcasecmp' function.])
 -AC_DEFINE([HAVE_STRNCASECMP], 1, [Define to 1 if you have the `strncasecmp' function.])
 -AC_DEFINE([HAVE_STRDUP], 1, [Define to 1 if you have the `strdup' function.])
 -AC_DEFINE([HAVE_ISATTY], 1, [Define to 1 if you have the `isatty' function.])
 -AC_DEFINE([HAVE_SYMLINK], 1, [Define to 1 if you have the `symlink' function.])
 -

I don't like forcing all Unix systems to go through these completely
unnecessary checks only for the sake of rare Win32 cross-compilation.


Re: -i option

2007-03-29 Thread Hrvoje Niksic
Eugene Homyakov [EMAIL PROTECTED] writes:

 Could you please make -i option accept URL? This is useful when
 downloading m3u's

Note that you can easily chain Wget invocations, e.g.

wget -qO- URL | wget -i-


Re: wget-1.10.2 pwd/cd bug

2007-03-27 Thread Hrvoje Niksic
Hrvoje Niksic [EMAIL PROTECTED] writes:

 [EMAIL PROTECTED] (Steven M. Schweda) writes:

It's starting to look like a consensus.  A Google search for:
 wget DONE_CWD
 finds:

   http://www.mail-archive.com/wget@sunsite.dk/msg08741.html

 That bug is fixed in subversion, revision 2194.

I forgot to add that this means that the patch can be retrieved with
`svn diff -r2193:2194' in Wget's source tree.  If you don't have a
checkout handy, Subversion still allows you to generate a diff using
`svn diff -r2193:2194 http://svn.dotsrc.org/repo/wget/trunk/'.

Also note that the fix is also available on the stable branch, and I
urge the distributors to apply it to their versions until 1.10.3 or
1.11 is released.


Re: wget-1.10.2 pwd/cd bug

2007-03-25 Thread Hrvoje Niksic
[EMAIL PROTECTED] (Steven M. Schweda) writes:

It's starting to look like a consensus.  A Google search for:
 wget DONE_CWD
 finds:

   http://www.mail-archive.com/wget@sunsite.dk/msg08741.html

That bug is fixed in subversion, revision 2194.


Re: Patch for Windows Build

2007-02-11 Thread Hrvoje Niksic
Applied, thanks.  Sorry about the delay.


Re: utils.c:get_grouping_data calls strdup with a null pointer

2007-02-11 Thread Hrvoje Niksic
Thanks for the report.  Please note that your patch sets the thousands
separator to C, which is probably not what you had in mind.  I'm
about to apply a slightly different patch to deal with the problem you
describe:

2007-02-11  Hrvoje Niksic  [EMAIL PROTECTED]

* utils.c (get_grouping_data): Cope with systems where
localeconv() doesn't initialize lconv->thousands_sep and/or
lconv->grouping.  Based on report by Dirk Vanhaute.

Index: src/utils.c
===
--- src/utils.c (revision 2205)
+++ src/utils.c (working copy)
@@ -1215,8 +1215,8 @@
   /* Get the grouping info from the locale. */
   struct lconv *lconv = localeconv ();
   cached_sep = lconv->thousands_sep;
-  cached_grouping = lconv->grouping;
-  if (!*cached_sep)
+  cached_grouping = lconv->grouping ? lconv->grouping : "\x03";
+  if (!cached_sep || !*cached_sep)
{
  /* Many locales (such as "C" or "hr_HR") don't specify
 grouping, which we still want to use it for legibility.


Re: wget -S dies with wget: realloc: Failed to allocate -2147483648 bytes; memory exhausted. because of 8-bit characters in HTTP headers

2007-02-02 Thread Hrvoje Niksic
Vladimir Volovich [EMAIL PROTECTED] writes:

 when using the -S option, wget dies apparently because the server
 returns 8-bit characters in the WWW-Authenticate header:
[...]

Thank for the report and the test case.  This patch fixes the problem:

2007-02-02  Hrvoje Niksic  [EMAIL PROTECTED]

* http.c (print_server_response): Escape non-printable characters
in server response.

Index: src/http.c
===
--- src/http.c  (revision 2202)
+++ src/http.c  (working copy)
@@ -738,6 +738,20 @@
   xfree (resp);
 }
 
+/* Print a single line of response, the characters [b, e).  We tried
+   getting away with
+  logprintf (LOG_VERBOSE, "%s%.*s\n", prefix, (int) (e - b), b);
+   but that failed to escape the non-printable characters and, in fact,
+   caused crashes in UTF-8 locales.  */
+
+static void
+print_response_line(const char *prefix, const char *b, const char *e)
+{
+  char *copy;
+  BOUNDED_TO_ALLOCA(b, e, copy);
+  logprintf (LOG_VERBOSE, "%s%s\n", prefix, escnonprint(copy));
+}
+
 /* Print the server response, line by line, omitting the trailing CRLF
from individual header lines, and prefixed with PREFIX.  */
 
@@ -756,9 +770,7 @@
 --e;
   if (b < e && e[-1] == '\r')
 --e;
-  /* This is safe even on printfs with broken handling of "%.ns"
- because resp->headers ends with \0.  */
-  logprintf (LOG_VERBOSE, "%s%.*s\n", prefix, (int) (e - b), b);
+  print_response_line(prefix, b, e);
 }
 }


Re: Output inconsistency

2007-01-25 Thread Hrvoje Niksic
Nejc Škoberne [EMAIL PROTECTED] writes:

 [EMAIL PROTECTED]:~# wget -O /dev/null http://10.0.0.2/testsmall.dat 2>&1 |
 grep saved
 10:38:13 (86,22 MB/s) - `/dev/null' saved [21954560/21954560]
 [EMAIL PROTECTED]:~# wget -O /dev/null ftp://testuser:[EMAIL 
 PROTECTED]/testsmall.dat 2>&1 | grep saved
 10:38:18 (90.47 MB/s) - `/dev/null' saved [21954560]

 In the HTTP case, the final average transfer speed number is
 delimited by a comma, while in the FTP case it is delimited by a
 dot. Is this a bug/inconsistency?

Which version of Wget are you using, and in which locale?  I cannot
repeat that bug with either Wget 1.10.2 or with the latest Wget from
the subversion repository.  (I tested in the "hr" locale, which uses
"," as a decimal separator.)


Re: wget css parsing, updated to trunk

2007-01-23 Thread Hrvoje Niksic
Ted Mielczarek [EMAIL PROTECTED] writes:

 Is there any interest in this?

Sorry for answering this late.  I, for one, find it very interesting.
Fetching CSS would be a very welcome feature.

