Re: Wget 1.8.2 is released
On Thu, 30 May 2002 03:43:06 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: Ian Abbott [EMAIL PROTECTED] writes: This is a bit late, Sorry it didn't make it in. I guess we could publish it on the web site, so that people who wish to compile 1.8.2 with Borland C++ can do so. Heiko's Wget on Windows page is another good place to link to this patch. I'll clean it up a bit. +# ifdef NO_ANONYMOUS_STRUCT + wt->wintime.u.HighPart = ft.dwHighDateTime; + wt->wintime.u.LowPart = ft.dwLowDateTime; +# else wt->wintime.HighPart = ft.dwHighDateTime; wt->wintime.LowPart = ft.dwLowDateTime; +# endif Isn't anonymous struct a C++ feature? (I'm only guessing here.) Yes, but some C compilers support it as an extension. Would wt->wintime.u.HighPart work under both compilers? I'm just asking as someone who would like to see the number of #ifdefs decrease rather than increase. Microsoft only document the anonymous form in their Win32 SDK, which is why I'm hesitant to take it out altogether. However, the undocumented, non-anonymous u. form does seem to work uniformly, at least with the Microsoft, Borland and Watcom compilers I've tried.
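To see the difference in isolation, here is a small standalone sketch (the union below is just an illustrative stand-in for the Win32 ULARGE_INTEGER, not the SDK's own definition):

#include <stdio.h>

/* Illustrative stand-in for the ULARGE_INTEGER layout: a named union
   member `u', plus an anonymous struct where the compiler accepts that
   extension (define NO_ANONYMOUS_STRUCT for compilers that do not).  */
typedef union {
  struct { unsigned long LowPart; unsigned long HighPart; } u;
#ifndef NO_ANONYMOUS_STRUCT
  struct { unsigned long LowPart; unsigned long HighPart; };
#endif
} fake_ularge;

int
main (void)
{
  fake_ularge x;

  x.u.HighPart = 1;    /* the "u." form: works with every compiler tried */
  x.u.LowPart = 2;
#ifndef NO_ANONYMOUS_STRUCT
  x.HighPart = 1;      /* the documented, anonymous form */
  x.LowPart = 2;
#endif
  printf ("%lu %lu\n", x.u.HighPart, x.u.LowPart);
  return 0;
}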
Re: Wget 1.8.2 is released
On Wed, 29 May 2002 05:14:14 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: Wget 1.8.2, a bugfix release of Wget, has been released, and is now available from the GNU ftp site: ftp://ftp.gnu.org/pub/gnu/wget/wget-1.8.2.tar.gz This is a bit late, but here is a patch to compile it with Borland C++ 4.5 (compiler version 4.5.2). With a small change to the Makefile to select a different linker, it also compiles with Borland C++ 5.5 (compiler version 5.5.1). The Makefile to change is windows/Makefile.src.bor before running configure --borland, or alternatively change src/Makefile after running configure --borland. diff -ru wget-1.8.2/src/utils.c wget-1.8.2.new/src/utils.c --- wget-1.8.2/src/utils.c Sat May 18 04:05:22 2002 +++ wget-1.8.2.new/src/utils.c Mon May 27 19:44:40 2002 @@ -1504,8 +1504,13 @@ SYSTEMTIME st; GetSystemTime (st); SystemTimeToFileTime (st, ft); +# ifdef NO_ANONYMOUS_STRUCT + wt-wintime.u.HighPart = ft.dwHighDateTime; + wt-wintime.u.LowPart = ft.dwLowDateTime; +# else wt-wintime.HighPart = ft.dwHighDateTime; wt-wintime.LowPart = ft.dwLowDateTime; +# endif #endif } @@ -1533,8 +1538,13 @@ ULARGE_INTEGER uli; GetSystemTime (st); SystemTimeToFileTime (st, ft); +# ifdef NO_ANONYMOUS_STRUCT + uli.u.HighPart = ft.dwHighDateTime; + uli.u.LowPart = ft.dwLowDateTime; +# else uli.HighPart = ft.dwHighDateTime; uli.LowPart = ft.dwLowDateTime; +# endif return (long)((uli.QuadPart - wt-wintime.QuadPart) / 1); #endif } diff -ru wget-1.8.2/windows/Makefile.src.bor wget-1.8.2.new/windows/Makefile.src.bor --- wget-1.8.2/windows/Makefile.src.bor Tue Dec 4 10:33:18 2001 +++ wget-1.8.2.new/windows/Makefile.src.bor Wed May 29 12:20:51 2002 @@ -2,17 +2,25 @@ ## Makefile for use with watcom win95/winnt executable. CC=bcc32 + +## Please choose the linker used by your compiler + +## Linker for Borland C++ 5.5 +#LINK=ilink32 + +## Linker for Borland C++ 4.5 LINK=tlink32 LFLAGS= -CFLAGS=-DWINDOWS -DHAVE_CONFIG_H -I. -H -H=wget.csm -w- +CFLAGS=-DWINDOWS=1 -DHAVE_CONFIG_H -I. -H -H=wget.csm -w- ## variables -OBJS=cmpt.obj connect.obj fnmatch.obj ftp.obj ftp-basic.obj \ - ftp-ls.obj ftp-opie.obj getopt.obj headers.obj host.obj html.obj \ - http.obj init.obj log.obj main.obj gnu-md5.obj netrc.obj rbuf.obj \ - alloca.obj \ - recur.obj res.obj retr.obj url.obj utils.obj version.obj mswindows.obj +OBJS=cmpt.obj safe-ctype.obj connect.obj fnmatch.obj ftp.obj ftp-basic.obj \ + ftp-ls.obj ftp-opie.obj getopt.obj hash.obj headers.obj html-parse.obj \ + html-url.obj progress.obj host.obj cookies.obj http.obj init.obj \ + log.obj main.obj gen-md5.obj gnu-md5.obj netrc.obj rbuf.obj \ + snprintf.obj recur.obj res.obj retr.obj url.obj utils.obj version.obj \ + mswindows.obj LIBDIR=$(MAKEDIR)\..\lib @@ -20,7 +28,7 @@ $(LINK) @| $(LFLAGS) -Tpe -ap -c + $(LIBDIR)\c0x32.obj+ -alloca.obj+ +snprintf.obj+ version.obj+ utils.obj+ url.obj+ @@ -37,9 +45,10 @@ log.obj+ init.obj+ http.obj+ -html.obj+ host.obj+ headers.obj+ +html-parse.obj+ +html-url.obj+ getopt.obj+ ftp-opie.obj+ ftp-ls.obj+ @@ -47,7 +56,10 @@ ftp.obj+ fnmatch.obj+ connect.obj+ -cmpt.obj +cmpt.obj+ +hash.obj+ +cookies.obj+ +safe-ctype.obj $,$* $(LIBDIR)\import32.lib+ $(LIBDIR)\cw32.lib diff -ru wget-1.8.2/windows/config.h.bor wget-1.8.2.new/windows/config.h.bor --- wget-1.8.2/windows/config.h.bor Sat May 18 04:05:28 2002 +++ wget-1.8.2.new/windows/config.h.bor Wed May 29 12:22:46 2002 @@ -1,5 +1,6 @@ /* Configuration header file. - Copyright (C) 1995, 1996, 1997, 1998 Free Software Foundation, Inc. 
+ Copyright (C) 1995, 1996, 1997, 1998, 2001, 2002 + Free Software Foundation, Inc. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by @@ -29,36 +30,23 @@ #ifndef CONFIG_H #define CONFIG_H -/* Define if you have the alloca.h header file. */ -#undef HAVE_ALLOCA_H +#define ftruncate chsize -/* AIX requires this to be the first thing in the file. */ -#ifdef __GNUC__ -# define alloca __builtin_alloca -#else -# if HAVE_ALLOCA_H -# include alloca.h -# else -# ifdef _AIX - #pragma alloca -# else -# ifndef alloca /* predefined by HP cc +Olibcalls */ -char *alloca (); -# endif -# endif -# endif -#endif +/* mswindows.h defines vsnprintf as _vsnprintf and snprintf as _snprintf + so work around that here. This is a temporary hack. The defines + in mswindows.h should be moved into config.h.ms. */ +#define _vsnprintf vsnprintf +#define _snprintf snprintf -/* Define if on AIX 3. - System headers sometimes define this. - We just want to avoid a redefinition error message. */ -#ifndef _ALL_SOURCE -/* #undef _ALL_SOURCE */ -#endif +/* Define if you have the alloca.h header file. */ +#undef HAVE_ALLOCA_H /* Define to empty if the keyword does not work. */ /* #undef const */ +/* Define to empty or
Re: query compiling wget 1.8.1 on Borland C++ 4.5
On Sat, 25 May 2002 19:03:45 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: Ian Abbott [EMAIL PROTECTED] writes: The 1.8.2 branch is pretty similar to 1.8.1 at the moment and doesn't compile with any version of Borland C++. Should we care to fix that before the release? I'm not sure how important error-free compilation under Borland is. I could attempt to get it to compile on Borland C++ 4.5. I'm not sure which previous releases compiled okay with that compiler, though. The main branch recently compiled okay with a later Borland compiler (Borland C++ 5.5.1) thanks to Chin-yuan Kuo. (This compiler was originally part of Borland's C++ Builder package, but is now available as a free (as in beer) download from Borland.) This compile is also broken at the moment, but just needs WGET_USE_STDARG defining in config.h.bor. I'll add that change to the main branch shortly. It would be easier to just apply this change to 1.8.2 than to make 1.8.2 compile with the older compiler package, I think, but I'll try and compile it with the older compiler and see if I can get anywhere with it. FWIW, the 1.8.2 branch compiles fine with the Watcom C++ 11.0 compiler.
Re: win32: how to send wget output to console and a log-file?
On Fri, 24 May 2002 20:34:38 +0400, Valery Kondakoff [EMAIL PROTECTED] wrote: I'm not sure I understand what exactly '2>&1' means. As far as I understand '>' is a redirection sign. So - '1' means stdout and '2' means stderr? They refer to the three standard file descriptors - 0 is standard input (stdin), 1 is standard output (stdout), 2 is standard error output (stderr). The '2>&1' means 'redirect standard error output to standard output'. This results in standard output becoming a combination of standard output and standard error output. There are other things you can do with them, such as: 1>file (stdout goes to file, same as >file) 2>file (stderr goes to file) >file 2>errfile (stdout goes to file, stderr goes to errfile) Some other combinations where the order matters: >file 2>&1 (both stderr and stdout go to file) 2>&1 >file (stdout goes to file, stderr goes to stdout) 2>&1 >file | command (stdout goes to file, stderr piped to command) BTW - if there are some plans to enhance wget logging possibilities? On a different thread (back in April) I suggested the following: |Perhaps we just need a --log-level=N option: | |Level 0: output just the LOG_ALWAYS messages. |Level 1: output the above and LOG_NOTQUIET messages. |Level 2: output the above and LOG_NONVERBOSE messages. |Level 3: output the above and LOG_VERBOSE messages. | |The --verbose option would be equivalent to --log-level=3 (the |default). | |The --non-verbose option would be equivalent to --log-level=2. | |The --quiet option would be equivalent to --log-level=1. However, I made a mistake in the above; the last line should have read: The --quiet option would be equivalent to --log-level=0. This means that none of the other options would be equivalent to --log-level=1. I suppose a --non-quiet option could be added for completeness, but the names of these options would be more horribly confused than they are at the moment. It would not be immediately obvious that the order of verbosity would then run: --quiet, --non-quiet, --non-verbose, --verbose.
Re: tag v:shapes ???
On Mon, 27 May 2002 16:22:57 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: Jacques Beigbeder [EMAIL PROTECTED] writes: I ran into a trouble with: wget -m http://some/site because of a line like: <img src="a.gif" v:shapes="..."> v:shapes contains a character ':', so a.gif isn't mirrored. Thanks for the report. I think I'll make NAME_CHAR_P much more forgiving about the type of characters it uses. Doing anything else is counter-productive, because too many pages use or leak weird characters in attribute names. This particular weird character looks like it's due to the use of XML namespaces. The colon separates the namespace prefix from the remainder of the attribute name.
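For what it's worth, a forgiving test along those lines could be as crude as the standalone sketch below (a hypothetical helper, not the actual NAME_CHAR_P definition): accept anything that is not whitespace or one of the few characters with special meaning inside a tag.

#include <ctype.h>
#include <stdio.h>

/* Hypothetical permissive test: treat a character as part of an
   attribute name unless it is whitespace or one of the characters that
   terminate a name inside a tag.  This lets namespace-prefixed names
   such as "v:shapes" through.  */
static int
forgiving_name_char_p (unsigned char c)
{
  return c != '\0' && !isspace (c) && c != '=' && c != '<' && c != '>';
}

int
main (void)
{
  printf ("%d\n", forgiving_name_char_p (':'));   /* prints 1 */
  return 0;
}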
Re: win32: how to send wget output to console and a log-file?
On Fri, 24 May 2002 15:41:01 +0400, Valery Kondakoff [EMAIL PROTECTED] wrote: Hello, Herold! 24 May 2002, you wrote to me: HH You could do something like tail -f on the logfile if you have a similar HH program installed, or log to output and | tee logfile, but all of those HH require another command. Thank you for your answer. I downloaded two win32 'tee' ports, and they works as expected when I'm entering in command line something like this: 'wget.exe -V | tee.exe wget.log', but after I enter 'wget.exe http://someurl.com | tee.exe wget.log' the 'wget.log' file remains empty... What is wrong? (WinXP Pro, GNU Wget 1.8.1+cvs). The '|' only redirects standard output, but wget writes to standard error output. To capture standard error output, you need a little utility that launches another program while capturing standard error output. While I was working somewhere else, I used a program called ftee (or ftee32) to do this, but I don't have a copy. The only references that I can find to this utility on the web indicate that it is part of Starbase's CodeWright product. (It's possible to download an evaluation version of this, but it seems a large download if you just want the little ftee utility, and legally, you shouldn't use it after the evaluation period expires.) Maybe Wget should have a -o - option to send logging output to standard output.
Re: win32: how to send wget output to console and a log-file?
On Fri, 24 May 2002 08:03:15 -0700 (PDT), Doug Kaufman [EMAIL PROTECTED] wrote: On Fri, 24 May 2002, Valery Kondakoff wrote: I downloaded two win32 'tee' ports, and they works as expected when I'm entering in command line something like this: 'wget.exe -V | tee.exe wget.log', but after I enter 'wget.exe http://someurl.com | tee.exe wget.log' the 'wget.log' file remains empty... What is wrong? (WinXP Pro, GNU Wget 1.8.1+cvs). Wget sends to stderr by default. Try wget -o - |tee wget.log. This should send output to stdout, which tee can then handle. That doesn't work. It just creates a file called -. Interestingly, I've just found out that Win NT's default command-line shell (cmd.exe) supports Unix-style redirectors. So you can use: C:\>wget http://someurl.com 2>&1 | tee wget.log That should work on Windows NT, 2000 and XP but won't work on Windows 95, 98 or ME as they use a different command-line shell (command.com).
Re: query compiling wget 1.8.1 on Borland C++ 4.5
On Wed, 22 May 2002 18:04:34 +0200, Herold Heiko [EMAIL PROTECTED] wrote: Latest cvs should compile correctly with borland compilers. The latest CVS (main branch) should compile correctly with Borland C++ 5.52 (which is a free download from Borland's site), but will not compile with earlier versions. Or, the upcoming 1.8.2 release. The 1.8.2 branch is pretty similar to 1.8.1 at the moment and doesn't compile with any version of Borland C++.
Re: Wget 1.8.2-pre1 ready for testing
On Tue, 21 May 2002 19:24:01 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: [Windows '?' problem] Ian, feel free to apply the necessary change to the 1.8.2 branch. Okay, I'll do it after work today. I've been a little busy the last few days!
Re: Wget 1.8.2-pre1 ready for testing
On Tue, 21 May 2002 06:04:59 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: As promised, here comes the first (and hopefully only) pre-test for the 1.8.2 bugfix release. Get it from: http://fly.srk.fer.hr/~hniksic/wget-1.8.2-pre2.tar.gz Windows versions will still have problems saving filenames with the query character '?' in them. Should we introduce a temporary change to remap this to something else (e.g. '@') in the Windows version of Wget 1.8.2?
Re: FTP wildcards
On Fri, 17 May 2002 11:24:25 +0100, Ian Abbott [EMAIL PROTECTED] wrote: On Fri, 17 May 2002 08:34:27 +0200, Jan Klepac [EMAIL PROTECTED] wrote: I'd like to download all archive files wn16pcm.r[0..9][0..9] from the directory on ftp server but wget --passive-ftp ftp://ftp.ims.uni-stuttgart.de/pub/WordNet/1.6/wn16pcm.r* doesn't work and I cannot find what is wrong. Wget doesn't like the foreign dates in the directory listings. Any advice appreciated. I forgot to mention that you could try a different WordNet mirror. This doesn't solve the Wget problem, but Wget should cope better if you use an FTP server that uses English dates in the FTP listings.
Re: gopher support?
On Fri, 17 May 2002 12:41:21 +0200, Stephan Beyer [EMAIL PROTECTED] wrote: not interested in adding the Gopher feature to wget or should I still wait some time? I have no objections to adding gopher support, but it's up to the main developer (Hrvoje Niksic) whether it ends up in GNU Wget. I think he's a bit busy with his real job at the moment. I think your gopher code is still in its early phases of development at the moment. Maybe when some of your planned extra functionality is added it will stand a better chance of being accepted. In general, options should work the same for gopher as they do for http and ftp as much as possible. Since most of the patch is self-contained in gopher.c, you should be able to continue working on it without being affected by other changes in CVS too much. One minor comment about source layout: you have made some effort to conform to the GNU style, but your tabs are a bit screwy. Tabs should be 8 spaces, but indents before and after brackets should be 2 spaces.
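To illustrate the layout being asked for (2-space indentation steps, braces on their own lines, and tabs only where they stand for 8 columns), a trivial made-up example:

#include <stdio.h>

static void
print_even (int n)
{
  int i;

  for (i = 0; i < n; i++)
    {
      if (i % 2 == 0)
        printf ("%d\n", i);
    }
}

int
main (void)
{
  print_even (10);
  return 0;
}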
Re: question about wget flavor
On Fri, 17 May 2002 16:59:07 +0400, Pavel Stepchenko [EMAIL PROTECTED] wrote: #!/bin/sh wget="/usr/local/bin/wget -t0 -nr -nc -x --timeout=20 --wait=61 --waitretry=120" $wget ftp://nonanonymous:[EMAIL PROTECTED]/file1.zip sleep 60 $wget ftp://nonanonymous:[EMAIL PROTECTED]/file2.zip Why WGET can make a pause between 1st and 2nd retrieval? See the sleep command above. Nothing to do with Wget in this case!
Re: cookie pb: download one file on member area
On Wed, 15 May 2002 23:41:39 +0200, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi, I generate de file you wanted (with -d option). I also used --load-cookies option. The generated file can be found at: http://bigben.pointclark.net/~bertra_b/wget_debug Note: I replaced the values of the cookies by (or X). Hope this will help. If you want me to go further more, tell me. I guess the Attempt to fake the domain messages in the debug log are the major clue as to why it isn't working. The main maintainer of Wget (Hrvoje Niksic) changed the cookie domain matching rules recently and the new rules should work better. The new rules are implemented in the current development version of Wget available by anonymous CVS.
Re: wget paramter -
On Thu, 16 May 2002 12:22:42 +0200, Gurkan Sengun [EMAIL PROTECTED] wrote: what about this parameter With no FILE, or when FILE is -, read standard input. (read url's actually) This is not a bug. Please use [EMAIL PROTECTED] for feature requests. It's a nice idea, but rather than `-' it should be `-i -' as that is more consistent with the existing `-O -' usage.
Re: bug report and patch, HTTPS recursive get
On Wed, 15 May 2002 18:44:19 +0900, Kiyotaka Doumae [EMAIL PROTECTED] wrote: I found a bug of wget with HTTPS resursive get, and proposal a patch. Thanks for the bug report and the proposed patch. The current scheme comparison checks are getting messy, so I'll write a function to check schemes for similarity (when I can spare the time later today).
Re: question on printing to screen
On 12 May 2002 02:54:52 -0500, asher [EMAIL PROTECTED] wrote: hi, I've been trying to figure out how wget prints all over the screen with out using curses, and I'm hoping someone can help. from the code, I'm pretty sure it's just printing to the C-stream stderr, but I can't for the life of me figure out how it seeks or jumps around in the stream. any help would be appreciated. I assume you are referring to the progress bar. It just outputs a carriage return to return to the beginning of the current line without doing a linefeed.
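Here is a tiny standalone demonstration of the technique (nothing Wget-specific): write to stderr, end each update with '\r' instead of '\n', and flush, so the next update overwrites the previous one on the same line.

#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  int percent;

  for (percent = 0; percent <= 100; percent += 10)
    {
      /* '\r' moves the cursor back to column 0 without a line feed,
         so the next fprintf overwrites this text.  */
      fprintf (stderr, "downloaded: %3d%%\r", percent);
      fflush (stderr);
      sleep (1);
    }
  fputc ('\n', stderr);    /* finish off with a real newline */
  return 0;
}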
Re: Why must ftp_proxy be HTTP?
On Tue, 7 May 2002 17:18:57 +0800 , Fung Chai [EMAIL PROTECTED] wrote: I went through the source code (src/retr.c) of wget-1.8.1 and notice that the ftp_proxy must be HTTP; the user cannot specify it as ftp://proxy:port. In the direct mode (ie, use_proxy is set to false), retrieve_url() will use the FTP protocol to retrieve a file, but will use the HTTP protocol to retrieve the file via the proxy. Please try the current development version of Wget from the CVS repository, as this has support for FTP gateway proxy servers (FWTK-style proxies, according to the ChangeLog), but the functionality needs testing.
Re: Bug report
On Fri, 3 May 2002 18:37:22 +0200, Emmanuel Jeandel [EMAIL PROTECTED] wrote: ejeandel@yoknapatawpha:~$ wget -r a:b Segmentation fault Patient: Doctor, it hurts when I do this Doctor: Well don't do that then! Seriously, this is already fixed in CVS.
Re: problem: illegal f.s. chars in links
On Fri, 3 May 2002 14:14:37 +0200 , [EMAIL PROTECTED] wrote: Cannot write to `www.travelocity.com/Vacations/0,,TRAVELOCITY||Y,00.html@HPTRACK=icon_vac' (No such file or directory). Presumably this happens because the pipes, in particular, are illegal chars for a filename. So my question is: Correct. At least when running on Windows. Is there any chance of adding an option to translate illegal characters into legal ones both in filenames and in the links to those files? There are plans to make sure that the desired filenames get mapped to legal ones before Wget 1.9 is released, but no specific timescale that I'm aware of. There might be some options to fine tune the set of illegal characters, but the default set of illegal characters will vary depending on the platform. That should solve most problems with filenames on Windows, but probably won't deal with issues such as clashes with DOS device names such as com1, prn, nul etc., particularly as the standard set of such names can be extended willy-nilly! Links will be converted to relative links to downloaded files using their converted filenames with the -k option. (You may notice that although I'm forced to use Outlook on Windows NT to write this mail, I'm using bash and wget to do the actual work; hopefully this will improve the standing of Free software within this company, or within the QA teams at least.) With Cygwin? One possibility may be to run the tests on Linux, assuming your Linux product uses the same virus scanning algorithms and patterns as your Windows product. I'd do this myself if I knew how to write C: alas, I'm a Perl monkey myself. I thought the term was monger?
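If anyone wants to experiment before such an option exists, the core of it is just a character remapping; a rough sketch (hypothetical helper, using the usual Windows set of forbidden characters and '@' as an arbitrary replacement):

#include <stdio.h>
#include <string.h>

/* Hypothetical helper: rewrite, in place, characters that Windows
   filesystems refuse in a file name component.  The replacement
   character and the exact character set would presumably be
   configurable.  */
static void
sanitize_windows_filename (char *name, char replacement)
{
  static const char forbidden[] = "<>:\"/\\|?*";
  char *p;

  for (p = name; *p; p++)
    if (strchr (forbidden, *p) != NULL)
      *p = replacement;
}

int
main (void)
{
  char name[] = "0,,TRAVELOCITY||Y,00.html?HPTRACK=icon_vac";

  sanitize_windows_filename (name, '@');
  printf ("%s\n", name);
  return 0;
}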
Re: W32.Klez.E removal tools
On Wed, 1 May 2002 22:12:08 +0300, robots [EMAIL PROTECTED] wrote: F-Secure give you the W32.Klez.E removal tools W32.Klez.E is a dangerous virus that spread through email. For more information,please visit http://www.F-Secure.com Just in case there are one or two stupid people out there who take everything at face value, the attachment to the above message is not a disinfectant for any virus. It should come as no surprise to most people that it is in fact intended to infect your computer's Microsoft Email program with a virus (Klez.H). The real disinfectant for this virus can be found here: ftp://ftp.f-secure.com/anti-virus/tools/fsklez.exe. For more info on this tool, see: ftp://ftp.f-secure.com/anti-virus/tools/fsklez.txt. For other free virus removal programs from the same site, see: http://www.f-secure.com/download-purchase/tools.shtml. For more info on the Klez worms, see: http://www.europe.f-secure.com/v-descs/klez.shtml.
Re: HREF=//domain.com
On Mon, 29 Apr 2002 12:03:23 -0500 (CDT), you wrote: While using wget with www.slashdot.org, the site makes use of HREF's in the following manner '<A HREF="//slashdot.org/image.gif">'. It appears that when wget is following the link, it is then looking for "http://www.slashdot.org//slashdot.org/image.gif" which is incorrect. That's fixed in CVS, so you can either build and install the version in CVS or wait for the next official release. Or just apply this patch to Wget 1.8.1: Index: src/url.c === RCS file: /pack/anoncvs/wget/src/url.c,v retrieving revision 1.67 retrieving revision 1.68 diff -u -r1.67 -r1.68 --- src/url.c 2001/12/14 15:45:59 1.67 +++ src/url.c 2002/01/14 01:56:40 1.68 @@ -1575,6 +1575,37 @@ memcpy (constr + baselength, link, linklength); constr[baselength + linklength] = '\0'; } + else if (linklength > 1 && *link == '/' && *(link + 1) == '/') + { + /* LINK begins with "//" and so is a net path: we need to +replace everything after (and including) the double slash +with LINK. */ + + /* uri_merge("foo", "//new/bar") -> "//new/bar" */ + /* uri_merge("//old/foo", "//new/bar") -> "//new/bar" */ + /* uri_merge("http://old/foo", "//new/bar") -> "http://new/bar" */ + + int span; + const char *slash; + const char *start_insert; + + /* Look for first slash. */ + slash = memchr (base, '/', end - base); + /* If found slash and it is a double slash, then replace +from this point, else default to replacing from the +beginning. */ + if (slash && *(slash + 1) == '/') + start_insert = slash; + else + start_insert = base; + + span = start_insert - base; + constr = (char *)xmalloc (span + linklength + 1); + if (span) + memcpy (constr, base, span); + memcpy (constr + span, link, linklength); + constr[span + linklength] = '\0'; + } else if (*link == '/') { /* LINK is an absolute path: we need to replace everything
Re: segmentation fault on bad url
On 22 Apr 2002 at 21:38, Renaud Saliou wrote: Hi, wget -t 3 -d -r -l 3 -H --random-wait -nd --delete-after -A.jpg,.gif,.zip,.png,.pdf http://http://www.microsoft.com DEBUG output created by Wget 1.8.1 on linux-gnu. zsh: segmentation fault wget -t 3 -d -r -l 3 -H --random-wait -nd --delete-after It looks like this has been fixed in the current CVS version (actually a few days old): $ wget -t 3 -d -r -l 3 -H --random-wait -nd --delete-after \ -A.jpg,.gif,.zip,.png,.pdf http://http://www.microsoft.com DEBUG output created by Wget 1.8.1+cvs on linux-gnu. http://http://www.microsoft.com: Bad port number. FINISHED --10:36:45-- Downloaded: 0 bytes in 0 files
Re: add tar option
On 23 Apr 2002 at 18:19, Hrvoje Niksic wrote: On technical grounds, it might be hard to shoehorn Wget's mode of operation into what `tar' expects. For example, Wget might need to revisit directories in random order. I'm not sure if a tar stream is allowed to do that. You can add stuff to a tar stream in a pretty much random order - that's effectively what you get when you use tar's -r option to append to the end of an existing archive. (I used to use that with tapes quite often, once upon a time.)
Re: ScanMail Message: To Recipient virus found or matched file blocki ng setting.
On 19 Apr 2002 at 10:42, Daniel Stenberg wrote: On Fri, 19 Apr 2002, System Attendant wrote: ScanMail for Microsoft Exchange has taken action on the message, please refer to the contents of this message for further details. Please. Can the admin of this ScanMail stop polluting this list even more? Looks like it's been configured to notify everyone associated with the email about the virus. Can the admin of the wget list please prevent his mails from showing up here? I suppose one such notification could be deemed useful, but if several such notifications arrive from various ScanMails for every virus it would be more of a PITA. Of course it would be even more of a PITA if ScanMail was sending such notifications to everybody concerned for emails that *might* contain an unknown virus due to some policy setting (e.g. email contains a .bat attachment). We don't need replies on all spam mails telling us that the spam contained viruses. One could be useful, assuming people recognize the ScanMail messages and read them *before* the infected mail! (Hmm - that looks like a good way to disguise a virus - make it look like an anti-virus notification!) Of course, we wouldn't get so many if the list filtered viruses as effectively as ScanMail!
Re: Validating cookie domains
On 19 Apr 2002 at 16:30, Hrvoje Niksic wrote: To quote from there: [...] Only hosts within the specified domain can set a cookie for a domain and domains must have at least two (2) or three (3) periods in them to prevent domains of the form: .com, .edu, and va.us. Any domain that fails within one of the seven special top level domains listed below only require two periods. Any other domain requires at least three. The seven special top level domains are: COM, EDU, NET, ORG, GOV, MIL, and INT. This is amazingly stupid. It seems to make more sense if you subtract one from the number of periods. It means that `www.arsdigita.de' cannot set the cookie for `arsdigita.de'. To make *that* work, you'd have to maintain a database of domains that use .co.xxx convention, as opposed to those that use just .xxx. Could you assume that all two-letter TLDs are country-code TLDs and require one more period than other TLDs (which are presumably at least three characters long)?
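Taken literally, the quoted rule boils down to counting periods and special-casing seven top-level domains; a throwaway sketch of that check (not the code Wget uses):

#include <stdio.h>
#include <string.h>
#include <strings.h>

static int
count_periods (const char *domain)
{
  int n = 0;
  for (; *domain; domain++)
    if (*domain == '.')
      n++;
  return n;
}

/* Literal reading of the old Netscape rule: domains ending in one of
   the seven special TLDs need two periods, everything else needs
   three.  */
static int
domain_has_enough_periods (const char *domain)
{
  static const char *special[] =
    { ".com", ".edu", ".net", ".org", ".gov", ".mil", ".int" };
  size_t i, len = strlen (domain);
  int required = 3;

  for (i = 0; i < sizeof special / sizeof special[0]; i++)
    {
      size_t slen = strlen (special[i]);
      if (len >= slen && strcasecmp (domain + len - slen, special[i]) == 0)
        {
          required = 2;
          break;
        }
    }
  return count_periods (domain) >= required;
}

int
main (void)
{
  /* Prints "1 0": .arsdigita.com passes, .arsdigita.de does not.  */
  printf ("%d %d\n",
          domain_has_enough_periods (".arsdigita.com"),
          domain_has_enough_periods (".arsdigita.de"));
  return 0;
}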
Re: wget-1.8.1: build failure on SGI IRIX 6.5 with c89
On 11 Apr 2002 at 18:55, Nelson H. F. Beebe wrote: what happens if you configure it with the option --x-includes=/usr/local/include ? On SGI IRIX 6.5, in a clean directory, I unbundled wget-1.8.1.tar.gz, and did this: % env CC=c89 ./configure --x-includes=/usr/local/include % grep HAVE_NLS src/config.h #define HAVE_NLS 1 % grep HAVE_LIBINTL_H src/config.h /* #undef HAVE_LIBINTL_H */ Okay so --x-includes didn't achieve much. I thought the x might stand for 'extra', but I guess it must be for the X Window System, and therefore irrelevant to Wget. How about: % env CC=c89 CPPFLAGS='-I/usr/local/include' ./configure There's got to be some way to get this thing to build! I just tried moving libintl.h into /usr/local/include on my machine and doing something similar: bash$ CC=cc CPPFLAGS='-I/usr/local/include' ./configure and it managed to set both HAVE_NLS and HAVE_LIBINTL_H in the resulting src/config.h and it managed to build okay.
Re: Referrer Faking and other nifty features
On 12 Apr 2002 at 17:21, Thomas Lussnig wrote: So that if one fd become -1 the loader take an new url and initate the download. And than sheduling would work with the select(int,) what about this idee ? It would certainly make handling the logging output a bit of a challenge, especially the progress indication.
Re: No clobber and .shtml files
On 11 Apr 2002 at 21:00, Hrvoje Niksic wrote: This change is fine with me. I vaguely remember that this test is performed in two places; you might want to create a function. Certainly. Where's the best place for it? utils.c?
Re: No clobber and .shtml files
On 11 Apr 2002 at 21:00, Hrvoje Niksic wrote: This change is fine with me. I vaguely remember that this test is performed in two places; you might want to create a function. I've found three places where it checks the suffix, so I called a new function in all three places for consistency. One of those places performed a case-insensitive comparison so I made my function do that too. Hrvoje, you may wish to review whether checking the new extensions in all three places (but particularly recur.c) is a good idea or not before I commit the patch. src/ChangeLog entry: 2002-04-12 Ian Abbott [EMAIL PROTECTED] * utils.c (has_html_suffix_p): New function to text filename for common html extensions. * utils.h: Declare it. * http.c (http_loop): Use it instead of previous test. * retr.c (retrieve_url): Ditto. * recur.c (download_child_p): Ditto. Index: src/http.c === RCS file: /pack/anoncvs/wget/src/http.c,v retrieving revision 1.86 diff -u -r1.86 http.c --- src/http.c 2002/04/11 17:49:32 1.86 +++ src/http.c 2002/04/12 17:35:02 @@ -1405,7 +1405,7 @@ int use_ts, got_head = 0;/* time-stamping info */ char *filename_plus_orig_suffix; char *local_filename = NULL; - char *tms, *suf, *locf, *tmrate; + char *tms, *locf, *tmrate; uerr_t err; time_t tml = -1, tmr = -1; /* local and remote time-stamps */ long local_size = 0; /* the size of the local file */ @@ -1465,9 +1465,8 @@ *dt |= RETROKF; /* Bogusness alert. */ - /* If its suffix is html or htm, assume text/html. */ - if (((suf = suffix (*hstat.local_file)) != NULL) - (!strcmp (suf, html) || !strcmp (suf, htm))) + /* If its suffix is html or htm or similar, assume text/html. */ + if (has_html_suffix_p (*hstat.local_file)) *dt |= TEXTHTML; FREE_MAYBE (dummy); Index: src/recur.c === RCS file: /pack/anoncvs/wget/src/recur.c,v retrieving revision 1.43 diff -u -r1.43 recur.c --- src/recur.c 2002/02/19 06:09:57 1.43 +++ src/recur.c 2002/04/12 17:35:02 @@ -510,7 +510,6 @@ /* 6. */ { -char *suf; /* Check for acceptance/rejection rules. We ignore these rules for HTML documents because they might lead to other files which need to be downloaded. Of course, we don't know which @@ -521,14 +520,13 @@ * u-file is not (i.e. it is not a directory) and either: + there is no file suffix, -+ or there is a suffix, but is not html or htm, ++ or there is a suffix, but is not html or htm or similar, + both: - recursion is not infinite, - and we are at its very end. */ if (u-file[0] != '\0' -((suf = suffix (url)) == NULL - || (0 != strcmp (suf, html) 0 != strcmp (suf, htm)) +(!has_html_suffix_p (url) || (opt.reclevel != INFINITE_RECURSION depth = opt.reclevel))) { if (!acceptable (u-file)) Index: src/retr.c === RCS file: /pack/anoncvs/wget/src/retr.c,v retrieving revision 1.50 diff -u -r1.50 retr.c --- src/retr.c 2002/01/30 19:12:20 1.50 +++ src/retr.c 2002/04/12 17:35:03 @@ -384,12 +384,11 @@ /* There is a possibility of having HTTP being redirected to FTP. In these cases we must decide whether the text is HTML -according to the suffix. The HTML suffixes are `.html' and -`.htm', case-insensitive. */ +according to the suffix. The HTML suffixes are `.html', +`.htm' and a few others, case-insensitive. 
*/ if (redirection_count local_file u-scheme == SCHEME_FTP) { - char *suf = suffix (local_file); - if (suf (!strcasecmp (suf, html) || !strcasecmp (suf, htm))) + if (has_html_suffix_p (local_file)) *dt |= TEXTHTML; } } Index: src/utils.c === RCS file: /pack/anoncvs/wget/src/utils.c,v retrieving revision 1.44 diff -u -r1.44 utils.c --- src/utils.c 2002/01/17 01:03:33 1.44 +++ src/utils.c 2002/04/12 17:35:03 @@ -792,6 +792,30 @@ return NULL; } +/* Checks whether a filename is has a typical HTML suffix or not. The + following suffixes are presumed to be html files (case insensitive): + + html + htm + ?html (where ? is any character) + + This is not necessarily a good indication that the file actually contains + HTML! */ +int has_html_suffix_p (const char *fname) +{ + char *suf; + + if ((suf = suffix (fname)) == NULL) +return 0; + if (!strcasecmp (suf, html)) +return 1; + if (!strcasecmp (suf, htm)) +return 1; + if (suf[0] !strcasecmp (suf + 1, html)) +return 1; + return 0; +} + /* Read a line from FP and return the pointer
Re: Your Mailing List Subscription
On 12 Apr 2002 at 14:12, [EMAIL PROTECTED] wrote: IGaming Exchange and IGaming News News Letter information You have chosen to remove yourself from all of the IGaming Exchange and IGaming News email list. If you have any questions or comments about the news letters please feel free to contact [EMAIL PROTECTED] Thank you, The River City Group Team I'm not sure which helpful person subscribed [EMAIL PROTECTED] to the above mailing lists in the first place, but hopefully I've done the right thing by unsubscribing them again!
Re: wget-1.8.1: build failure on SGI IRIX 6.5 with c89
On 11 Apr 2002 at 19:14, Hrvoje Niksic wrote: Nelson H. F. Beebe [EMAIL PROTECTED] writes: c89 -I. -I. -I/opt/include -DHAVE_CONFIG_H -DSYSTEM_WGETRC=\/usr/local/etc/wgetrc\ -DLOCALEDIR=\/usr/local/share/locale\ -O -c connect.c cc-1164 c89: ERROR File = connect.c, Line = 94 Argument of type int is incompatible with parameter of type const char *. logprintf (LOG_VERBOSE, _(Connecting to %s[%s]:%hu... ), ^ cc-1164 c89: ERROR File = connect.c, Line = 97 Argument of type int is incompatible with parameter of type const char *. The argument of type int is probably an indication that the `_' macro is either undefined or expands to an undeclared function. The compiler rightfully assumes the function to return int and complains about the type mismatch. If you check why the macro is misdeclared, you'll likely discover the source of the problem. Perhaps HAVE_NLS is defined but HAVE_LIBINTL_H isn't defined. That would cause '_(string)' to expand to 'gettext (string)' but with no declaration of the gettext() function, causing the compiler to assume a default declaration of 'int gettext()'. I think we need to examine the 'config.log' file produced when running './configure'.
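To make the suspected failure mode concrete: the `_' macro is normally arranged roughly as below (simplified, not Wget's exact headers). If HAVE_NLS is defined but libintl.h never gets included, gettext() ends up undeclared, an old C compiler assumes it returns int, and every _(...) argument then looks like an int to the caller, which matches the errors quoted above.

/* Simplified sketch of the usual gettext glue. */
#ifdef HAVE_NLS
# ifdef HAVE_LIBINTL_H
#  include <libintl.h>              /* declares char *gettext (const char *); */
# endif
# define _(string) gettext (string) /* undeclared if libintl.h was skipped */
#else
# define _(string) (string)
#endif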
Re: LAN with Proxy, no Router
On 10 Apr 2002 at 3:09, Jens Rösner wrote: wgetrc works fine under windows (always has) however, .wgetrc is not possible, but maybe . does mean in root dir under Unix? The code does different stuff for Windows. Instead of looking for '.wgetrc' in the user's home directory, it looks for a file called 'wget.ini' in the directory that contains the executable. This does not seem to be mentioned anywhere in the documentation.
Re: -nv option; printing out infos via stderr [http://bugs.debian.org/141323]
On 9 Apr 2002 at 10:34, Hrvoje Niksic wrote: Ian Abbott [EMAIL PROTECTED] writes: On 5 Apr 2002 at 18:17, Noel Koethe wrote: Will this be changed so the user could use -nv with /dev/null and get only errors or warnings displayed? So what I think you want is for any log message tagged as LOG_VERBOSE (verbose information) or LOG_NONVERBOSE (basic information) in the source to go to stdout when no log file has been specified and the `-O -' option has not been used and for everything else to go to stderr? That change sounds dangerous. Current Wget output doesn't really have a concept of errors that would be really separate from other output; it only operates on the level of verbosity. This was, of course, a bad design decision, and I agree that steps need to be taken to change it. I'm just not sure that this is the right step. Neither am I, but I knocked up the patch on a whim. Suddenly `wget -o X' is no longer equivalent to `wget 2>x', which violates the Principle of Least Surprise. Perhaps we just need a --log-level=N option: Level 0: output just the LOG_ALWAYS messages. Level 1: output the above and LOG_NOTQUIET messages. Level 2: output the above and LOG_NONVERBOSE messages. Level 3: output the above and LOG_VERBOSE messages. The --verbose option would be equivalent to --log-level=3 (the default). The --non-verbose option would be equivalent to --log-level=2. The --quiet option would be equivalent to --log-level=1. Noel would specify --log-level=1 to get the output he wants. How does that sound?
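At the point where each message is emitted, the whole option would boil down to one comparison; a rough sketch, with a made-up ordering and names (the real enum in wget.h is laid out differently):

#include <stdio.h>

/* Hypothetical ordering: higher value = less important message. */
enum log_options { LOG_ALWAYS, LOG_NOTQUIET, LOG_NONVERBOSE, LOG_VERBOSE };

/* A message tagged O is printed only if the requested level admits it,
   e.g. --log-level=1 lets through LOG_ALWAYS and LOG_NOTQUIET only.  */
static int
message_wanted_p (enum log_options o, int log_level)
{
  return (int) o <= log_level;
}

int
main (void)
{
  printf ("%d %d\n",
          message_wanted_p (LOG_NOTQUIET, 1),   /* 1: printed */
          message_wanted_p (LOG_VERBOSE, 1));   /* 0: suppressed */
  return 0;
}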
Re: getting time stamp via FTP
On 8 Apr 2002 at 11:43, Urs Thuermann wrote: Please CC: any answers to my email address, since I'm not on this list. I'd like wget to get the time stamp of a file that is downloaded via FTP and to set the mtime after writing the file to the local disk. When using HTTP, this already happens, i.e. when doing a wget http://host/file the file has the same time stamp in the local file system as on the remote server, but not with FTP. FTP supports the MODTIME command to get the time stamp of a file from the server. Could wget be changed to use this? the modtime command supported by some clients uses an FTP extension (MDTM). How widely is this supported by FTP servers? Wget recently adopted use of another extension (SIZE) and has long supported another extension (REST), so it could potentially adopt other extensions if commonly used. Currently, Wget extracts the timestamp from a directory listing of the file, but that doesn't always work, as the format for the directory listing is not standardized. Ideally, I think Wget should only have to fall back on old-style directory listings as a last resort, but that will have to wait a few years for newer mechanisms to be standardized and commonly adopted (i.e. the MLST/MLSD extensions). These links may be useful: http://www.ietf.org/html.charters/ftpext-charter.html http://www.ietf.org/internet-drafts/draft-ietf-ftpext-mlst-15.txt
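For reference, servers that implement MDTM answer with a 213 reply carrying a GMT timestamp, something like "213 20020408114300"; turning that into a broken-down time is straightforward (a sketch, ignoring error handling and timezone subtleties):

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Parse a "213 YYYYMMDDHHMMSS" MDTM reply into a broken-down time.
   Returns 0 on success, -1 on a malformed reply.  */
static int
parse_mdtm_reply (const char *reply, struct tm *tm)
{
  int year, mon, day, hour, min, sec;

  if (sscanf (reply, "213 %4d%2d%2d%2d%2d%2d",
              &year, &mon, &day, &hour, &min, &sec) != 6)
    return -1;
  memset (tm, 0, sizeof *tm);
  tm->tm_year = year - 1900;
  tm->tm_mon = mon - 1;
  tm->tm_mday = day;
  tm->tm_hour = hour;
  tm->tm_min = min;
  tm->tm_sec = sec;
  return 0;
}

int
main (void)
{
  struct tm tm;

  if (parse_mdtm_reply ("213 20020408114300", &tm) == 0)
    printf ("%04d-%02d-%02d %02d:%02d:%02d GMT\n",
            tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
            tm.tm_hour, tm.tm_min, tm.tm_sec);
  return 0;
}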
Re: getting time stamp via FTP
On 9 Apr 2002 at 16:52, Ian Abbott wrote: Wget recently adopted use of another extension (SIZE) and has long supported another extension (REST), so it could potentially adopt other extensions if commonly used. Correction: 'REST' is a standard FTP protocol command, not an extension.
Re: forcing file overwrite
On 4 Apr 2002 at 17:13, Matthew Boedicker wrote: I am trying to wget Apache log files (via ftp) and since the new file will always contain at least the old, I want it to overwrite the file each time. Is there any way to do this? If there isn't, may I suggest it as a new option? I agree a new option to force clobbering would be nice. In the meantime, a workaround for your case would be to use the -N (--timestamping) option, which should have the desired effect.
Re: -nv option; printing out infos via stderr [http://bugs.debian.org/141323]
On 5 Apr 2002 at 18:17, Noel Koethe wrote: Will this be changed so the user could use -nv with /dev/null and get only errors or warnings displayed? So what I think you want is for any log message tagged as LOG_VERBOSE (verbose information) or LOG_NONVERBOSE (basic information) in the source to go to stdout when no log file has been specified and the `-O -' option has not been used and for everything else to go to stderr? I'm not sure what Hrvoje Niksic thinks of that idea, but here is a source code patch to accomplish it. I'd like some second opinions (preferably from Hrvoje) before committing it. The patch does not include any documentation changes - these will follow if the patch is committed. N.B. The patch contains a form-feed. I'm not sure if that will survive the email passage. 2002-04-05 Ian Abbott [EMAIL PROTECTED] * wget.h (enum log_options): Set order to `LOG_VERBOSE', `LOG_NONVERBOSE', `LOG_NOTQUIET', `LOG_ALWAYS' to reflect relative importance of the log messages to which they are associated. * log.c (get_log_fp): Add parameter to indicate logging level. If a log file is not being used, send `LOG_VERBOSE' and `LOG_NONVERBOSE' logs to `stdout' instead of to `stderr', except when output documents are going to `stdout'. (logputs): Pass logging level to `get_log_fp()'. (logvprintf_state): Include logging level in the state. (logvprintf): Pass logging level (from passed state) to `get_log_fp()'. (logflush): If some logs go to `stderr' and some to `stdout', ensure that both streams get flushed. (logprintf): Put logging level in state passed to `logvprintf()'. (debug_logprintf): Put `LOG_VERBOSE' logging level in state passed to `logvprintf()'. (log_init): If no log file specified, don't set `logfp' to `stderr' - leave it set to NULL so that `get_log_fp()' can decide whether to return `stdout' or `stderr' based on the logging level (and other factors). In this case, ensure logs get saved to memory if either of `stderr' or `stdout' is a TTY. (log_dump_context): Use `logfp' value directly instead of calling `get_log_fp()'. Index: src/log.c === RCS file: /pack/anoncvs/wget/src/log.c,v retrieving revision 1.12 diff -u -r1.12 log.c --- src/log.c 2001/12/19 09:36:58 1.12 +++ src/log.c 2002/04/05 18:03:44 @@ -287,12 +287,16 @@ If logging is inhibited, return NULL. */ static FILE * -get_log_fp (void) +get_log_fp (enum log_options o) { if (inhibit_logging) return NULL; if (logfp) return logfp; + if (opt.dfp == stdout) +return stderr; + if (o LOG_NOTQUIET) +return stdout; return stderr; } @@ -305,7 +309,7 @@ FILE *fp; check_redirect_output (); - if (!(fp = get_log_fp ())) + if (!(fp = get_log_fp (o))) return; CHECK_VERBOSE (o); @@ -322,6 +326,7 @@ char *bigmsg; int expected_size; int allocated; + enum log_options o; }; /* Print a message to the log. 
A copy of message will be saved to @@ -341,7 +346,7 @@ char *write_ptr = smallmsg; int available_size = sizeof (smallmsg); int numwritten; - FILE *fp = get_log_fp (); + FILE *fp = get_log_fp (state-o); if (!save_context_p) { @@ -411,9 +416,12 @@ void logflush (void) { - FILE *fp = get_log_fp (); - if (fp) -fflush (fp); + FILE *fp1 = get_log_fp (LOG_VERBOSE); + FILE *fp2 = get_log_fp (LOG_ALWAYS); + if (fp1) +fflush (fp1); + if (fp2 (fp2 != fp1)) +fflush (fp2); needs_flushing = 0; } @@ -497,6 +505,7 @@ CHECK_VERBOSE (o); memset (lpstate, '\0', sizeof (lpstate)); + lpstate.o = o; do { VA_START_2 (enum log_options, o, char *, fmt, args); @@ -532,6 +541,7 @@ return; memset (lpstate, '\0', sizeof (lpstate)); + lpstate.o = LOG_VERBOSE; do { VA_START_1 (char *, fmt, args); @@ -559,13 +569,10 @@ } else { - /* The log goes to stderr to avoid collisions with the output if - the user specifies `-O -'. Francois Pinard suggests - that it's a better idea to print to stdout by default, and to - stderr only if the user actually specifies `-O -'. He says - this inconsistency is harder to document, but is overall - easier on the user. */ - logfp = stderr; + /* LOG_NOTQUIET and LOG_ALWAYS logs will go to stdwrr. Other logs + will go to stdout unless the user specifies `-O -'. This allows + the user to redirect standard output but still see errors and + warnings if standard error is a TTY. */ /* If the output is a TTY, enable storing, which will make Wget remember all the printed messages, to be able to dump them to @@ -573,7 +580,7 @@ Ctrl+Break is pressed under Windows). */ if (1 #ifdef
Re: URI-parsing bug
On 4 Apr 2002 at 5:51, Tristan Horn wrote: Just wanted to point out that as of version 1.8.1, wget doesn't correctly recognize <A HREF="//foo/bar">-style links. tris.net/index.html: merge("http://tris.net/", "//www.arrl.org/") -> http://tris.net//www.arrl.org/ (it should return http://www.arrl.org/) There haven't been any releases since 1.8.1, but this bug is fixed in the current CVS version.
Re: Serious bug in recursive retrieval behaviour occured in v. 1.8
On 4 Apr 2002 at 13:21, Robert Mücke wrote: So it seems to be important to correct this behaviour. I think you only need to set up a test site (maybe with some subdirs) containing one file with an errorous href= tag to reproduce this (maybe only in parts depending on your server configuration). I couldn't reproduce this with wget 1.8 and a local Apache server (but I didn't attempt to reconfigure Apache in an attempt to reproduce it). A few recursive retrieval bugs were fixed in wget 1.8.1. Is it possible for you to test that version? (You may want to limit the recursion depth and the maximum amount to download if repeating the test!)
Re: cuj.com file retrieving fails -why?
On 3 Apr 2002 at 14:56, Markus Werle wrote: Jens Rösner wrote: So, I do not know what your problem is, but is neither wget's nor cuj's fault, AFAICT. :-( I've just built Wget 1.7 on Linux and it seemed to download your problem file okay. So I don't know what your problem is either!
Re: cuj.com file retrieving fails -why?
On 3 Apr 2002 at 17:09, Markus Werle wrote: Ian Abbott wrote: On 3 Apr 2002 at 14:56, Markus Werle wrote: I've just built Wget 1.7 on Linux and it seemed to download your problem file okay. So I don't know what your problem is either! Ah! The kind of problem I like most! Did You have a special .wgetrc? Nothing special. $HOME/.wgetrc : robots = off system wgetrc : # Comments stripped out passive_ftp = on waitretry = 10
Re: spanning hosts
On 28 Mar 2002 at 18:01, Jens Rösner wrote: I came across a crash caused by a cookie two days ago. I disabled cookies and it worked. I'm hoping you had debug output on when it crashed, otherwise this is a different crash to the one I already know about. Can you confirm this, please? Yes, I had debug output on. Thanks for the confirmation. wget -nc -x -r -l0 -t10 -H -Dstory.de,audi -o example.log -k -d -R.gif,.exe,*tn*,*thumb*,*small* -F -i example.html Result with 1.8.1 and 1.7.1 with -nh: audistory.com: Only index.html audistory.de: Everything audi100-online: only the first page kolaschnik.de: only the first page Yes, that's how I thought it would behave. Any URLs specified on the command line or in a --include-file file are always downloaded regardless of the domain acceptance rules. Well, one page of a rejected URL is downloaded, not more. Whereas the only accepted domain audistory.de gets downloaded completely. Doesn't this differ from what you just said? Well I only said the URLs specified on the command line or by the --include-file option are always downloaded. I didn't intend this to be interpreted as also applying to URLs which Wget finds while examining the contents of the downloaded html files. At the moment, the domain acceptance/rejection checks are only performed when downloaded html files are examined for further URLs to be downloaded (for the --recursive and --page-requisites options), which is why it behaves as it does. Agreed! How about introducing wildcards like -Dbar.com behaves strictly: www.bar.com, www2.bar.com -D*bar.com behaves like now: www.bar.com, www2.bar.com, www.foobar.com -D*bar.com* gets www.bar.com, www2.bar.com, www.foobar.com, sex-bar.computer-dating.com That would leave current command lines operational and introduce many possibilities without (too much) fuss. Or have I overlooked anything here? It sounds like it should work okay. I'd prefer to let -Dbar.com also match fubar.com for compatibility's sake. If you wanted to match www.bar.com and www2.bar.com, but not www.fubar.com you could use -D.bar.com, but that wouldn't work if you wanted to match bar.com without the www (well, a leading . could be treated as a special case). It would be easiest and more consistent (currently) to use shell-globbing wildcards (as used for the file-acceptance rules) rather than grep/egrep-style wildcards.
Re: about wget and put
On 31 Mar 2002 at 14:23, ¶À«¾§ wrote: may I ask some question? do wget offer put function? (FTP put) No current version of wget offers this function. I need wget function, but reverse way, like put... can wget do it? or is there any tool offer this? There is a command-line tool called curl which can get and put by HTTP and FTP. There is another command-line program called lftp which will also do this. ps. I need put the newer or modified files, by automatically judge...like wget does... I don't think either program does that.
Re: wget parsing JavaScript
On 26 Mar 2002 at 19:33, Tony Lewis wrote: I wrote: wget is parsing the attributes within the script tag, i.e., <script src="url">. It does not examine the content between <script> and </script>. and Ian Abbott responded: I think it does, actually, but that is mostly harmless. You're right. What I meant was that it does not examine the JavaScript looking for URLs. It won't examine the file downloaded via <script src="ascript.js"> (unless the HTTP response claimed it had a MIME type of text/html for some reason!), but it will examine the contents between a <script> and a </script> tag. For example, a recursive retrieval on a page like this: <html> <body> <script> <a href="foo.html">foo</a> </script> </body> </html> will retrieve foo.html, regardless of the <script>...</script> tags.
Re: wget parsing JavaScript
On 26 Mar 2002 at 7:05, Tony Lewis wrote: Csaba Ráduly wrote: I see that wget handles SCRIPT with tag_find_urls, i.e. it tries to parse whatever it's inside. Why was this implemented ? JavaScript is most used to construct links programmatically. wget is likely to find bogus URLs until it can properly parse JavaScript. wget is parsing the attributes within the script tag, i.e., <script src="url">. It does not examine the content between <script> and </script>. I think it does, actually, but that is mostly harmless. I haven't heard of any cases where it has caused a problem (assuming the script is well-formed). It's normal good practice to hide the code in a HTML comment anyway, but perhaps that good practice is less common these days now that virtually every browser out there groks <SCRIPT></SCRIPT> and <NOSCRIPT></NOSCRIPT>. Wget's HTML parser doesn't yet have the hooks to allow different elements (such as SCRIPT and STYLE) to be processed differently to normal HTML. If it gets these hooks it could then go off and process the SCRIPT element differently. (The minimal processing for the SCRIPT element, if it is using an unsupported script language, would be to skip it.) If a future version of Wget were to handle JavaScript as an option (perhaps using the GPL'd SpiderMonkey), it would have to parse the default action of the script and also possibly exercise the various event handlers to gather more URLs. I guess this would fail on the more complicated scripts that expect some sort of intelligent being (or a suitably programmed robot) to fill in forms and/or press buttons in the correct sequence to progress to the next page!
Re: spanning hosts: 2 Problems
On 26 Mar 2002 at 19:01, Jens Rösner wrote: I am using wget to parse a local html file which has numerous links into the www. Now, I only want hosts that include certain strings like -H -Daudi,vw,online.de It's probably worth noting that the comparisons between the -D strings and the domains being followed (or not) are anchored at the ends of the strings, i.e. -Dfoo matches bar.foo but not foo.bar. Two things I don't like in the way wget 1.8.1 works on windows: The first page of even the rejected hosts gets saved. That sounds like a bug. This messes up my directory structure as I force directories (which is my default and normally useful). I am aware that wget has switched to breadth-first (as opposed to depth-first) retrieval. Now, with downloading from many (20+) different servers, this is a bit frustrating, as I will probably have the first completely downloaded site in a few days... Would that be less of a problem if the first problem (first page from rejected domains) was fixed? Is there any other way to work around this besides installing wget 1.6 (or even 1.5?) No, but note that if you pass several starting URLs to Wget, it will complete the first before moving on to the second. That also works for the URLs in the file specified by the --input-file parameter. However, if all the sites are interlinked, you would be no better off with this. The other alternative is to run wget several times in sequence with different starting URLs and restrictions, perhaps using the --timestamping or --no-clobber options to avoid downloading things more than once.
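The anchoring described above amounts to nothing more than a case-insensitive suffix comparison; roughly (a sketch, not Wget's actual matching code):

#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Return non-zero if DOMAIN ends with PATTERN (case-insensitively),
   i.e. the match is anchored at the end of the string, so "foo"
   matches "bar.foo" but not "foo.bar".  */
static int
domain_suffix_match (const char *pattern, const char *domain)
{
  size_t plen = strlen (pattern), dlen = strlen (domain);

  if (plen > dlen)
    return 0;
  return strcasecmp (domain + dlen - plen, pattern) == 0;
}

int
main (void)
{
  /* Prints "1 0". */
  printf ("%d %d\n",
          domain_suffix_match ("foo", "bar.foo"),
          domain_suffix_match ("foo", "foo.bar"));
  return 0;
}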
Re: OK, time to moderate this list
On 22 Mar 2002 at 4:08, Hrvoje Niksic wrote: The suggestion of having more than one admin is good, as long as there are people who volunteer to do it besides me. I'd volunteer too, but don't want to be the only person moderating the lists for the same reasons as yourself. (I'm also completely clueless about the process of moderating mailing lists at the moment!) I also have to check with the sunsite.dk people whether the ML manager, ezmlm, can handle this. If it only handles a single moderator account, perhaps a secure web-based email account could be set up for moderation purposes which the real moderators could log into on a regular basis.
Re: Wget and Symantec Web Security
On 19 Mar 2002 at 22:53, Löfstrand Thomas wrote: I use wget to get files from a FTP server. The proxy server is Symantecs web security 2.0 product for solaris which has a antivirus function. I have used wget with -d option to see what is going on, and it seems like the proxyserver returns the following response: X-PLEASE_WAIT. After reading the source code in http.c it seems like wget expects the answer from the proxy to be HTTP/ and a version number. Is there any easy way to bypass this response part or to make a little bit of coding so I can accept the X-PLEASE-WAIT String? Your proxy server has a broken HTTP implementation. Does this temporary patch to Wget 1.8.1 work around the problem? --- src/http.c.old Thu Mar 21 17:43:25 2002 +++ src/http.c Thu Mar 21 18:01:15 2002 @@ -949,6 +949,16 @@ if (hcount == 1) { const char *error; + + /* TEMPORARY PATCH */ + /* Check for broken Symantec Web Security proxy. */ + if (strncmp(hdr, "X-PLEASE_WAIT", 13) == 0) + { + hcount--; + goto done_header; + } + /* TEMPORARY PATCH */ + /* Parse the first line of server response. */ statcode = parse_http_status_line (hdr, &error); hs->statcode = statcode;
Re: wget1.8.1's patches for using the free Borland C++Builder compile r
On 12 Mar 2002 at 3:18, sr111 wrote: I have to modify some files in order to build win32 port of wget using the free Borland C++Builder compiler. Please refer to the attachment file for the details. I've modified Chin-yuan Kuo's patch for the current CVS. It builds fine with the free Borland C++Builder compiler. I also tried to build it with the Borland C++ Release 5.0 compiler but ran into problems compiling src/utils.c on the following lines 1499-1500: wt-wintime.HighPart = ft.dwHighDateTime; wt-wintime.LowPart = ft.dwLowDateTime; (Those errors were nothing to do with Chin-yuan Kuo's patch.) Chin-yuan's patch distances the support for Borland's compilers further away from the Release 5.0 (and earlier) compilers, but since the C++Builder compiler can be downloaded for free I don't think support for the older compilers is that much of an issue (apart from making a little more work for the MS-DOS porters). If there are no objections from the Win32 maintainers, I'll apply the updated patch to CVS tomorrow. Chin-yuan did not submit any ChangeLog entries, so here is my attempt at some: main ChangeLog entry: 2002-03-18 Chin-yuan Kuo [EMAIL PROTECTED] * configure.bat.in: Do not check %BORPATH% as C++Builder compiler does not use it. * windows/Makefile.src.bor: * windows/config.h.bor: Migrate to free C++Builder compiler. And here is the updated patch: Index: configure.bat.in === RCS file: /pack/anoncvs/wget/configure.bat.in,v retrieving revision 1.1 diff -u -r1.1 configure.bat.in --- configure.bat.in2002/03/13 19:47:26 1.1 +++ configure.bat.in2002/03/18 20:31:51 @@ -20,8 +20,7 @@ if .%1 == .--borland goto :borland if .%1 == .--msvc goto :msvc if .%1 == .--watcom goto :watcom -if not .%BORPATH% == . goto :borland -if not .%1 == . goto :usage +goto :usage :msvc copy windows\config.h.ms src\config.h nul @@ -58,5 +57,5 @@ goto :end :usage -echo Usage: Configure [--borland | --msvc | --watcom] +echo Usage: configure [--borland | --msvc | --watcom] :end Index: windows/Makefile.src.bor === RCS file: /pack/anoncvs/wget/windows/Makefile.src.bor,v retrieving revision 1.4 diff -u -r1.4 Makefile.src.bor --- windows/Makefile.src.bor2001/12/04 10:33:18 1.4 +++ windows/Makefile.src.bor2002/03/18 20:31:52 @@ -2,16 +2,16 @@ ## Makefile for use with watcom win95/winnt executable. CC=bcc32 -LINK=tlink32 +LINK=ilink32 LFLAGS= -CFLAGS=-DWINDOWS -DHAVE_CONFIG_H -I. -H -H=wget.csm -w- +CFLAGS=-DWINDOWS -DHAVE_CONFIG_H -I. 
-H -H=wget.csm -w- -O2 ## variables OBJS=cmpt.obj connect.obj fnmatch.obj ftp.obj ftp-basic.obj \ - ftp-ls.obj ftp-opie.obj getopt.obj headers.obj host.obj html.obj \ + ftp-ls.obj ftp-opie.obj getopt.obj headers.obj host.obj html-parse.obj html- url.obj \ http.obj init.obj log.obj main.obj gnu-md5.obj netrc.obj rbuf.obj \ - alloca.obj \ + safe-ctype.obj hash.obj progress.obj gen-md5.obj cookies.obj \ recur.obj res.obj retr.obj url.obj utils.obj version.obj mswindows.obj LIBDIR=$(MAKEDIR)\..\lib @@ -20,7 +20,9 @@ $(LINK) @| $(LFLAGS) -Tpe -ap -c + $(LIBDIR)\c0x32.obj+ -alloca.obj+ +cookies.obj+ +hash.obj+ +safe-ctype.obj+ version.obj+ utils.obj+ url.obj+ @@ -37,7 +39,8 @@ log.obj+ init.obj+ http.obj+ -html.obj+ +html-parse.obj+ +html-url.obj+ host.obj+ headers.obj+ getopt.obj+ Index: windows/config.h.bor === RCS file: /pack/anoncvs/wget/windows/config.h.bor,v retrieving revision 1.3 diff -u -r1.3 config.h.bor --- windows/config.h.bor2001/11/29 14:15:10 1.3 +++ windows/config.h.bor2002/03/18 20:31:52 @@ -19,6 +19,10 @@ #ifndef CONFIG_H #define CONFIG_H +#define HAVE_MEMMOVE +#define ftruncate chsize +#define inline __inline + /* Define if you have the alloca.h header file. */ #undef HAVE_ALLOCA_H @@ -33,7 +37,7 @@ #pragma alloca # else # ifndef alloca /* predefined by HP cc +Olibcalls */ -char *alloca (); +#include malloc.h # endif # endif # endif @@ -177,7 +181,7 @@ #define HAVE_BUILTIN_MD5 1 /* Define if you have the isatty function. */ -#undef HAVE_ISATTY +#define HAVE_ISATTY #endif /* CONFIG_H */
(Fwd) Proposed new --unfollowed-links option for wget
This seems more appropriate for the main Wget list. The wget-patches list is for patches! --- Forwarded message follows --- From: Tony Lewis [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: Proposed new --unfollowed-links option for wget Date sent: Thu, 7 Mar 2002 23:41:15 -0800 Last night I was roaming through Google looking for a program to let me grab chunks of a web site and found a reference to wget. After reading the manual, I downloaded and built it and found that it does almost everything I needed. There are two features that I need that are missing. One of them is getting a list of the links that were not followed by wget. (The other is the subject of another message.) I have skimmed a few GNU programs in the past and found the source for wget pretty easy to follow. I was able to implement this feature today by adding the following command line argument: -u, --unfollowed-links=FILE log unfollowed links to FILE. Having used the option on a couple of sites that I maintain, I have already found it very useful. For example: after running "wget --mirror -uexternal http://www.mysite.com", I have a list of all the external references made by my site in the file 'external'. Unfortunately, I made all my changes directly to the distribution sources before I stumbled across the long list of instructions for using CVS. Before I redo the changes following the CVS route, I'd like to know a little bit more about the process for getting a submission approved for inclusion in a future version (particularly in light of this change grabbing one of the seven -- by my count -- remaining single-letter command line arguments). Also, is there some sort of regression test suite that I should run? Tony Lewis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --- End of forwarded message ---
(Fwd) Processing of JavaScript
--- Forwarded message follows --- From: Tony Lewis [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject:Processing of JavaScript Date sent: Fri, 8 Mar 2002 00:04:43 -0800 Some web sites include URL references within JavaScript. Poorly designed sites (including one of my own, I must confess) build significant site navigation features in script. Has anyone thought about what it would take to have wget parse the JavaScript looking for urls? I have looked briefly (very briefly) at SpiderMonkey and I suspect it could be integrated with wget. Thoughts? Tony Lewis PS) I'm done for tonight! ;-) --- End of forwarded message ---
(Fwd) Automatic posting to forms
--- Forwarded message follows --- From: Tony Lewis [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: Automatic posting to forms Date sent: Thu, 7 Mar 2002 23:43:28 -0800 As promised in my earlier note, there is a second feature I'm looking for in wget. This feature is the ability to automatically post to forms. I'm thinking of something along the lines of a command line argument like: --auto-post=FILE where FILE would contain data such as:

form=/cgi-bin/auth.cgi
name=id value=tony
name=pw value=password

With this information, any time that wget encounters a form whose action is /cgi-bin/auth.cgi, it will enqueue the submission of the form using the values provided for the fields id and pw. Before I go too deep into making this change, I'd like some feedback. I know that I will need to change:
- get_urls_html to look for a FORM tag whose action attribute matches the auto-post file
- retrieve_tree to be able to POST as well as GET
- main and initialize to deal with the new command line argument
Is there anything else that seems obvious that I'm overlooking? Any cautions about the sections of code I'll be working with? Tony Lewis - To unsubscribe, e-mail: wget-patches- [EMAIL PROTECTED] For additional commands, e-mail: wget-patches- [EMAIL PROTECTED] --- End of forwarded message ---
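[Editor's sketch: a rough idea of how such a FILE could be read. The file layout and the function name are just the ones proposed in the message above; nothing like this exists in Wget today.]

  #include <stdio.h>
  #include <string.h>

  /* Print the form action and the field/value pairs found in the
     proposed --auto-post FILE, assuming the "form=" / "name=... value=..."
     layout shown in the message above.  */
  static void
  dump_auto_post_file (const char *path)
  {
    char line[1024], field[256], value[256];
    FILE *fp = fopen (path, "r");
    if (!fp)
      return;
    while (fgets (line, sizeof line, fp))
      {
        line[strcspn (line, "\r\n")] = '\0';   /* strip the line ending */
        if (strncmp (line, "form=", 5) == 0)
          printf ("form action: %s\n", line + 5);
        else if (sscanf (line, "name=%255s value=%255s", field, value) == 2)
          printf ("  %s = %s\n", field, value);
      }
    fclose (fp);
  }

A real implementation would of course store the entries in a structure that the form-matching code could consult rather than printing them.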
Re: reading HTML input-files (WITH ATTACHMNT!)
On 8 Mar 2002 at 10:50, Mathias Kratzer wrote: I admit that the lines in my original file contain a really stupid syntax error. As an absolute beginner with the Markup Languages I have just tried to learn from some hyperlink examples but obviously misunderstood their formal structure. Nevertheless, Wget 1.5.2 did recognize my URLs! Well, as you noted, the HTML parser was rewritten for Wget 1.7, so it is not too surprising that it would behave differently for erroneous input! So does Wget 1.7 after I've changed the lines to SGML format. However, I feel obliged to inform you that XML format didn't solve the problem. Ah yes, the XML (XHTML) form was not supported until Wget 1.8 or 1.8.1 (I can't remember which, and can't be arsed to find out at the moment!).
Re: reading HTML input-files (WITH ATTACHMNT!)
On 7 Mar 2002 at 17:50, Mathias Kratzer wrote: While calling Wget 1.5.2 by wget -F -O 69_4_522_Ref.res -i 69_4_522_Ref.mrq on the attached file 69_4_522_Ref.mrq has worked very well I am left with the error message No URLs found in 69_4_522_Ref.mrq whenever I try the same command using Wget 1.7. Even embedding the content of 69_4_522_Ref.mrq into a HTML4 frame (i.e. DOCTYPE-header, html-, head- and body-tags) did not help. Can you tell me what I am doing wrong? The file 69_4_522_Ref.mrq contains several lines of the form: <a href="url"</a> which looks pretty invalid to me. Perhaps you need to change them to: <a href="url"/> (XML format) or: <a href="url"></a> (SGML format)
Re: retr.c:253: calc_rate: Assertion `msecs >= 0' failed.
On 6 Mar 2002 at 12:43, Mats Palmgren wrote: I have a cron job that downloads Mozilla every night using wget. Last night I got: wget: retr.c:253: calc_rate: Assertion `msecs >= 0' failed. I think this can happen if the system time is reset backwards while wget is downloading stuff.
Re: wget info page
On 20 Feb 2002 at 12:54, Noel Koethe wrote: wget 1.8.1 is shipped with the files in doc/ wget.info wget.info-1 wget.info-2 wget.info-3 wget.info-4 They are build out of wget.texi if I remove them and makeinfo is installed. The files are removed when runing make realclean. I think they should/could also removed when runing make distclean, or am I missing an important point? Perhaps they are included in the distribution in case the system does not have the tools to rebuild them? However, the presence of wget.info* in the distribution does seem inconsistent with the absence of the wget.1 manpage file.
No clobber and .shtml files
Here is a patch for a potential feature change. I'm not sending it to the wget-patches list yet, as I'm not sure if it should be applied as is, or at all. The feature change is a minor amendment to the (bogus) test for whether or not an existing local copy of a file is text/html when the --noclobber option is used, based on its suffix. The current test assumes the local file is text/html if it has a suffix of "html" or "htm". The amendment made by this patch includes suffixes of the form "shtml", "phtml", etc. in the set of suffixes assumed to indicate text/html files. As it stands, the new test treats any "?html" suffix (where ? matches a single character) as indicating a text/html file. Perhaps this test should be tightened up to only allow a letter rather than any character in this position. I didn't bother testing for "?htm", as I've never seen it and can't think why anyone would want to use it. (However, I do recall seeing suffixes such as "sht" before now, i.e. "shtml" truncated to 3 characters, but perhaps that's going too far.) Any comments?

Index: src/http.c
===
RCS file: /pack/anoncvs/wget/src/http.c,v
retrieving revision 1.85
diff -u -r1.85 http.c
--- src/http.c 2002/02/19 05:18:43 1.85
+++ src/http.c 2002/02/20 19:25:34
@@ -1462,8 +1462,10 @@
       /*  Bogusness alert.  */
       /* If its suffix is "html" or "htm", assume text/html.  */
-      if (((suf = suffix (*hstat.local_file)) != NULL)
-          && (!strcmp (suf, "html") || !strcmp (suf, "htm")))
+      /* Also assume text/html if its suffix is "shtml", "phtml", etc.  */
+      if (((suf = suffix (*hstat.local_file)) != NULL) && *suf
+          && (!strcmp (suf, "html") || !strcmp (suf, "htm")
+              || !strcmp(suf+1, "html")))
        *dt |= TEXTHTML;

       FREE_MAYBE (dummy);
Re: wget bug?!
[The message I'm replying to was sent to [EMAIL PROTECTED]. I'm continuing the thread on [EMAIL PROTECTED] as there is no bug and I'm turning it into a discussion about features.] On 18 Feb 2002 at 15:14, TD - Sales International Holland B.V. wrote: I've tried -w 30 --waitretry=30 --wait=30 (I think this one is for multiple files and the time in between those though) None of these seem to make wget wanna wait for 30 secs before trying again. Like this I'm hammering the server. The --waitretry option will wait for 1 second for the first retry, then 2 seconds, 3 seconds, etc. up to the value specified. So you may consider the first few retry attempts to be hammering the server but it will gradually back off. It sounds like you want an option to specify the initial retry interval (currently fixed at 1 second), but Wget currently has no such option, nor an option to change the amount it increments by for each retry attempt (also currently fixed at 1 second). If such features were to be added, perhaps it could work something like this:

--waitretry=n     - same as --waitretry=n,1,1
--waitretry=n,m   - same as --waitretry=n,m,1
--waitretry=n,m,i - wait m seconds for the first retry, incrementing by i seconds for subsequent retries up to a maximum of n seconds

The disadvantage of doing it that way is that no-one will remember which order the numbers should appear, so an alternative is to leave --waitretry alone and supplement it with --waitretryfirst and --waitretryincr options.
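[Editor's note: purely for illustration, the proposed n,m,i behaviour amounts to something like the helper below. This is a sketch of the proposal only; the option does not exist and this is not Wget code.]

  /* Wait m seconds for the first retry and add i seconds for each
     subsequent retry, never exceeding n seconds.  retry_count >= 1.  */
  static long
  retry_wait (int retry_count, long n, long m, long i)
  {
    long wait = m + (long) (retry_count - 1) * i;
    return wait > n ? n : wait;
  }

The current behaviour corresponds to calling this with m = 1 and i = 1.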
Re: wget crash
On 14 Feb 2002 at 16:02, Steven Enderle wrote: Sorry for not including any version information. This is version 1.8.1, which I am using. Sorry for not reading your bug report properly. I should have realised that this was a different bug to the hundreds (it seems!) of other reports about assertion failures in progress.c.
Re: wget crash
On 14 Feb 2002 at 10:41, Steven Enderle wrote: assertion percentage = 100 failed: file progress.c, line 552 zsh: abort (core dumped) wget -m -c --tries=0 ftp://ftp.scene.org/pub/music/artists/nutcase/mp3/timeofourlives.mp3 hope this helps in any way. Thanks for the report. That's a known bug in Wget 1.8 that is fixed in Wget 1.8.1.
Re: wget 1.8.x proxies
On 12 Feb 2002 at 12:30, Holger Pfaff wrote: I'm having trouble using wget 1.8.[01] over a (squid24-) proxy to mirror a ftp-directory: # setenv ftp_proxy http://139.21.68.25: # wget181 -r -np -l0 ftp://ftp.funet.fi/pub/Linux/mirrors/redhat/redhat/linux/updates --12:06:58-- ftp://ftp.funet.fi/pub/Linux/mirrors/redhat/redhat/linux/updates = `ftp.funet.fi/pub/Linux/mirrors/redhat/redhat/linux/updates' Connecting to 139.21.68.25:... connected. Proxy request sent, awaiting response... 200 OK Length: unspecified [text/html] [ = ] 3,665 3.50M/s 12:06:58 (3.50 MB/s) - `ftp.funet.fi/pub/Linux/mirrors/redhat/redhat/linux/updates' saved [3665] I've never tried wget through an http-based ftp proxy. Are there any clues in the file it wrote (presumably a html-format directory listing)? Are there any more clues if you use the -d (--debug) option?
Re: wget 1.8.x proxies
On 12 Feb 2002 at 7:54, Winston Smith wrote: # wget181 -r -np -l0 ftp://ftp.funet.fi/pub/Linux/mirrors/redhat/redhat/linux/updates ummm... looks like the -l0 might be limiting your recursion level to 0 levels No. '-l0' is the same as '-l inf'.
Re: KB or kB
On 8 Feb 2002 at 4:26, Fred Holmes wrote: At 02:54 AM 2/8/2002, Hrvoje Niksic wrote: Wget currently uses KB as abbreviation for kilobyte. In a Debian bug report someone suggested that kB should be used because it is more correct. The reporter however failed to cite the reference for this, and a search of the web has proven inconclusive. Well, certainly among physicists, the k for kilo = x1000 is lower case. Consult any style manual for writing articles in scholarly physics journals. Of course, computer folks do as they please. g Not just amongst physicists, k is the standard prefix for kilo, at least when kilo means 10^3 (=1000). Think km = kilometer (or kilometre), kg = kilogram (or kilogramme), etc. This does not really apply to computer usage where typically kilo has been overloaded to mean 2^10 (=1024) because it happens to be close enough to its more correct meaning. That's why K is often used to mean 2^10 to avoid confusion with k. (But as has been pointed out, this confusion persists for M, G, T, etc.) I'd suggest either leaving them alone or adopting the IEC standards that Henrik referred to, i.e. KiB = kibibyte = 2^10 bytes, MiB = mebibyte = 2^20 bytes, etc. Of course, that would likely produce asserts in progress.c ;-)
Re: @ sign in username
On 4 Feb 2002 at 15:21, Christian Busch wrote: Hello, i have a question. On a ftp-site that we need to mirror, our login is wget -cm ftp://christian.busch%40brainjunction.de:**xx**@esd.intraware.com/ as you see I tried to encode the @ as %40 as described in the manual. This does not work, is there any way to encode the @ in the username ? No, but does the following work? wget -cm -e [EMAIL PROTECTED] ftp://esd.intraware.com/ FYI, there was no need to forward your the message to [EMAIL PROTECTED] unless you were submitting a bug report. All the traffic sent to [EMAIL PROTECTED] ends up on the [EMAIL PROTECTED] list anyway.
HTTP/1.1 (was Re: timestamping content-length --ignore-length)
On 1 Feb 2002 at 8:17, Daniel Stenberg wrote: You may count this mail as advocating for HTTP 1.1 support, yes! ;-) I did write down some minimal requirements for HTTP/1.1 support on a scrap of paper recently. It's probably still buried under the more recent strata of crap on my desk somewhere! I know chunked encoding support was one of the requirements, but I can't remember any others I wrote down. It was probably an incomplete list anyway! HTTP/1.1 support would also allow gzip and deflate encodings etc. to be added as configurable options later. Once HTTP/1.1 support was working reliably, it ought to be made the default, with command-line or .wgetrc options to fall back to sending HTTP/1.0 requests.
Re: Downloading all files by http:
On 31 Jan 2002 at 9:25, Fred Holmes wrote: wget -N http://www.karenware.com/progs/*.* fails with a not found whether the filespec is * or *.* The * syntax works just fine with ftp Is there a syntax that will get all files with http? You could try wget -m -l 1 -n http://www.karenware.com/progs/ but it will only do what you want if the web server sends back a HTML-format directory listing (complete with links to each file), rather than some other document.
Re: timestamping content-length --ignore-length
On 31 Jan 2002 at 8:41, Bruce BrackBill wrote: The problem is, that my web pages are served up by php and the content length is not defined. So as the manual states I use --ignore-length. But when wget retrieves an image it slows right down, possibly because it is ignoring the content-length. Maybe an option to ignore the content length of certain file types ( say text/html ) would be an option for upcoming releases of wget. The problem is that wget uses persistent connections by default if the server supports them. As you are using --ignore-length, wget must wait to see whether more data will arrive while the connection is open. The persistent connection is closed by the server after a timeout - as far as it is concerned, it has already completed the request and is waiting for a new request to re-use the same connection. This timeout is what is causing the delays you are seeing. You can tell wget not to allow persistent connections using the --no-http-keep-alive option, which should speed things up in your case. By the way, have you tried it without the --ignore-length option to see if it works? Perhaps the manual ought to mention the undesirability of using --ignore-length with persistent connections.
Re: timestamping content-length --ignore-length
On 31 Jan 2002 at 9:48, Bruce BrackBill wrote: Thanks for your responce Ian. When I use it without --ignore-length option it appears that wget SOMETIMES ignores the last_modified_date OR wget says to itself ( hey, I see the file is older than the local copy, but hey, since the server isn't sending me a content_length i'm just going to download it again anyway :-). According the the manual ( as I read it ) wget should ALWAYS reget the file if it has an empty content length ( even though this is undesirable behavior ). Sorry I ignored the timestamping part of your question. My answer only addressed the delays you were getting. It depends on the SOMETIMES. Can you provide a sample debug output log (-d) and point out where you think wget is not behaving like you want it to? Also, you haven't mentioned which version of wget you are using yet. Wget should behave exactly the same for --ignore-length as it does when there is no Content-Length header, and as far as I can see from the source code, it does. If no Content-Length header was received, or it was ignored then only the timestamps are compared. Although the manual says that the Content-Length is used as an additional check, it fails to mention that that only applies when the Content-Length header exists and the --ignore-length option has not been used. 2) In the php scripts I send out last_modified_date 3) php does not send content_length ( and I don't do it either in the script ) In that case, Wget's timestamping retrieval decision is based solely on the Last-Modified header, regardless of whether you use --ignore-length or not. A debug log would help confirm this.
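[Editor's note: to spell out the decision described above, here is a simplified model only -- the real logic lives in http_loop() and has more cases; the function below is not Wget code.]

  #include <time.h>

  /* Re-fetch when the remote copy is newer, and use the size comparison
     only when a Content-Length was received and --ignore-length was not
     given (remote_size is -1 otherwise).  */
  static int
  should_refetch (time_t remote_mtime, time_t local_mtime,
                  long remote_size, long local_size)
  {
    if (remote_mtime > local_mtime)
      return 1;
    if (remote_size != -1 && remote_size != local_size)
      return 1;
    return 0;
  }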
Re: Bug report: 1) Small error 2) Improvement to Manual
On 17 Jan 2002 at 2:15, Hrvoje Niksic wrote: Michael Jennings [EMAIL PROTECTED] writes: WGet returns an error message when the .wgetrc file is terminated with an MS-DOS end-of-file mark (Control-Z). MS-DOS is the command-line language for all versions of Windows, so ignoring the end-of-file mark would make sense. Ouch, I never thought of that. Wget opens files in binary mode and handles the line termination manually -- but I never thought to handle ^Z. Why not just open the wgetrc file in text mode using fopen(name, "r") instead of "rb"? Does that introduce other problems? In the Windows C compilers I've tried (Microsoft and Borland ones), "r" causes the file to be opened in text mode by default (there are ways to override that at compile time and/or run time), and this causes the ^Z to be treated as an EOF (there might be ways to override that too).
Re: Bug report: 1) Small error 2) Improvement to Manual
On 21 Jan 2002 at 14:56, Thomas Lussnig wrote: Why not just open the wgetrc file in text mode using fopen(name, "r") instead of "rb"? Does that introduce other problems? I think it has to do with comments, because the definition is that everything from '#' to the end of the line is ignored, and a line ends with '\n' or the end of the file, not with a special character like '\0'; that means to me that aborting the reading of a text file when a zero is found would be incorrect parsing. (N.B. the control-Z character would be '\032', not '\0'.) So maybe just mention in the documentation that the wgetrc file is considered to be a plain text file, whatever that means for the system Wget is running on. Maybe mention peculiarities of DOS/Windows, etc. In general, it is more portable to read or write native text files in text mode as it performs whatever local conversions are necessary to make reads and writes of text files appear like UNIX (i.e. each line of text terminated by a newline '\n'). In binary mode, what you get depends on the system (Mac text files have lines terminated by carriage return ('\r') for example, and some systems (VMS?) don't even have line termination characters as such.) In the case of Wget, log files are already written in text mode. I think wgetrc needs to be read in text mode and that's an easy change. In the case of the --input-file option, ideally the input file should be read in text mode unless the --force-html option is used, in which case it should be read in the same mode as when parsing other locally-stored HTML files. Wget stores retrieved files in binary mode but the mode used when reading those locally-stored files is less precise (not that it makes much difference for UNIX). It uses open() (not fopen()) and read() to read those files into memory (or uses mmap() to map them into memory space if supported). The DOS/Windows version of open() allows you to specify text or binary mode, defaulting to text mode, so it looks like the Windows version of Wget saves html files in binary mode and reads them back in in text mode! Well whatever - the HTML parser still seems to work okay on Windows, probably because HTML isn't that fussy about line-endings anyway! So to support --input-file portably (not the --force-html version), the get_urls_file() function in url.c should probably call a new function read_file_text() (or read_text_file()) instead of read_file() as it does at the moment. For UNIX-type systems, that could just fall back to calling read_file(). The local HTML file parsing stuff should probably be left well alone, but possibly add some #ifdef code for Windows to open the file in binary mode, though there may be differences between compilers for that.
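[Editor's note: to make the text-mode point concrete, a small standalone sketch of the DOS/Windows C runtime behaviour described above; the function is hypothetical and not part of Wget.]

  #include <stdio.h>

  /* Count the lines of a configuration file.  Opened with "r" (text mode
     on DOS/Windows), the runtime translates CRLF to '\n' and treats a
     trailing ^Z (0x1A) as end-of-file, so the caller never sees the stray
     control character; opened with "rb" it would.  */
  static int
  count_lines (const char *name)
  {
    FILE *fp = fopen (name, "r");
    int c, lines = 0;
    if (!fp)
      return -1;
    while ((c = getc (fp)) != EOF)
      if (c == '\n')
        ++lines;
    fclose (fp);
    return lines;
  }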
Re: Passwords and cookies
On 17 Jan 2002 at 18:17, Hrvoje Niksic wrote: Ian Abbott [EMAIL PROTECTED] writes: I'm also a little worried about the (time_t *)&cookie->expiry_time cast, as cookie->expiry_time is of type unsigned long. Is a time_t guaranteed to be the same size as an unsigned long? It's not, but I have a hard time imagining an architecture where time_t will be *larger* than unsigned long. I received an email from Csaba Ráduly which I hope he won't mind me quoting here: On 17 Jan 2002 at 12:45, [EMAIL PROTECTED] wrote: Very few may care, but IBM's C/C++ compilers v 3.6.5 typedef time_t as ... double ! Shouldn't cookie->expiry_time be declared as time_t ?
Re: Passwords and cookies
On 16 Jan 2002 at 17:50, Hrvoje Niksic wrote: Wget's strptime implementation comes from an older version of glibc. Perhaps we should simply sync it with the latest one from glibc, which is obviously capable of handling it? That sounds like a good plan.
Re: Passwords and cookies
On 16 Jan 2002 at 17:45, Hrvoje Niksic wrote: Aside from google, ~0UL is Wget's default value for the expiry time, meaning the cookie is non-permanent and valid throughout the session. Since Wget sets the value, Wget should be able to print it in DEBUG mode. Do you think this patch would fix the printing problem:

Index: src/cookies.c
===
RCS file: /pack/anoncvs/wget/src/cookies.c,v
retrieving revision 1.18
diff -u -r1.18 cookies.c
--- src/cookies.c 2001/12/10 02:29:11 1.18
+++ src/cookies.c 2002/01/16 16:43:21
@@ -241,7 +241,9 @@
           cookie->domain, cookie->port, cookie->path,
           cookie->permanent ? "permanent" : "nonpermanent",
           cookie->secure,
-          asctime (localtime ((time_t *)&cookie->expiry_time)),
+          (cookie->expiry_time != ~0UL ?
+           asctime (localtime ((time_t *)&cookie->expiry_time))
+           : "UNKNOWN"),
           cookie->attr, cookie->value));
 }

Yes, except for any other values of cookie->expiry_time that would cause localtime() to return a NULL pointer (in the case of Windows, anything before 1970). Perhaps the return value of localtime() should be checked before passing it to asctime() as in the modified version of your patch I have attached below. I'm also a little worried about the (time_t *)&cookie->expiry_time cast, as cookie->expiry_time is of type unsigned long. Is a time_t guaranteed to be the same size as an unsigned long?

Index: src/cookies.c
===
RCS file: /pack/anoncvs/wget/src/cookies.c,v
retrieving revision 1.18
diff -u -r1.18 cookies.c
--- src/cookies.c 2001/12/10 02:29:11 1.18
+++ src/cookies.c 2002/01/17 11:29:00
@@ -184,6 +184,9 @@
   struct cookie *chain_head;
   char *hostport;
   char *chain_key;
+#ifdef DEBUG
+  struct tm *local_expiry;
+#endif

   if (!cookies_hash_table)
     /* If the hash table is not initialized, do so now, because we'll
@@ -241,7 +244,10 @@
           cookie->domain, cookie->port, cookie->path,
           cookie->permanent ? "permanent" : "nonpermanent",
           cookie->secure,
-          asctime (localtime ((time_t *)&cookie->expiry_time)),
+          (cookie->expiry_time != ~0UL
+           && NULL != (local_expiry = localtime ((time_t *)&cookie->expiry_time))
+           ? asctime (local_expiry)
+           : "UNKNOWN"),
           cookie->attr, cookie->value));
 }
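[Editor's note: for illustration, a standalone fragment showing the failure mode being guarded against; this is only a sketch of the Windows CRT behaviour described above, not Wget code.]

  #include <stdio.h>
  #include <time.h>

  int
  main (void)
  {
    time_t t = (time_t) -1;           /* a pre-1970 value on Windows */
    struct tm *tm = localtime (&t);   /* Windows CRTs return NULL here */
    if (tm != NULL)
      fputs (asctime (tm), stdout);
    else
      puts ("UNKNOWN");               /* never pass NULL on to asctime() */
    return 0;
  }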
RE: Mapping URLs to filenames
On 16 Jan 2002 at 8:02, David Robinson (AU) wrote: In the meantime, however, '?' is problematic for Win32 users. It stops WGET from working properly whenever it is found within a URL. Can we fix it please. My proposal for using escape sequences in filenames for problem characters is up for discussion at the moment, but I'm not sure if they really need to be reversible (except that it helps to reduce the chances of different URLs being saved to the same filename). Would it be sufficient to map all illegal characters to '@'? For Windows, the code already changes '%' to '@' and it could just as easily change '*', '?', etc. to '@' as well.
Re: Passwords and cookies
On 15 Jan 2002 at 14:48, Brent Morgan wrote: Thanks to everyone for looking at this problem. I am not a developer and at my wits end with this problem. I did determine with a different cookie required site that it is still not working. Could you change line 1017 of cmpt.c to read as follows: get_number (0, 2038); (i.e. change 2036 to 2038). Then recompile. That might be enough to stop the wget from crashing with the -d option. If debugging now works, can you supply some debug log output for your Set-Cookie problem? I will keep my eye for future windows compilations and keep trying. That relies on having decent information to debug the problem.
A strange bit of HTML
I came across this extract from a table on a website: <td ALIGN=CENTER VALIGN=CENTER WIDTH=120 HEIGHT=120><a href="66B27885.htm" "msover1('Pic1','thumbnails/MO66B27885.jpg');" onMouseOut="msout1('Pic1','thumbnails/66B27885.jpg');"><img SRC="thumbnails/66B27885.jpg" NAME="Pic1" BORDER=0></a></td> Note the string beginning "msover1(", which seems to be an attribute value without a name, so that makes it illegal HTML. I haven't traced what Wget is actually doing when it encounters this, but it doesn't treat 66B27885.htm as a URL to be downloaded. I can't call this a bug, but is Wget doing the right thing by ignoring the href altogether?
Re: Passwords and cookies
On 15 Jan 2002 at 0:27, Hrvoje Niksic wrote: Brent Morgan [EMAIL PROTECTED] writes: The -d debug option crashes wget just after it reads the input file. Huh? Ouch! Wget on Windows is much less stable than I imagined. Can you run it under a debugger and see what causes the crash? I had a go at building wget 1.8.1 myself on Windows 2000 with VC 6.0 and also got the crash when using the -d option, so I upgraded to VC 6.0 SP2 and it did the same thing. I've narrowed it down to the following line in cookies.c: asctime (localtime ((time_t *)&cookie->expiry_time)), which is part of a DEBUGP macro call from function store_cookie. Specifically, it was failing on the asctime call, rather than the localtime call, but that's as far as I got. A casual glance at the C runtime library source supplied with the compiler revealed no obvious problem, but I'll try and investigate this problem a bit more.
Mapping URLs to filenames
This is an initial proposal for naming the files and directories that Wget creates, based on the URLs of the retrieved documents. At the moment there are many complaints about Wget failing to save documents which have '?' in their URLs when running under Windows, for example. In general, the set of illegal characters in file-names depends on the operating system and the file-system in use. Wget can be compiled for different operating systems, but doesn't know which file-system is being used - you may get the oddball who wants to save files to a vfat file-system from Linux for example! Therefore, there should be some way to override or augment the set of illegal filename characters using a wgetrc command, for example. File-names used within the internals of Wget need to be converted to an external form which deals with illegal characters or illegal sequences of characters in the file-name. The internal filename consists of directory separators ('/'), illegal characters, a nominated 'escape' character and other (legal) characters. Illegal characters in the internal file-name can be mapped to an escape sequence in the external file-name, consisting of the escape character followed by two hex digits (it is assumed that both the escape character and the hex digits are legal file-name characters for the operating system and file-system in use!). Escape characters in the internal file-name can be mapped to an escape sequence in the same way. The directory separator character ('/') in the internal file-name is usually mapped to the directory hierarchy on the file-system, but if the internal file-name contains two or more consecutive directory separator characters, some of these will need to be escaped to avoid trying to create directories with null names. (An alternate solution is to create a directory whose name consists solely of a single escape character.) The external file-names are easily reversible back to the internal form when necessary. The obvious candidate for the escape character is the '%' character, although the escape mechanism for file-names is logically distinct from the escape mechanism for HTTP. The current version of Wget for Windows remaps all '%' characters to '@', so perhaps '@' is a better candidate for the escape character for Windows. (I'm not sure why Wget does this, as '%' seems to be a legal file-name character for Windows and MS-DOS. Perhaps it is for usability reasons due to the command shell's variable interpolation of '%name%' sequences.) The escape character can be made operating system dependent, and perhaps could be overridden with a wgetrc command. That's my initial proposal anyway. I'm not sure about things such as how UTF-8 should be handled, or if that's an issue at all.
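[Editor's note: a minimal sketch of the escaping proposal above. The escape character, the illegal-character set and the function name are all assumptions for illustration; this is not existing Wget code, and it ignores the directory-separator handling discussed in the proposal.]

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Map each illegal character (and the escape character itself) to the
     escape character followed by two hex digits.  */
  static char *
  escape_filename (const char *in, const char *illegal, char esc)
  {
    /* Worst case: every input character expands to three bytes.  */
    char *out = malloc (3 * strlen (in) + 1);
    char *p = out;
    if (!out)
      return NULL;
    for (; *in; ++in)
      {
        if (*in == esc || strchr (illegal, *in))
          p += sprintf (p, "%c%02X", esc, (unsigned char) *in);
        else
          *p++ = *in;
      }
    *p = '\0';
    return out;
  }

For example, escape_filename ("foo?bar*baz", "?*", '@') would produce "foo@3Fbar@2Abaz", and the mapping is reversible because the escape character never appears unescaped in the output.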
Re: 2 Gb limitation
On 10 Jan 2002 at 17:09, Matt Butt wrote: I've just tried to download a 3Gb+ file (over a network using HTTP) with WGet and it died at exactly 2Gb. Can this limitation be removed? In principle, changes could be made to allow wget to be configured for large file support, by using the appropriate data types (i.e. 'off_t' instead of 'long'). The logging code would be more complicated as there is no portable way to handle the data type in a printf-style function, so these would have to be converted to strings by a bespoke routine and the converted strings passed to the printf-style function. This would also slow down the operation of wget a little bit. A version of wget configured for large file support would also be slower in general than a version not configured for large file support - at least on a 32-bit machine. Large file support should probably be added to the TODO list at least. Quite a few people use wget to download .iso images of CD-ROMs at the moment; in the future, those same people are likely to want to use wget to download DVD-ROM images!
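[Editor's note: a rough illustration of the kind of bespoke conversion routine mentioned above. The function name is made up and Wget has no such routine; the sketch assumes non-negative values only.]

  #include <sys/types.h>

  /* Format the non-negative large-file offset N into BUF (of size
     BUFSIZE) and return a pointer to the first digit, avoiding any
     printf-style length modifier for the off_t type.  */
  static char *
  off_t_to_string (off_t n, char *buf, int bufsize)
  {
    char *p = buf + bufsize;
    *--p = '\0';
    do
      {
        *--p = '0' + (int) (n % 10);
        n /= 10;
      }
    while (n > 0 && p > buf);
    return p;
  }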
Re: Using -pk, getting wrong behavior for frameset pages...Suggestions?
On 11 Jan 2002 at 10:51, Picot Chappell wrote: Thanks for your response. I tried the same command, using your URL, and it worked fine. So I took a look at the site I was retrieving for the failed test. It's a ssl site (didn't think about it before) and I noticed 2 things. The Frame source pages were not downloaded (they were for www.mev.co.uk) and the links were converted to full URLs. ie. FRAME src=menulayer.cgi. became FRAME src=https://www.someframed.page/menulayer.cgi; ... So the content was still reachable, but not really local (this is the original problem). I tried it without the --convert-links, and the frame source remained defined as menulayer.cgi but menulayer.cgi was not downloaded. Do you think this might be an issue with framesets and ssl sites? or an issue with framesets and cgi source files? Do you have SSL support compiled in? Also it is possible that the .cgi script on the server is checking HTTP request headers and cookies, doesn't like what it sees and is returning an error. It is sometimes useful to lie to the server about the HTTP user agent using the -U option, e.g.: -U Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0) or include something similar in the wgetrc file: useragent = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0) Some log entries would be useful, particularly with the -d option. You can mask any sensitive bits of the log if you want.
Re: Simplest logfile ?
On 8 Jan 2002 at 20:31, Mike wrote: What I'm looking for is something like the way FTP_Lite operates, Can I nominate a single log file in the wgetrc for use by all the wget processes that spawn off from my bash ? There is the -a FILE (--append-output=FILE) option to append to a logfile. A combination of -b, -nv and -a FILE should do more or less what you want. It may be possible for the log file to become mangled if more than one wget process writes to the log file at the same time, but I don't think that will be a problem with -nv logging unless the lines are very long (due to downloading from a humungous URL, say).
Re: Asseertion failed in wget
On 7 Jan 2002 at 11:52, Jan Starzynski wrote: for GNU Wget 1.8 I get the following assertion failed message: get: progress.c:673: create_image: Zusicherung »p - bp->buffer <= bp->width« nicht erfüllt. (snip) In the changelogs of 1.8.1 I could not find a hint that this has been fixed until now. It _has_ been fixed in 1.8.1, but the ChangeLog entry only mentions the bug that was fixed, not its symptoms. FWIW, here is the ChangeLog entry (complete with typo:-) 2001-12-09 Hrvoje Niksic [EMAIL PROTECTED] * progress.c (create_image): Fix ETA padding when hours are prined.
Re: wget does not treat urls starting with // correctly
On 4 Jan 2002 at 12:22, Bastiaan Stougie wrote: wget -P $LOCALDIR -m -np -nH -p --cut-dirs=2 http://host/dir1/dir2/ This works fine, except that wget does not follow all the urls. It skips urls like: <A HREF="//host/dir1/dir2/file">text</A> Here is a proposed patch to fix that. src/ChangeLog entry:

2002-01-07 Ian Abbott [EMAIL PROTECTED]

  * url.c (uri_merge_1): Deal with net path relative URL (one that starts with //).

And the actual patch:

Index: src/url.c
===
RCS file: /pack/anoncvs/wget/src/url.c,v
retrieving revision 1.67
diff -u -r1.67 url.c
--- src/url.c 2001/12/14 15:45:59 1.67
+++ src/url.c 2002/01/07 15:30:41
@@ -1575,6 +1575,35 @@
       memcpy (constr + baselength, link, linklength);
       constr[baselength + linklength] = '\0';
     }
+  else if (linklength > 1 && *link == '/' && *(link + 1) == '/')
+    {
+      /* LINK begins with "//" and so is a net path: we need to
+         replace everything after (and including) the double slash
+         with LINK.
+
+         So, if BASE is "http://oldhost/whatever/foo/bar", and LINK
+         is "//newhost/qux/xyzzy", our result should be
+         "http://newhost/qux/xyzzy".  */
+      int span;
+      const char *slash;
+      const char *start_insert;
+      /* Look for first slash. */
+      slash = memchr (base, '/', end - base);
+      /* If found slash and it is a double slash, then replace
+         from this point,
+         else default to replacing from the beginning. */
+      if (slash && *(slash + 1) == '/')
+        start_insert = slash;
+      else
+        start_insert = base;
+
+      span = start_insert - base;
+      constr = (char *)xmalloc (span + linklength + 1);
+      if (span)
+        memcpy (constr, base, span);
+      memcpy (constr + span, link, linklength);
+      constr[span + linklength] = '\0';
+    }
   else if (*link == '/')
     {
       /* LINK is an absolute path: we need to replace everything
Re: [no subject]
On 3 Jan 2002 at 13:58, Henric Blomgren wrote: Wget-bug: GNU Wget 1.8 [...] [root@MAGI .temporary]# wget: progress.c:673: create_image: Assertion `p - bp->buffer <= bp->width' failed. Please use Wget 1.8.1. That bug has already been fixed!
Re: Wget 1.8.1-pre2 Problem with -i, -r and -l
On 18 Dec 2001 at 23:13, Hrvoje Niksic wrote: Ian Abbott [EMAIL PROTECTED] writes: If I have a website http://somesite/ with three files on it: index.html, a.html and b.html, such that index.html links only to a.html and a.html links only to b.html then the following command will retrieve all three files: wget -r -l 1 http://somesite/index.html http://somesite/a.html Does it? For me this command retrieves only `index.html' and `a.html', and that's a bug. `-i list' makes no different. Well that's how it behaved for me, but actually I was using pre2+cvs (src/CVS/Entries at [1]). Another difference was that when the URLs were specified on the command-line, a.html was downloaded twice. I repeated the test with make distclean, ./configure --with-ssl, make and it behaved the same. With your latest CVS updates (see [2]) the -i option now behaves correctly - i.e. it downloads all three files. However, the command which specified index.html and a.html on the command-line still downloads a.html twice. [1] Here is the src/CVS/Entries I used for the behavior I originally observed: /alloca.c/1.1.1.1/Thu Dec 2 07:42:27 1999// /ansi2knr.c/1.1.1.1/Thu Dec 2 07:42:26 1999// /fnmatch.c/1.2/Sun May 27 19:34:56 2001// /getopt.c/1.1.1.1/Thu Dec 2 07:42:26 1999// /getopt.h/1.1.1.1/Thu Dec 2 07:42:26 1999// /init.h/1.2/Sun May 27 19:35:04 2001// /rbuf.h/1.4/Sun May 27 19:35:09 2001// /safe-ctype.c/1.1/Fri Mar 30 22:36:59 2001// /safe-ctype.h/1.2/Fri Apr 27 05:03:08 2001// D/ChangeLog-branches /gnu-md5.c/1.1/Sun Nov 18 04:36:20 2001// /gnu-md5.h/1.1/Sun Nov 18 04:36:20 2001// /hash.c/1.14/Tue Nov 20 11:47:32 2001// /headers.c/1.6/Tue Nov 20 11:47:32 2001// /html-parse.c/1.9/Tue Nov 20 11:47:32 2001// /ftp.h/1.11/Thu Nov 22 10:36:15 2001// /Makefile.in/1.17/Mon Nov 26 10:46:17 2001// /recur.h/1.4/Mon Nov 26 10:46:21 2001// /retr.h/1.10/Mon Nov 26 18:11:36 2001// /connect.c/1.11/Tue Nov 27 15:37:00 2001// /mswindows.h/1.5/Thu Nov 29 15:57:38 2001// /connect.h/1.6/Fri Nov 30 10:28:03 2001// /cookies.h/1.4/Fri Nov 30 10:28:03 2001// /fnmatch.h/1.3/Fri Nov 30 10:28:03 2001// /ftp-opie.c/1.6/Fri Nov 30 10:28:03 2001// /gen-md5.c/1.2/Thu Nov 29 18:48:42 2001// /gen-md5.h/1.2/Thu Nov 29 18:55:52 2001// /hash.h/1.5/Fri Nov 30 10:28:03 2001// /headers.h/1.4/Fri Nov 30 10:28:03 2001// /html-parse.h/1.3/Fri Nov 30 10:28:03 2001// /netrc.c/1.10/Fri Nov 30 10:28:03 2001// /netrc.h/1.3/Fri Nov 30 10:28:03 2001// /options.h/1.24/Fri Nov 30 10:28:03 2001// /res.h/1.3/Fri Nov 30 10:28:04 2001// /cmpt.c/1.10/Fri Nov 30 13:11:47 2001// /sysdep.h/1.19/Fri Nov 30 10:28:05 2001// /ftp.c/1.52/Mon Dec 3 19:13:15 2001// /rbuf.c/1.6/Wed Dec 5 11:16:05 2001// /gen_sslfunc.h/1.6/Thu Dec 6 10:23:13 2001// /progress.h/1.4/Thu Dec 6 10:23:14 2001// /snprintf.c/1.6/Wed Dec 5 11:16:09 2001// /url.h/1.22/Thu Dec 6 10:23:14 2001// /config.h.in/1.20/Mon Dec 10 11:30:41 2001// /cookies.c/1.18/Mon Dec 10 11:30:41 2001// /ftp-basic.c/1.15/Mon Dec 10 11:30:41 2001// /log.c/1.11/Mon Dec 10 11:30:42 2001// /main.c/1.68/Mon Dec 10 11:30:42 2001// /mswindows.c/1.8/Mon Dec 10 11:30:42 2001// /progress.c/1.23/Mon Dec 10 11:30:42 2001// /wget.h/1.31/Mon Dec 10 11:30:43 2001// /ftp-ls.c/1.22/Tue Dec 11 11:37:06 2001// /host.c/1.32/Tue Dec 11 11:37:06 2001// /host.h/1.8/Tue Dec 11 11:37:06 2001// /html-url.c/1.22/Thu Dec 13 10:47:32 2001// /res.c/1.6/Thu Dec 13 10:47:32 2001// /init.c/1.44/Mon Dec 17 10:52:57 2001// /url.c/1.67/Mon Dec 17 10:52:57 2001// /gen_sslfunc.c/1.15/Tue Dec 18 11:32:42 2001// /http.c/1.82/Mon Dec 17 19:56:57 2001// /retr.c/1.49/Tue Dec 
18 11:32:42 2001// /utils.c/1.43/Tue Dec 18 11:32:42 2001// /utils.h/1.17/Tue Dec 18 11:32:42 2001// /version.c/1.26/Tue Dec 18 11:32:42 2001// /ChangeLog/1.333/Tue Dec 18 18:59:29 2001// /recur.c/1.38/Result of merge// N.B. Although the entry for recur.c says Result of merge, it is in fact identical to -r1.38 on the server. [2] Here is the updated src/CVS/Entries after your fixes: /alloca.c/1.1.1.1/Thu Dec 2 07:42:27 1999// /ansi2knr.c/1.1.1.1/Thu Dec 2 07:42:26 1999// /fnmatch.c/1.2/Sun May 27 19:34:56 2001// /getopt.c/1.1.1.1/Thu Dec 2 07:42:26 1999// /getopt.h/1.1.1.1/Thu Dec 2 07:42:26 1999// /init.h/1.2/Sun May 27 19:35:04 2001// /rbuf.h/1.4/Sun May 27 19:35:09 2001// /safe-ctype.c/1.1/Fri Mar 30 22:36:59 2001// /safe-ctype.h/1.2/Fri Apr 27 05:03:08 2001// D/ChangeLog-branches /gnu-md5.c/1.1/Sun Nov 18 04:36:20 2001// /gnu-md5.h/1.1/Sun Nov 18 04:36:20 2001// /hash.c/1.14/Tue Nov 20 11:47:32 2001// /headers.c/1.6/Tue Nov 20 11:47:32 2001// /ftp.h/1.11/Thu Nov 22 10:36:15 2001// /Makefile.in/1.17/Mon Nov 26 10:46:17 2001// /recur.h/1.4/Mon Nov 26 10:46:21 2001// /retr.h/1.10/Mon Nov 26 18:11:36 2001// /connect.c/1.11/Tue Nov 27 15:37:00 2001// /mswindows.h/1.5/Thu Nov 29 15:57:38 2001// /connect.h/1.6/Fri Nov 30 10:28:03 2001// /cookies.h/1.4/Fri Nov 30 10:28:03 2001// /fnmatch.h/1.3/Fri Nov 30 10:28:03 2001// /ftp-opie.c/1.6/Fri Nov 30 10:28:03 2001
Re: Error while compiling Wget 1.8.1-pre2+cvs.
On 19 Dec 2001 at 17:40, Alexey Aphanasyev wrote: Hrvoje Niksic wrote: The `gnu-md5.o' object is missing. Can you show us the output from `configure'? Yes, sure. Please find it attached bellow. Have you tried running make distclean before ./configure? It is possible that some of your cached configuration results have become stale.
Wget 1.8+CVS not passing referer for recursive retrieval
Although retrieve_tree() stores and retrieves referring URLs in the URL queue, it does not pass them to retrieve_url(). This seems to have got lost during the transition from depth-first to breadth-first retrieval. This means that HTTP requests for URLs being retrieved at depth greater than 0 have the Referer set to that set by the --referer option or nothing at all, and not necessarily the URL of the referring page. src/ChangeLog entry:

2001-12-18 Ian Abbott [EMAIL PROTECTED]

  * recur.c (retrieve_tree): Pass on referring URL when retrieving recursed URL.

Index: src/recur.c
===
RCS file: /pack/anoncvs/wget/src/recur.c,v
retrieving revision 1.37
diff -u -r1.37 recur.c
--- src/recur.c 2001/12/13 19:18:31 1.37
+++ src/recur.c 2001/12/18 13:28:58
@@ -237,7 +237,7 @@
           int oldrec = opt.recursive;
           opt.recursive = 0;
-          status = retrieve_url (url, &file, &redirected, NULL, &dt);
+          status = retrieve_url (url, &file, &redirected, referer, &dt);
           opt.recursive = oldrec;
           if (file && status == RETROK
Wget 1.8.1-pre2 Problem with -i, -r and -l
I don't have time to look at this problem today, but I thought I'd mention it now to defer the 1.8.1 release. If I have a website http://somesite/ with three files on it: index.html, a.html and b.html, such that index.html links only to a.html and a.html links only to b.html then the following command will retrieve all three files: wget -r -l 1 http://somesite/index.html http://somesite/a.html However, if I then create a file 'list' containing the lines: http://somesite/index.html http://somesite/a.html and issue the command: wget -r -l 1 -i list then only index.html and a.html are retrieved. I think wget should also retrieve b.html, which is linked to by a.html, i.e. treat the URLs in the file as though they were specified on the command line.
Re: A small bug
On 14 Dec 2001 at 14:49, Peng GUAN wrote: Maybe a bug in file fnmatch.c, line 54: ( n==string || (flags & FNM_PATHNAME) && n[-1] == '/')) the n[-1] should be change to *(n-1). I like the easy ones. Those are equivalent in C. As to which of the two looks the nicest is a matter of aesthetics and also depends on the style of the surrounding source code. At least both of the above look nicer than (-1)[n] which is also equivalent to the above, but its usage is reserved for obfuscated C coding competitions!
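[Editor's note: a tiny standalone illustration of that equivalence, purely for the record.]

  #include <assert.h>

  int
  main (void)
  {
    const char *s = "abc";
    const char *n = s + 1;
    /* In C, x[i] is defined as *(x + i), so all three expressions below
       read the same character ('a').  */
    assert (n[-1] == *(n - 1));
    assert (n[-1] == (-1)[n]);
    return 0;
  }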
Re: Is wget --timestamping URL working on Windows 2000?
On 11 Dec 2001 at 18:40, [EMAIL PROTECTED] wrote: It seems to me that if an output_document is specified, it is being clobbered at the very beginning (unless always_rest is true). Later in http_loop stat() comes up with zero length. Hence there's always a size mismatch when --output-document is specified. That doesn't sound good to me... But it's as documented in the man page. The option is meant for concatenating several pages into one big file, and you can't meaningfully compare timestamps or file sizes in that case.
Re: log errors
On 11 Dec 2001 at 16:09, Hrvoje Niksic wrote: Summer Breeze [EMAIL PROTECTED] writes: Here is a sample entry: 66.28.29.44 - - [08/Dec/2001:18:21:20 -0500] "GET /index4.html%0A HTTP/1.0" 403 280 "-" "Wget/1.6" /index4.html%0A looks like a page is trying to link to /index4.html, but the link contains a trailing newline. If that is the case, you may be able to track down the referring page if that is also logged. Another possibility is that someone is running a (UNIX) command like this:

$ wget 'http://motherbird.com/index4.html
> '

(The '$' and '>' in the above are just shell prompts, not part of the command.) I just tried that myself and saw that Wget was trying to retrieve "http://motherbird.com/index4.html%0A" as in your log file and got an ERROR 403: Forbidden back.
Re: Make -p work with framed pages.
On 1 Dec 2001 at 4:04, Hrvoje Niksic wrote: As a TODO entry summed up: * -p should probably go _two_ more hops on FRAMESET pages. More generally, I think it probably needs to be made to work for nested framesets too.
Re: windows patch and problem
On 29 Nov 2001 at 12:48, Herold Heiko wrote: --12:27:26-- http://www.cnn.com/ (try: 3) = `www.cnn.com/index.html' Found www.cnn.com in host_name_addresses_map (008D01B0) Releasing 008D01B0 (new refcount 1). Retrying. (ecc.) Same with other hosts Could somebody please confirm if this is a problem with my build ? No, it happens on my Linux build too. Something broke.
Re: wget1.7.1: Compilation Error (please Cc'ed to me :-)
On 29 Nov 2001 at 13:14, Daniel Stenberg wrote: On Thu, 29 Nov 2001, Maciej W. Rozycki wrote: On Wed, 28 Nov 2001, Ian Abbott wrote: However, the Linux man page for bcopy(3) do not say the strings can overlap Presumably the man page is incorrect Yes, I think so. Well, can we actually guarantee that bcopy() will work on all platforms where memmove() is not present? HAVE_BCOPY? I wouldn't be so bold to say that. I'd vote for a separate implemenation. But that's just me. That's the easiest thing to do. It's only used at the moment for removing duplicate outgoing cookies. I don't know how often you get duplicate cookies, and the current mechanism for removing them isn't all that efficient when there are multiple duplicates to be removed anyway!
Re: wget1.7.1: Compilation Error (please Cc'ed to me :-)
On 29 Nov 2001 at 14:40, Hrvoje Niksic wrote: Ian, can you clarify what you meant by BSD man pages? Which BSD? NetBSD: http://www.tac.eu.org/cgi-bin/man-cgi?bcopy+3 OpenBSD: http://www.openbsd.org/cgi-bin/man.cgi?query=bcopy&sektion=3 FreeBSD: http://www.freebsd.org/cgi/man.cgi?query=bcopy&sektion=3 Those are all pretty much identical and say that the strings can overlap and that a bcopy function appeared in BSD4.2. SunOS 4.1.3: http://www.freebsd.org/cgi/man.cgi?query=bcopy&sektion=3&manpath=SunOS+4.1.3 That one aliases SunOS 4.1.3's bstring(3) man page which describes a group of related functions (including bcopy). It also says the strings can overlap.
Re: wget1.7.1: Compilation Error (please Cc'ed to me :-)
On 28 Nov 2001 at 18:08, Hrvoje Niksic wrote: Daniel Stenberg [EMAIL PROTECTED] writes: On Wed, 28 Nov 2001, zefiro wrote: ld: Undefined symbol _memmove Do you have any suggestion ? SunOS 4 is known to not have memmove. May I suggest adding the following (or similiar) to a relevant wget source file: [...] Thanks for the suggestion and the code example. Two points, though: * Isn't it weird that the undefined symbol is _memmove, not memmove? It looks as if a header file is translating the symbol, thinking that _memmove exists. Not really. UNIX C compilers of old prefix C external symbols with '_'. GCC doesn't do that unless targetted for a system that uses the prefix in its standard system library symbols. * As a BSD offshoot, SunOS almost certainly has bcopy. Could we make use of it? I seem to remember reading that BSD bcopy is supposed to handle overlapping blocks, but I cannot find a confirmation right now. If that were the case, we could simply use this: #ifndef HAVE_MEMMOVE # define memmove(to, from, len) bcopy(from, to, len) #endif That ought to work as the SunOS and BSD man pages say that the strings can overlap. However, the Linux man page for bcopy(3) do not say the strings can overlap and in fact suggest that it be replaced with memcpy in new programs! Linux has memmove so that does not matter, but perhaps rolling our own memmove as Daniel suggested would be the safest option. Another difference between bcopy() and memmove() is that bcopy() returns void whereas memmove returns a pointer, but in the one place in the Wget source where memmove() is called, the return value is not used.
Re: HAVE_RANDOM ?
On 27 Nov 2001, at 15:16, Hrvoje Niksic wrote: So, does anyone know about the portability of rand()? It's in the ANSI/ISO C spec (ISO 9899). It's always been in UNIX (or at least it's been in there since UNIX 7th Edition), and I should think it's always been in the MS-DOS compilers, but I don't have one handy at the moment. It tends not to be very random in some implementations, but should be good enough to implement a random wait.
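[Editor's note: for what it's worth, a random wait only needs something along these lines, all of which is plain ISO C; an illustrative sketch only, not necessarily how Wget implements it.]

  #include <stdlib.h>
  #include <time.h>

  /* Seed the ISO C generator once at program start...  */
  static void
  init_random (void)
  {
    srand ((unsigned) time (NULL));
  }

  /* ...then derive waits of 0 to 2 * base_wait seconds from rand().  */
  static double
  random_wait (double base_wait)
  {
    return 2.0 * base_wait * ((double) rand () / RAND_MAX);
  }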
wget-1.8-dev Segmentation fault when retrieving from file
I got a segmentation fault when retrieving URLs from a file.

2001-11-27 Ian Abbott [EMAIL PROTECTED]

  * retr.c (retrieve_from_file): Initialize `new_file' to NULL to prevent seg fault.

Index: src/retr.c
===
RCS file: /pack/anoncvs/wget/src/retr.c,v
retrieving revision 1.41
diff -u -r1.41 retr.c
--- src/retr.c 2001/11/26 20:07:13 1.41
+++ src/retr.c 2001/11/27 18:31:12
@@ -538,7 +538,7 @@
   for (cur_url = url_list; cur_url; cur_url = cur_url->next, ++*count)
     {
-      char *filename = NULL, *new_file;
+      char *filename = NULL, *new_file = NULL;
       int dt;

       if (cur_url->ignore_when_downloading)
Re: Does the -Q quota command line argument work?
On 27 Nov 2001 at 13:07, John Masinter wrote: It seems that wget will download an entire large file regardless of what I specify for the quota. For example I am trying to download only the first 100K of a 800K file. I specify this: wget -Q 100K http://url-goes-here It then proceeds to download the entire 800K file. I've also tried using the --quota=100K form as well as -Q 10 and nothing seems to work. Did I misinterpret the purpose of this argument? Yes, as it says in the manual, the quota will never affect downloading a single file. You may wish to try out the new --range option in wget 1.8-dev (available via anonymous CVS), or wait until wget 1.8 comes out.