Re: Wget 1.8.2 is released
On Thu, 30 May 2002 03:43:06 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: Ian Abbott [EMAIL PROTECTED] writes: This is a bit late, Sorry it didn't make it in. I guess we could publish it on the web site, so that people who wish to compile 1.8.2 with Borland C++ can do so. Heiko's Wget on Windows page is another good place to link to this patch. I'll clean it up a bit. +# ifdef NO_ANONYMOUS_STRUCT + wt->wintime.u.HighPart = ft.dwHighDateTime; + wt->wintime.u.LowPart = ft.dwLowDateTime; +# else wt->wintime.HighPart = ft.dwHighDateTime; wt->wintime.LowPart = ft.dwLowDateTime; +# endif Isn't anonymous struct a C++ feature? (I'm only guessing here.) Yes, but some C compilers support it as an extension. Would wt->wintime.u.HighPart work under both compilers? I'm just asking as someone who would like to see the number of #ifdefs decrease rather than increase. Microsoft only document the anonymous form in their Win32 SDK, which is why I'm hesitant to take it out altogether. However, the undocumented, non-anonymous u. form does seem to work uniformly, at least with the Microsoft, Borland and Watcom compilers I've tried.
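To see the difference in isolation, here is a small standalone sketch (the union below is just an illustrative stand-in for the Win32 ULARGE_INTEGER, not the SDK's own definition):

#include <stdio.h>

/* Illustrative stand-in for the ULARGE_INTEGER layout: a named union
   member `u', plus an anonymous struct where the compiler accepts that
   extension (define NO_ANONYMOUS_STRUCT for compilers that do not).  */
typedef union {
  struct { unsigned long LowPart; unsigned long HighPart; } u;
#ifndef NO_ANONYMOUS_STRUCT
  struct { unsigned long LowPart; unsigned long HighPart; };
#endif
} fake_ularge;

int
main (void)
{
  fake_ularge x;

  x.u.HighPart = 1;    /* the "u." form: works with every compiler tried */
  x.u.LowPart = 2;
#ifndef NO_ANONYMOUS_STRUCT
  x.HighPart = 1;      /* the documented, anonymous form */
  x.LowPart = 2;
#endif
  printf ("%lu %lu\n", x.u.HighPart, x.u.LowPart);
  return 0;
}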
Re: Wget 1.8.2 is released
On Wed, 29 May 2002 05:14:14 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: Wget 1.8.2, a bugfix release of Wget, has been released, and is now available from the GNU ftp site: ftp://ftp.gnu.org/pub/gnu/wget/wget-1.8.2.tar.gz This is a bit late, but here is a patch to compile it with Borland C++ 4.5 (compiler version 4.5.2). With a small change to the Makefile to select a different linker, it also compiles with Borland C++ 5.5 (compiler version 5.5.1). The Makefile to change is windows/Makefile.src.bor before running configure --borland, or alternatively change src/Makefile after running configure --borland. diff -ru wget-1.8.2/src/utils.c wget-1.8.2.new/src/utils.c --- wget-1.8.2/src/utils.c Sat May 18 04:05:22 2002 +++ wget-1.8.2.new/src/utils.c Mon May 27 19:44:40 2002 @@ -1504,8 +1504,13 @@ SYSTEMTIME st; GetSystemTime (st); SystemTimeToFileTime (st, ft); +# ifdef NO_ANONYMOUS_STRUCT + wt-wintime.u.HighPart = ft.dwHighDateTime; + wt-wintime.u.LowPart = ft.dwLowDateTime; +# else wt-wintime.HighPart = ft.dwHighDateTime; wt-wintime.LowPart = ft.dwLowDateTime; +# endif #endif } @@ -1533,8 +1538,13 @@ ULARGE_INTEGER uli; GetSystemTime (st); SystemTimeToFileTime (st, ft); +# ifdef NO_ANONYMOUS_STRUCT + uli.u.HighPart = ft.dwHighDateTime; + uli.u.LowPart = ft.dwLowDateTime; +# else uli.HighPart = ft.dwHighDateTime; uli.LowPart = ft.dwLowDateTime; +# endif return (long)((uli.QuadPart - wt-wintime.QuadPart) / 1); #endif } diff -ru wget-1.8.2/windows/Makefile.src.bor wget-1.8.2.new/windows/Makefile.src.bor --- wget-1.8.2/windows/Makefile.src.bor Tue Dec 4 10:33:18 2001 +++ wget-1.8.2.new/windows/Makefile.src.bor Wed May 29 12:20:51 2002 @@ -2,17 +2,25 @@ ## Makefile for use with watcom win95/winnt executable. CC=bcc32 + +## Please choose the linker used by your compiler + +## Linker for Borland C++ 5.5 +#LINK=ilink32 + +## Linker for Borland C++ 4.5 LINK=tlink32 LFLAGS= -CFLAGS=-DWINDOWS -DHAVE_CONFIG_H -I. -H -H=wget.csm -w- +CFLAGS=-DWINDOWS=1 -DHAVE_CONFIG_H -I. -H -H=wget.csm -w- ## variables -OBJS=cmpt.obj connect.obj fnmatch.obj ftp.obj ftp-basic.obj \ - ftp-ls.obj ftp-opie.obj getopt.obj headers.obj host.obj html.obj \ - http.obj init.obj log.obj main.obj gnu-md5.obj netrc.obj rbuf.obj \ - alloca.obj \ - recur.obj res.obj retr.obj url.obj utils.obj version.obj mswindows.obj +OBJS=cmpt.obj safe-ctype.obj connect.obj fnmatch.obj ftp.obj ftp-basic.obj \ + ftp-ls.obj ftp-opie.obj getopt.obj hash.obj headers.obj html-parse.obj \ + html-url.obj progress.obj host.obj cookies.obj http.obj init.obj \ + log.obj main.obj gen-md5.obj gnu-md5.obj netrc.obj rbuf.obj \ + snprintf.obj recur.obj res.obj retr.obj url.obj utils.obj version.obj \ + mswindows.obj LIBDIR=$(MAKEDIR)\..\lib @@ -20,7 +28,7 @@ $(LINK) @| $(LFLAGS) -Tpe -ap -c + $(LIBDIR)\c0x32.obj+ -alloca.obj+ +snprintf.obj+ version.obj+ utils.obj+ url.obj+ @@ -37,9 +45,10 @@ log.obj+ init.obj+ http.obj+ -html.obj+ host.obj+ headers.obj+ +html-parse.obj+ +html-url.obj+ getopt.obj+ ftp-opie.obj+ ftp-ls.obj+ @@ -47,7 +56,10 @@ ftp.obj+ fnmatch.obj+ connect.obj+ -cmpt.obj +cmpt.obj+ +hash.obj+ +cookies.obj+ +safe-ctype.obj $,$* $(LIBDIR)\import32.lib+ $(LIBDIR)\cw32.lib diff -ru wget-1.8.2/windows/config.h.bor wget-1.8.2.new/windows/config.h.bor --- wget-1.8.2/windows/config.h.bor Sat May 18 04:05:28 2002 +++ wget-1.8.2.new/windows/config.h.bor Wed May 29 12:22:46 2002 @@ -1,5 +1,6 @@ /* Configuration header file. - Copyright (C) 1995, 1996, 1997, 1998 Free Software Foundation, Inc. 
+ Copyright (C) 1995, 1996, 1997, 1998, 2001, 2002 + Free Software Foundation, Inc. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by @@ -29,36 +30,23 @@ #ifndef CONFIG_H #define CONFIG_H -/* Define if you have the alloca.h header file. */ -#undef HAVE_ALLOCA_H +#define ftruncate chsize -/* AIX requires this to be the first thing in the file. */ -#ifdef __GNUC__ -# define alloca __builtin_alloca -#else -# if HAVE_ALLOCA_H -# include alloca.h -# else -# ifdef _AIX - #pragma alloca -# else -# ifndef alloca /* predefined by HP cc +Olibcalls */ -char *alloca (); -# endif -# endif -# endif -#endif +/* mswindows.h defines vsnprintf as _vsnprintf and snprintf as _snprintf + so work around that here. This is a temporary hack. The defines + in mswindows.h should be moved into config.h.ms. */ +#define _vsnprintf vsnprintf +#define _snprintf snprintf -/* Define if on AIX 3. - System headers sometimes define this. - We just want to avoid a redefinition error message. */ -#ifndef _ALL_SOURCE -/* #undef _ALL_SOURCE */ -#endif +/* Define if you have the alloca.h header file. */ +#undef HAVE_ALLOCA_H /* Define to empty if the keyword does not work. */ /* #undef const */ +/* Define to empty or
Re: query compiling wget 1.8.1 on Borland C++ 4.5
On Sat, 25 May 2002 19:03:45 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: Ian Abbott [EMAIL PROTECTED] writes: The 1.8.2 branch is pretty similar to 1.8.1 at the moment and doesn't compile with any version of Borland C++. Should we care to fix that before the release? I'm not sure how important error-free compilation under Borland is. I could attempt to get it to compile on Borland C++ 4.5. I'm not sure which previous releases compiled okay with that compiler, though. The main branch recently compiled okay with a later Borland compiler (Borland C++ 5.5.1) thanks to Chin-yuan Kuo. (This compiler was originally part of Borland's C++ Builder package, but is now available as a free (as in beer) download from Borland.) This compile is also broken at the moment, but just needs WGET_USE_STDARG defining in config.h.bor. I'll add that change to the main branch shortly. It would be easier to just apply this change to 1.8.2 than to make 1.8.2 compile with the older compiler package, I think, but I'll try and compile it with the older compiler and see if I can get anywhere with it. FWIW, the 1.8.2 branch compiles fine with the Watcom C++ 11.0 compiler.
Re: win32: how to send wget output to console and a log-file?
On Fri, 24 May 2002 20:34:38 +0400, Valery Kondakoff [EMAIL PROTECTED] wrote: I'm not sure I understand what exactly '2>&1' means. As far as I understand '>' is a redirection sign. So - '1' means stdout and '2' means stderr? They refer to the three standard file descriptors - 0 is standard input (stdin), 1 is standard output (stdout), 2 is standard error output (stderr). The '2>&1' means 'redirect standard error output to standard output'. This results in standard output becoming a combination of standard output and standard error output. There are other things you can do with them, such as: 1>file (stdout goes to file, same as >file) 2>file (stderr goes to file) >file 2>errfile (stdout goes to file, stderr goes to errfile) Some other combinations where the order matters: >file 2>&1 (both stderr and stdout go to file) 2>&1 >file (stdout goes to file, stderr goes to stdout) 2>&1 >file | command (stdout goes to file, stderr piped to command) BTW - if there are some plans to enhance wget logging possibilities? On a different thread (back in April) I suggested the following: |Perhaps we just need a --log-level=N option: | |Level 0: output just the LOG_ALWAYS messages. |Level 1: output the above and LOG_NOTQUIET messages. |Level 2: output the above and LOG_NONVERBOSE messages. |Level 3: output the above and LOG_VERBOSE messages. | |The --verbose option would be equivalent to --log-level=3 (the |default). | |The --non-verbose option would be equivalent to --log-level=2. | |The --quiet option would be equivalent to --log-level=1. However, I made a mistake in the above; the last line should have read: The --quiet option would be equivalent to --log-level=0. This means that none of the other options would be equivalent to --log-level=1. I suppose a --non-quiet option could be added for completeness, but the names of these options would be more horribly confused than they are at the moment. It would not be immediately obvious that the order of verbosity would then run: --quiet, --non-quiet, --non-verbose, --verbose.
Re: tag v:shapes ???
On Mon, 27 May 2002 16:22:57 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: Jacques Beigbeder [EMAIL PROTECTED] writes: I ran into a trouble with: wget -m http://some/site because of a line like: <img src="a.gif" v:shapes="..."> v:shapes contains a character ':', so a.gif isn't mirrored. Thanks for the report. I think I'll make NAME_CHAR_P much more forgiving about the type of characters it uses. Doing anything else is counter-productive, because too many pages use or leak weird characters in attribute names. This particular weird character looks like it's due to the use of XML namespaces. The colon separates the namespace prefix from the remainder of the attribute name.
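For what it's worth, a forgiving test along those lines could be as crude as the standalone sketch below (a hypothetical helper, not the actual NAME_CHAR_P definition): accept anything that is not whitespace or one of the few characters with special meaning inside a tag.

#include <ctype.h>
#include <stdio.h>

/* Hypothetical permissive test: treat a character as part of an
   attribute name unless it is whitespace or one of the characters that
   terminate a name inside a tag.  This lets namespace-prefixed names
   such as "v:shapes" through.  */
static int
forgiving_name_char_p (unsigned char c)
{
  return c != '\0' && !isspace (c) && c != '=' && c != '<' && c != '>';
}

int
main (void)
{
  printf ("%d\n", forgiving_name_char_p (':'));   /* prints 1 */
  return 0;
}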
Re: win32: how to send wget output to console and a log-file?
On Fri, 24 May 2002 15:41:01 +0400, Valery Kondakoff [EMAIL PROTECTED] wrote: Hello, Herold! 24 May 2002, you wrote to me: HH You could do something like tail -f on the logfile if you have a similar HH program installed, or log to output and | tee logfile, but all of those HH require another command. Thank you for your answer. I downloaded two win32 'tee' ports, and they works as expected when I'm entering in command line something like this: 'wget.exe -V | tee.exe wget.log', but after I enter 'wget.exe http://someurl.com | tee.exe wget.log' the 'wget.log' file remains empty... What is wrong? (WinXP Pro, GNU Wget 1.8.1+cvs). The '|' only redirects standard output, but wget writes to standard error output. To capture standard error output, you need a little utility that launches another program while capturing standard error output. While I was working somewhere else, I used a program called ftee (or ftee32) to do this, but I don't have a copy. The only references that I can find to this utility on the web indicate that it is part of Starbase's CodeWright product. (It's possible to download an evaluation version of this, but it seems a large download if you just want the little ftee utility, and legally, you shouldn't use it after the evaluation period expires.) Maybe Wget should have a -o - option to send logging output to standard output.
Re: win32: how to send wget output to console and a log-file?
On Fri, 24 May 2002 08:03:15 -0700 (PDT), Doug Kaufman [EMAIL PROTECTED] wrote: On Fri, 24 May 2002, Valery Kondakoff wrote: I downloaded two win32 'tee' ports, and they works as expected when I'm entering in command line something like this: 'wget.exe -V | tee.exe wget.log', but after I enter 'wget.exe http://someurl.com | tee.exe wget.log' the 'wget.log' file remains empty... What is wrong? (WinXP Pro, GNU Wget 1.8.1+cvs). Wget sends to stderr by default. Try wget -o - |tee wget.log. This should send output to stdout, which tee can then handle. That doesn't work. It just creates a file called -. Interestingly, I've just found out that Win NT's default command-line shell (cmd.exe) supports Unix-style redirectors. So you can use: C:\>wget http://someurl.com 2>&1 | tee wget.log That should work on Windows NT, 2000 and XP but won't work on Windows 95, 98 or ME as they use a different command-line shell (command.com).
Re: query compiling wget 1.8.1 on Borland C++ 4.5
On Wed, 22 May 2002 18:04:34 +0200, Herold Heiko [EMAIL PROTECTED] wrote: Latest cvs should compile correctly with borland compilers. The latest CVS (main branch) should compile correctly with Borland C++ 5.52 (which is a free download from Borland's site), but will not compile with earlier versions. Or, the upcoming 1.8.2 release. The 1.8.2 branch is pretty similar to 1.8.1 at the moment and doesn't compile with any version of Borland C++.
Re: Wget 1.8.2-pre1 ready for testing
On Tue, 21 May 2002 19:24:01 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: [Windows '?' problem] Ian, feel free to apply the necessary change to the 1.8.2 branch. Okay, I'll do it after work today. I've been a little busy the last few days!
Re: Wget 1.8.2-pre1 ready for testing
On Tue, 21 May 2002 06:04:59 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote: As promised, here comes the first (and hopefully only) pre-test for the 1.8.2 bugfix release. Get it from: http://fly.srk.fer.hr/~hniksic/wget-1.8.2-pre2.tar.gz Windows versions will still have problems saving filenames with the query character '?' in them. Should we introduce a temporary change to remap this to something else (e.g. '@') in the Windows version of Wget 1.8.2?
Re: FTP wildcards
On Fri, 17 May 2002 11:24:25 +0100, Ian Abbott [EMAIL PROTECTED] wrote: On Fri, 17 May 2002 08:34:27 +0200, Jan Klepac [EMAIL PROTECTED] wrote: I'd like to download all archive files wn16pcm.r[0..9][0..9] from the directory on ftp server but wget --passive-ftp ftp://ftp.ims.uni-stuttgart.de/pub/WordNet/1.6/wn16pcm.r* doesn't work and I cannot find what is wrong. Wget doesn't like the foreign dates in the directory listings. Any advice appreciated. I forgot to mention that you could try a different WordNet mirror. This doesn't solve the Wget problem, but Wget should cope better if you use an FTP server that uses English dates in the FTP listings.
Re: gopher support?
On Fri, 17 May 2002 12:41:21 +0200, Stephan Beyer [EMAIL PROTECTED] wrote: not interested in adding the Gopher feature to wget or should I still wait some time? I have no objections to adding gopher support, but it's up to the main developer (Hrvoje Niksic) whether it ends up in GNU Wget. I think he's a bit busy with his real job at the moment. I think your gopher code is still in its early phases of development at the moment. Maybe when some of your planned extra functionality is added it will stand a better chance of being accepted. In general, options should work the same for gopher as they do for http and ftp as much as possible. Since most of the patch is self-contained in gopher.c, you should be able to continue working on it without being affected by other changes in CVS too much. One minor comment about source layout: you have made some effort to conform to the GNU style, but your tabs are a bit screwy. Tabs should be 8 spaces, but indents before and after brackets should be 2 spaces.
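To illustrate the layout being asked for (2-space indentation steps, braces on their own lines, and tabs only where they stand for 8 columns), a trivial made-up example:

#include <stdio.h>

static void
print_even (int n)
{
  int i;

  for (i = 0; i < n; i++)
    {
      if (i % 2 == 0)
        printf ("%d\n", i);
    }
}

int
main (void)
{
  print_even (10);
  return 0;
}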
Re: question about wget flavor
On Fri, 17 May 2002 16:59:07 +0400, Pavel Stepchenko [EMAIL PROTECTED] wrote: #!/bin/sh wget="/usr/local/bin/wget -t0 -nr -nc -x --timeout=20 --wait=61 --waitretry=120" $wget ftp://nonanonymous:[EMAIL PROTECTED]/file1.zip sleep 60 $wget ftp://nonanonymous:[EMAIL PROTECTED]/file2.zip Why WGET can make a pause between 1st and 2nd retrieval? See the sleep command above. Nothing to do with Wget in this case!
Re: cookie pb: download one file on member area
On Wed, 15 May 2002 23:41:39 +0200, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi, I generate de file you wanted (with -d option). I also used --load-cookies option. The generated file can be found at: http://bigben.pointclark.net/~bertra_b/wget_debug Note: I replaced the values of the cookies by (or X). Hope this will help. If you want me to go further more, tell me. I guess the Attempt to fake the domain messages in the debug log are the major clue as to why it isn't working. The main maintainer of Wget (Hrvoje Niksic) changed the cookie domain matching rules recently and the new rules should work better. The new rules are implemented in the current development version of Wget available by anonymous CVS.
Re: wget paramter -
On Thu, 16 May 2002 12:22:42 +0200, Gurkan Sengun [EMAIL PROTECTED] wrote: what about this parameter With no FILE, or when FILE is -, read standard input. (read url's actually) This is not a bug. Please use [EMAIL PROTECTED] for feature requests. It's a nice idea, but rather than `-' it should be `-i -' as that is more consistent with the existing `-O -' usage.
Re: bug report and patch, HTTPS recursive get
On Wed, 15 May 2002 18:44:19 +0900, Kiyotaka Doumae [EMAIL PROTECTED] wrote: I found a bug of wget with HTTPS resursive get, and proposal a patch. Thanks for the bug report and the proposed patch. The current scheme comparison checks are getting messy, so I'll write a function to check schemes for similarity (when I can spare the time later today).
Re: question on printing to screen
On 12 May 2002 02:54:52 -0500, asher [EMAIL PROTECTED] wrote: hi, I've been trying to figure out how wget prints all over the screen with out using curses, and I'm hoping someone can help. from the code, I'm pretty sure it's just printing to the C-stream stderr, but I can't for the life of me figure out how it seeks or jumps around in the stream. any help would be appreciated. I assume you are referring to the progress bar. It just outputs a carriage return to return to the beginning of the current line without doing a linefeed.
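Here is a tiny standalone demonstration of the technique (nothing Wget-specific): write to stderr, end each update with '\r' instead of '\n', and flush, so the next update overwrites the previous one on the same line.

#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  int percent;

  for (percent = 0; percent <= 100; percent += 10)
    {
      /* '\r' moves the cursor back to column 0 without a line feed,
         so the next fprintf overwrites this text.  */
      fprintf (stderr, "downloaded: %3d%%\r", percent);
      fflush (stderr);
      sleep (1);
    }
  fputc ('\n', stderr);    /* finish off with a real newline */
  return 0;
}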
Re: Why must ftp_proxy be HTTP?
On Tue, 7 May 2002 17:18:57 +0800 , Fung Chai [EMAIL PROTECTED] wrote: I went through the source code (src/retr.c) of wget-1.8.1 and notice that the ftp_proxy must be HTTP; the user cannot specify it as ftp://proxy:port. In the direct mode (ie, use_proxy is set to false), retrieve_url() will use the FTP protocol to retrieve a file, but will use the HTTP protocol to retrieve the file via the proxy. Please try the current development version of Wget from the CVS repository, as this has support for FTP gateway proxy servers (FWTK-style proxies, according to the ChangeLog), but the functionality needs testing.
Re: Bug report
On Fri, 3 May 2002 18:37:22 +0200, Emmanuel Jeandel [EMAIL PROTECTED] wrote: ejeandel@yoknapatawpha:~$ wget -r a:b Segmentation fault Patient: Doctor, it hurts when I do this Doctor: Well don't do that then! Seriously, this is already fixed in CVS.
Re: problem: illegal f.s. chars in links
On Fri, 3 May 2002 14:14:37 +0200 , [EMAIL PROTECTED] wrote: Cannot write to `www.travelocity.com/Vacations/0,,TRAVELOCITY||Y,00.html@HPTRACK=icon_vac' (No such file or directory). Presumably this happens because the pipes, in particular, are illegal chars for a filename. So my question is: Correct. At least when running on Windows. Is there any chance of adding an option to translate illegal characters into legal ones both in filenames and in the links to those files? There are plans to make sure that the desired filenames get mapped to legal ones before Wget 1.9 is released, but no specific timescale that I'm aware of. There might be some options to fine tune the set of illegal characters, but the default set of illegal characters will vary depending on the platform. That should solve most problems with filenames on Windows, but probably won't deal with issues such as clashes with DOS device names such as com1, prn, nul etc., particularly as the standard set of such names can be extended willy-nilly! Links will be converted to relative links to downloaded files using their converted filenames with the -k option. (You may notice that although I'm forced to use Outlook on Windows NT to write this mail, I'm using bash and wget to do the actual work; hopefully this will improve the standing of Free software within this company, or within the QA teams at least.) With Cygwin? One possibility may be to run the tests on Linux, assuming your Linux product uses the same virus scanning algorithms and patterns as your Windows product. I'd do this myself if I knew how to write C: alas, I'm a Perl monkey myself. I thought the term was monger?
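If anyone wants to experiment before such an option exists, the core of it is just a character remapping; a rough sketch (hypothetical helper, using the usual Windows set of forbidden characters and '@' as an arbitrary replacement):

#include <stdio.h>
#include <string.h>

/* Hypothetical helper: rewrite, in place, characters that Windows
   filesystems refuse in a file name component.  The replacement
   character and the exact character set would presumably be
   configurable.  */
static void
sanitize_windows_filename (char *name, char replacement)
{
  static const char forbidden[] = "<>:\"/\\|?*";
  char *p;

  for (p = name; *p; p++)
    if (strchr (forbidden, *p) != NULL)
      *p = replacement;
}

int
main (void)
{
  char name[] = "0,,TRAVELOCITY||Y,00.html?HPTRACK=icon_vac";

  sanitize_windows_filename (name, '@');
  printf ("%s\n", name);
  return 0;
}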
Re: W32.Klez.E removal tools
On Wed, 1 May 2002 22:12:08 +0300, robots [EMAIL PROTECTED] wrote: F-Secure give you the W32.Klez.E removal tools W32.Klez.E is a dangerous virus that spread through email. For more information,please visit http://www.F-Secure.com Just in case there are one or two stupid people out there who take everything at face value, the attachment to the above message is not a disinfectant for any virus. It should come as no surprise to most people that it is in fact intended to infect your computer's Microsoft Email program with a virus (Klez.H). The real disinfectant for this virus can be found here: ftp://ftp.f-secure.com/anti-virus/tools/fsklez.exe. For more info on this tool, see: ftp://ftp.f-secure.com/anti-virus/tools/fsklez.txt. For other free virus removal programs from the same site, see: http://www.f-secure.com/download-purchase/tools.shtml. For more info on the Klez worms, see: http://www.europe.f-secure.com/v-descs/klez.shtml.
Re: HREF=//domain.com
On Mon, 29 Apr 2002 12:03:23 -0500 (CDT), you wrote: While using wget with www.slashdot.org, the site makes use of HREF's in the following manner '<A HREF="//slashdot.org/image.gif">'. It appears that when wget is following the link, it is then looking for "http://www.slashdot.org//slashdot.org/image.gif" which is incorrect. That's fixed in CVS, so you can either build and install the version in CVS or wait for the next official release. Or just apply this patch to Wget 1.8.1: Index: src/url.c === RCS file: /pack/anoncvs/wget/src/url.c,v retrieving revision 1.67 retrieving revision 1.68 diff -u -r1.67 -r1.68 --- src/url.c 2001/12/14 15:45:59 1.67 +++ src/url.c 2002/01/14 01:56:40 1.68 @@ -1575,6 +1575,37 @@ memcpy (constr + baselength, link, linklength); constr[baselength + linklength] = '\0'; } + else if (linklength > 1 && *link == '/' && *(link + 1) == '/') + { + /* LINK begins with "//" and so is a net path: we need to +replace everything after (and including) the double slash +with LINK. */ + + /* uri_merge("foo", "//new/bar") -> "//new/bar" */ + /* uri_merge("//old/foo", "//new/bar") -> "//new/bar" */ + /* uri_merge("http://old/foo", "//new/bar") -> "http://new/bar" */ + + int span; + const char *slash; + const char *start_insert; + + /* Look for first slash. */ + slash = memchr (base, '/', end - base); + /* If found slash and it is a double slash, then replace +from this point, else default to replacing from the +beginning. */ + if (slash && *(slash + 1) == '/') + start_insert = slash; + else + start_insert = base; + + span = start_insert - base; + constr = (char *)xmalloc (span + linklength + 1); + if (span) + memcpy (constr, base, span); + memcpy (constr + span, link, linklength); + constr[span + linklength] = '\0'; + } else if (*link == '/') { /* LINK is an absolute path: we need to replace everything
Re: segmentation fault on bad url
On 22 Apr 2002 at 21:38, Renaud Saliou wrote: Hi, wget -t 3 -d -r -l 3 -H --random-wait -nd --delete-after -A.jpg,.gif,.zip,.png,.pdf http://http://www.microsoft.com DEBUG output created by Wget 1.8.1 on linux-gnu. zsh: segmentation fault wget -t 3 -d -r -l 3 -H --random-wait -nd --delete-after It looks like this has been fixed in the current CVS version (actually a few days old): $ wget -t 3 -d -r -l 3 -H --random-wait -nd --delete-after \ -A.jpg,.gif,.zip,.png,.pdf http://http://www.microsoft.com DEBUG output created by Wget 1.8.1+cvs on linux-gnu. http://http://www.microsoft.com: Bad port number. FINISHED --10:36:45-- Downloaded: 0 bytes in 0 files
Re: add tar option
On 23 Apr 2002 at 18:19, Hrvoje Niksic wrote: On technical grounds, it might be hard to shoehorn Wget's mode of operation into what `tar' expects. For example, Wget might need to revisit directories in random order. I'm not sure if a tar stream is allowed to do that. You can add stuff to a tar stream in a pretty much random order - that's effectively what you get when you use tar's -r option to append to the end of an existing archive. (I used to use that with tapes quite often, once upon a time.)
Re: ScanMail Message: To Recipient virus found or matched file blocki ng setting.
On 19 Apr 2002 at 10:42, Daniel Stenberg wrote: On Fri, 19 Apr 2002, System Attendant wrote: ScanMail for Microsoft Exchange has taken action on the message, please refer to the contents of this message for further details. Please. Can the admin of this ScanMail stop polluting this list even more? Looks like it's been configured to notify everyone associated with the email about the virus. Can the admin of the wget list please prevent his mails from showing up here? I suppose one such notification could be deemed useful, but if several such notifications arrive from various ScanMails for every virus it would be more of a PITA. Of course it would be even more of a PITA if ScanMail was sending such notifications to everybody concerned for emails that *might* contain an unknown virus due to some policy setting (e.g. email contains a .bat attachment). We don't need replies on all spam mails telling us that the spam contained viruses. One could be useful, assuming people recognize the ScanMail messages and read them *before* the infected mail! (Hmm - that looks like a good way to disguise a virus - make it look like an anti-virus notification!) Of course, we wouldn't get so many if the list filtered viruses as effectively as ScanMail!
Re: Validating cookie domains
On 19 Apr 2002 at 16:30, Hrvoje Niksic wrote: To quote from there: [...] Only hosts within the specified domain can set a cookie for a domain and domains must have at least two (2) or three (3) periods in them to prevent domains of the form: .com, .edu, and va.us. Any domain that fails within one of the seven special top level domains listed below only require two periods. Any other domain requires at least three. The seven special top level domains are: COM, EDU, NET, ORG, GOV, MIL, and INT. This is amazingly stupid. It seems to make more sense if you subtract one from the number of periods. It means that `www.arsdigita.de' cannot set the cookie for `arsdigita.de'. To make *that* work, you'd have to maintain a database of domains that use .co.xxx convention, as opposed to those that use just .xxx. Could you assume that all two-letter TLDs are country-code TLDs and require one more period than other TLDs (which are presumably at least three characters long)?
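Taken literally, the quoted rule boils down to counting periods and special-casing seven top-level domains; a throwaway sketch of that check (not the code Wget uses):

#include <stdio.h>
#include <string.h>
#include <strings.h>

static int
count_periods (const char *domain)
{
  int n = 0;
  for (; *domain; domain++)
    if (*domain == '.')
      n++;
  return n;
}

/* Literal reading of the old Netscape rule: domains ending in one of
   the seven special TLDs need two periods, everything else needs
   three.  */
static int
domain_has_enough_periods (const char *domain)
{
  static const char *special[] =
    { ".com", ".edu", ".net", ".org", ".gov", ".mil", ".int" };
  size_t i, len = strlen (domain);
  int required = 3;

  for (i = 0; i < sizeof special / sizeof special[0]; i++)
    {
      size_t slen = strlen (special[i]);
      if (len >= slen && strcasecmp (domain + len - slen, special[i]) == 0)
        {
          required = 2;
          break;
        }
    }
  return count_periods (domain) >= required;
}

int
main (void)
{
  /* Prints "1 0": .arsdigita.com passes, .arsdigita.de does not.  */
  printf ("%d %d\n",
          domain_has_enough_periods (".arsdigita.com"),
          domain_has_enough_periods (".arsdigita.de"));
  return 0;
}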
Re: wget-1.8.1: build failure on SGI IRIX 6.5 with c89
On 11 Apr 2002 at 18:55, Nelson H. F. Beebe wrote: what happens if you configure it with the option --x-includes=/usr/local/include ? On SGI IRIX 6.5, in a clean directory, I unbundled wget-1.8.1.tar.gz, and did this: % env CC=c89 ./configure --x-includes=/usr/local/include % grep HAVE_NLS src/config.h #define HAVE_NLS 1 % grep HAVE_LIBINTL_H src/config.h /* #undef HAVE_LIBINTL_H */ Okay so --x-includes didn't achieve much. I thought the x might stand for 'extra', but I guess it must be for the X Window System, and therefore irrelevant to Wget. How about: % env CC=c89 CPPFLAGS='-I/usr/local/include' ./configure There's got to be some way to get this thing to build! I just tried moving libintl.h into /usr/local/include on my machine and doing something similar: bash$ CC=cc CPPFLAGS='-I/usr/local/include' ./configure and it managed to set both HAVE_NLS and HAVE_LIBINTL_H in the resulting src/config.h and it managed to build okay.
Re: Referrer Faking and other nifty features
On 12 Apr 2002 at 17:21, Thomas Lussnig wrote: So that if one fd become -1 the loader take an new url and initate the download. And than sheduling would work with the select(int,) what about this idee ? It would certainly make handling the logging output a bit of a challenge, especially the progress indication.
Re: No clobber and .shtml files
On 11 Apr 2002 at 21:00, Hrvoje Niksic wrote: This change is fine with me. I vaguely remember that this test is performed in two places; you might want to create a function. Certainly. Where's the best place for it? utils.c?
Re: No clobber and .shtml files
On 11 Apr 2002 at 21:00, Hrvoje Niksic wrote: This change is fine with me. I vaguely remember that this test is performed in two places; you might want to create a function. I've found three places where it checks the suffix, so I called a new function in all three places for consistency. One of those places performed a case-insensitive comparison so I made my function do that too. Hrvoje, you may wish to review whether checking the new extensions in all three places (but particularly recur.c) is a good idea or not before I commit the patch. src/ChangeLog entry: 2002-04-12 Ian Abbott [EMAIL PROTECTED] * utils.c (has_html_suffix_p): New function to text filename for common html extensions. * utils.h: Declare it. * http.c (http_loop): Use it instead of previous test. * retr.c (retrieve_url): Ditto. * recur.c (download_child_p): Ditto. Index: src/http.c === RCS file: /pack/anoncvs/wget/src/http.c,v retrieving revision 1.86 diff -u -r1.86 http.c --- src/http.c 2002/04/11 17:49:32 1.86 +++ src/http.c 2002/04/12 17:35:02 @@ -1405,7 +1405,7 @@ int use_ts, got_head = 0;/* time-stamping info */ char *filename_plus_orig_suffix; char *local_filename = NULL; - char *tms, *suf, *locf, *tmrate; + char *tms, *locf, *tmrate; uerr_t err; time_t tml = -1, tmr = -1; /* local and remote time-stamps */ long local_size = 0; /* the size of the local file */ @@ -1465,9 +1465,8 @@ *dt |= RETROKF; /* Bogusness alert. */ - /* If its suffix is html or htm, assume text/html. */ - if (((suf = suffix (*hstat.local_file)) != NULL) - (!strcmp (suf, html) || !strcmp (suf, htm))) + /* If its suffix is html or htm or similar, assume text/html. */ + if (has_html_suffix_p (*hstat.local_file)) *dt |= TEXTHTML; FREE_MAYBE (dummy); Index: src/recur.c === RCS file: /pack/anoncvs/wget/src/recur.c,v retrieving revision 1.43 diff -u -r1.43 recur.c --- src/recur.c 2002/02/19 06:09:57 1.43 +++ src/recur.c 2002/04/12 17:35:02 @@ -510,7 +510,6 @@ /* 6. */ { -char *suf; /* Check for acceptance/rejection rules. We ignore these rules for HTML documents because they might lead to other files which need to be downloaded. Of course, we don't know which @@ -521,14 +520,13 @@ * u-file is not (i.e. it is not a directory) and either: + there is no file suffix, -+ or there is a suffix, but is not html or htm, ++ or there is a suffix, but is not html or htm or similar, + both: - recursion is not infinite, - and we are at its very end. */ if (u-file[0] != '\0' -((suf = suffix (url)) == NULL - || (0 != strcmp (suf, html) 0 != strcmp (suf, htm)) +(!has_html_suffix_p (url) || (opt.reclevel != INFINITE_RECURSION depth = opt.reclevel))) { if (!acceptable (u-file)) Index: src/retr.c === RCS file: /pack/anoncvs/wget/src/retr.c,v retrieving revision 1.50 diff -u -r1.50 retr.c --- src/retr.c 2002/01/30 19:12:20 1.50 +++ src/retr.c 2002/04/12 17:35:03 @@ -384,12 +384,11 @@ /* There is a possibility of having HTTP being redirected to FTP. In these cases we must decide whether the text is HTML -according to the suffix. The HTML suffixes are `.html' and -`.htm', case-insensitive. */ +according to the suffix. The HTML suffixes are `.html', +`.htm' and a few others, case-insensitive. 
*/ if (redirection_count local_file u-scheme == SCHEME_FTP) { - char *suf = suffix (local_file); - if (suf (!strcasecmp (suf, html) || !strcasecmp (suf, htm))) + if (has_html_suffix_p (local_file)) *dt |= TEXTHTML; } } Index: src/utils.c === RCS file: /pack/anoncvs/wget/src/utils.c,v retrieving revision 1.44 diff -u -r1.44 utils.c --- src/utils.c 2002/01/17 01:03:33 1.44 +++ src/utils.c 2002/04/12 17:35:03 @@ -792,6 +792,30 @@ return NULL; } +/* Checks whether a filename is has a typical HTML suffix or not. The + following suffixes are presumed to be html files (case insensitive): + + html + htm + ?html (where ? is any character) + + This is not necessarily a good indication that the file actually contains + HTML! */ +int has_html_suffix_p (const char *fname) +{ + char *suf; + + if ((suf = suffix (fname)) == NULL) +return 0; + if (!strcasecmp (suf, html)) +return 1; + if (!strcasecmp (suf, htm)) +return 1; + if (suf[0] !strcasecmp (suf + 1, html)) +return 1; + return 0; +} + /* Read a line from FP and return the pointer
Re: Your Mailing List Subscription
On 12 Apr 2002 at 14:12, [EMAIL PROTECTED] wrote: IGaming Exchange and IGaming News News Letter information You have chosen to remove yourself from all of the IGaming Exchange and IGaming News email list. If you have any questions or comments about the news letters please feel free to contact [EMAIL PROTECTED] Thank you, The River City Group Team I'm not sure which helpful person subscribed [EMAIL PROTECTED] to the above mailing lists in the first place, but hopefully I've done the right thing by unsubscribing them again!
Re: wget-1.8.1: build failure on SGI IRIX 6.5 with c89
On 11 Apr 2002 at 19:14, Hrvoje Niksic wrote: Nelson H. F. Beebe [EMAIL PROTECTED] writes: c89 -I. -I. -I/opt/include -DHAVE_CONFIG_H -DSYSTEM_WGETRC=\/usr/local/etc/wgetrc\ -DLOCALEDIR=\/usr/local/share/locale\ -O -c connect.c cc-1164 c89: ERROR File = connect.c, Line = 94 Argument of type int is incompatible with parameter of type const char *. logprintf (LOG_VERBOSE, _(Connecting to %s[%s]:%hu... ), ^ cc-1164 c89: ERROR File = connect.c, Line = 97 Argument of type int is incompatible with parameter of type const char *. The argument of type int is probably an indication that the `_' macro is either undefined or expands to an undeclared function. The compiler rightfully assumes the function to return int and complains about the type mismatch. If you check why the macro is misdeclared, you'll likely discover the source of the problem. Perhaps HAVE_NLS is defined but HAVE_LIBINTL_H isn't defined. That would cause '_(string)' to expand to 'gettext (string)' but with no declaration of the gettext() function, causing the compiler to assume a default declaration of 'int gettext()'. I think we need to examine the 'config.log' file produced when running './configure'.
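To make the suspected failure mode concrete: the `_' macro is normally arranged roughly as below (simplified, not Wget's exact headers). If HAVE_NLS is defined but libintl.h never gets included, gettext() ends up undeclared, an old C compiler assumes it returns int, and every _(...) argument then looks like an int to the caller, which matches the errors quoted above.

/* Simplified sketch of the usual gettext glue. */
#ifdef HAVE_NLS
# ifdef HAVE_LIBINTL_H
#  include <libintl.h>              /* declares char *gettext (const char *); */
# endif
# define _(string) gettext (string) /* undeclared if libintl.h was skipped */
#else
# define _(string) (string)
#endif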
Re: LAN with Proxy, no Router
On 10 Apr 2002 at 3:09, Jens Rösner wrote: wgetrc works fine under windows (always has) however, .wgetrc is not possible, but maybe . does mean in root dir under Unix? The code does different stuff for Windows. Instead of looking for '.wgetrc' in the user's home directory, it looks for a file called 'wget.ini' in the directory that contains the executable. This does not seem to be mentioned anywhere in the documentation.
Re: -nv option; printing out infos via stderr [http://bugs.debian.org/141323]
On 9 Apr 2002 at 10:34, Hrvoje Niksic wrote: Ian Abbott [EMAIL PROTECTED] writes: On 5 Apr 2002 at 18:17, Noel Koethe wrote: Will this be changed so the user could use -nv with /dev/null and get only errors or warnings displayed? So what I think you want is for any log message tagged as LOG_VERBOSE (verbose information) or LOG_NONVERBOSE (basic information) in the source to go to stdout when no log file has been specified and the `-O -' option has not been used and for everything else to go to stderr? That change sounds dangerous. Current Wget output doesn't really have a concept of errors that would be really separate from other output; it only operates on the level of verbosity. This was, of course, a bad design decision, and I agree that steps need to be taken to change it. I'm just not sure that this is the right step. Neither am I, but I knocked up the patch on a whim. Suddenly `wget -o X' is no longer equivalent to `wget 2>x', which violates the Principle of Least Surprise. Perhaps we just need a --log-level=N option: Level 0: output just the LOG_ALWAYS messages. Level 1: output the above and LOG_NOTQUIET messages. Level 2: output the above and LOG_NONVERBOSE messages. Level 3: output the above and LOG_VERBOSE messages. The --verbose option would be equivalent to --log-level=3 (the default). The --non-verbose option would be equivalent to --log-level=2. The --quiet option would be equivalent to --log-level=1. Noel would specify --log-level=1 to get the output he wants. How does that sound?
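At the point where each message is emitted, the whole option would boil down to one comparison; a rough sketch, with a made-up ordering and names (the real enum in wget.h is laid out differently):

#include <stdio.h>

/* Hypothetical ordering: higher value = less important message. */
enum log_options { LOG_ALWAYS, LOG_NOTQUIET, LOG_NONVERBOSE, LOG_VERBOSE };

/* A message tagged O is printed only if the requested level admits it,
   e.g. --log-level=1 lets through LOG_ALWAYS and LOG_NOTQUIET only.  */
static int
message_wanted_p (enum log_options o, int log_level)
{
  return (int) o <= log_level;
}

int
main (void)
{
  printf ("%d %d\n",
          message_wanted_p (LOG_NOTQUIET, 1),   /* 1: printed */
          message_wanted_p (LOG_VERBOSE, 1));   /* 0: suppressed */
  return 0;
}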
Re: getting time stamp via FTP
On 8 Apr 2002 at 11:43, Urs Thuermann wrote: Please CC: any answers to my email address, since I'm not on this list. I'd like wget to get the time stamp of a file that is downloaded via FTP and to set the mtime after writing the file to the local disk. When using HTTP, this already happens, i.e. when doing a wget http://host/file the file has the same time stamp in the local file system as on the remote server, but not with FTP. FTP supports the MODTIME command to get the time stamp of a file from the server. Could wget be changed to use this? the modtime command supported by some clients uses an FTP extension (MDTM). How widely is this supported by FTP servers? Wget recently adopted use of another extension (SIZE) and has long supported another extension (REST), so it could potentially adopt other extensions if commonly used. Currently, Wget extracts the timestamp from a directory listing of the file, but that doesn't always work, as the format for the directory listing is not standardized. Ideally, I think Wget should only have to fall back on old-style directory listings as a last resort, but that will have to wait a few years for newer mechanisms to be standardized and commonly adopted (i.e. the MLST/MLSD extensions). These links may be useful: http://www.ietf.org/html.charters/ftpext-charter.html http://www.ietf.org/internet-drafts/draft-ietf-ftpext-mlst-15.txt
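For reference, servers that implement MDTM answer with a 213 reply carrying a GMT timestamp, something like "213 20020408114300"; turning that into a broken-down time is straightforward (a sketch, ignoring error handling and timezone subtleties):

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Parse a "213 YYYYMMDDHHMMSS" MDTM reply into a broken-down time.
   Returns 0 on success, -1 on a malformed reply.  */
static int
parse_mdtm_reply (const char *reply, struct tm *tm)
{
  int year, mon, day, hour, min, sec;

  if (sscanf (reply, "213 %4d%2d%2d%2d%2d%2d",
              &year, &mon, &day, &hour, &min, &sec) != 6)
    return -1;
  memset (tm, 0, sizeof *tm);
  tm->tm_year = year - 1900;
  tm->tm_mon = mon - 1;
  tm->tm_mday = day;
  tm->tm_hour = hour;
  tm->tm_min = min;
  tm->tm_sec = sec;
  return 0;
}

int
main (void)
{
  struct tm tm;

  if (parse_mdtm_reply ("213 20020408114300", &tm) == 0)
    printf ("%04d-%02d-%02d %02d:%02d:%02d GMT\n",
            tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
            tm.tm_hour, tm.tm_min, tm.tm_sec);
  return 0;
}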
Re: getting time stamp via FTP
On 9 Apr 2002 at 16:52, Ian Abbott wrote: Wget recently adopted use of another extension (SIZE) and has long supported another extension (REST), so it could potentially adopt other extensions if commonly used. Correction: 'REST' is a standard FTP protocol command, not an extension.
Re: forcing file overwrite
On 4 Apr 2002 at 17:13, Matthew Boedicker wrote: I am trying to wget Apache log files (via ftp) and since the new file will always contain at least the old, I want it to overwrite the file each time. Is there any way to do this? If there isn't, may I suggest it as a new option? I agree a new option to force clobbering would be nice. In the meantime, a workaround for your case would be to use the -N (--timestamping) option, which should have the desired effect.
Re: -nv option; printing out infos via stderr [http://bugs.debian.org/141323]
On 5 Apr 2002 at 18:17, Noel Koethe wrote: Will this be changed so the user could use -nv with /dev/null and get only errors or warnings displayed? So what I think you want is for any log message tagged as LOG_VERBOSE (verbose information) or LOG_NONVERBOSE (basic information) in the source to go to stdout when no log file has been specified and the `-O -' option has not been used and for everything else to go to stderr? I'm not sure what Hrvoje Niksic thinks of that idea, but here is a source code patch to accomplish it. I'd like some second opinions (preferably from Hrvoje) before committing it. The patch does not include any documentation changes - these will follow if the patch is committed. N.B. The patch contains a form-feed. I'm not sure if that will survive the email passage. 2002-04-05 Ian Abbott [EMAIL PROTECTED] * wget.h (enum log_options): Set order to `LOG_VERBOSE', `LOG_NONVERBOSE', `LOG_NOTQUIET', `LOG_ALWAYS' to reflect relative importance of the log messages to which they are associated. * log.c (get_log_fp): Add parameter to indicate logging level. If a log file is not being used, send `LOG_VERBOSE' and `LOG_NONVERBOSE' logs to `stdout' instead of to `stderr', except when output documents are going to `stdout'. (logputs): Pass logging level to `get_log_fp()'. (logvprintf_state): Include logging level in the state. (logvprintf): Pass logging level (from passed state) to `get_log_fp()'. (logflush): If some logs go to `stderr' and some to `stdout', ensure that both streams get flushed. (logprintf): Put logging level in state passed to `logvprintf()'. (debug_logprintf): Put `LOG_VERBOSE' logging level in state passed to `logvprintf()'. (log_init): If no log file specified, don't set `logfp' to `stderr' - leave it set to NULL so that `get_log_fp()' can decide whether to return `stdout' or `stderr' based on the logging level (and other factors). In this case, ensure logs get saved to memory if either of `stderr' or `stdout' is a TTY. (log_dump_context): Use `logfp' value directly instead of calling `get_log_fp()'. Index: src/log.c === RCS file: /pack/anoncvs/wget/src/log.c,v retrieving revision 1.12 diff -u -r1.12 log.c --- src/log.c 2001/12/19 09:36:58 1.12 +++ src/log.c 2002/04/05 18:03:44 @@ -287,12 +287,16 @@ If logging is inhibited, return NULL. */ static FILE * -get_log_fp (void) +get_log_fp (enum log_options o) { if (inhibit_logging) return NULL; if (logfp) return logfp; + if (opt.dfp == stdout) +return stderr; + if (o LOG_NOTQUIET) +return stdout; return stderr; } @@ -305,7 +309,7 @@ FILE *fp; check_redirect_output (); - if (!(fp = get_log_fp ())) + if (!(fp = get_log_fp (o))) return; CHECK_VERBOSE (o); @@ -322,6 +326,7 @@ char *bigmsg; int expected_size; int allocated; + enum log_options o; }; /* Print a message to the log. 
A copy of message will be saved to @@ -341,7 +346,7 @@ char *write_ptr = smallmsg; int available_size = sizeof (smallmsg); int numwritten; - FILE *fp = get_log_fp (); + FILE *fp = get_log_fp (state-o); if (!save_context_p) { @@ -411,9 +416,12 @@ void logflush (void) { - FILE *fp = get_log_fp (); - if (fp) -fflush (fp); + FILE *fp1 = get_log_fp (LOG_VERBOSE); + FILE *fp2 = get_log_fp (LOG_ALWAYS); + if (fp1) +fflush (fp1); + if (fp2 (fp2 != fp1)) +fflush (fp2); needs_flushing = 0; } @@ -497,6 +505,7 @@ CHECK_VERBOSE (o); memset (lpstate, '\0', sizeof (lpstate)); + lpstate.o = o; do { VA_START_2 (enum log_options, o, char *, fmt, args); @@ -532,6 +541,7 @@ return; memset (lpstate, '\0', sizeof (lpstate)); + lpstate.o = LOG_VERBOSE; do { VA_START_1 (char *, fmt, args); @@ -559,13 +569,10 @@ } else { - /* The log goes to stderr to avoid collisions with the output if - the user specifies `-O -'. Francois Pinard suggests - that it's a better idea to print to stdout by default, and to - stderr only if the user actually specifies `-O -'. He says - this inconsistency is harder to document, but is overall - easier on the user. */ - logfp = stderr; + /* LOG_NOTQUIET and LOG_ALWAYS logs will go to stdwrr. Other logs + will go to stdout unless the user specifies `-O -'. This allows + the user to redirect standard output but still see errors and + warnings if standard error is a TTY. */ /* If the output is a TTY, enable storing, which will make Wget remember all the printed messages, to be able to dump them to @@ -573,7 +580,7 @@ Ctrl+Break is pressed under Windows). */ if (1 #ifdef
Re: URI-parsing bug
On 4 Apr 2002 at 5:51, Tristan Horn wrote: Just wanted to point out that as of version 1.8.1, wget doesn't correctly recognize <A HREF="//foo/bar">-style links. tris.net/index.html: merge("http://tris.net/", "//www.arrl.org/") -> http://tris.net//www.arrl.org/ (it should return http://www.arrl.org/) There haven't been any releases since 1.8.1, but this bug is fixed in the current CVS version.
Re: Serious bug in recursive retrieval behaviour occured in v. 1.8
On 4 Apr 2002 at 13:21, Robert Mücke wrote: So it seems to be important to correct this behaviour. I think you only need to set up a test site (maybe with some subdirs) containing one file with an errorous href= tag to reproduce this (maybe only in parts depending on your server configuration). I couldn't reproduce this with wget 1.8 and a local Apache server (but I didn't attempt to reconfigure Apache in an attempt to reproduce it). A few recursive retrieval bugs were fixed in wget 1.8.1. Is it possible for you to test that version? (You may want to limit the recursion depth and the maximum amount to download if repeating the test!)
Re: cuj.com file retrieving fails -why?
On 3 Apr 2002 at 14:56, Markus Werle wrote: Jens Rösner wrote: So, I do not know what your problem is, but is neither wget's nor cuj's fault, AFAICT. :-( I've just built Wget 1.7 on Linux and it seemed to download your problem file okay. So I don't know what your problem is either!
Re: cuj.com file retrieving fails -why?
On 3 Apr 2002 at 17:09, Markus Werle wrote: Ian Abbott wrote: On 3 Apr 2002 at 14:56, Markus Werle wrote: I've just built Wget 1.7 on Linux and it seemed to download your problem file okay. So I don't know what your problem is either! Ah! The kind of problem I like most! Did You have a special .wgetrc? Nothing special. $HOME/.wgetrc : robots = off system wgetrc : # Comments stripped out passive_ftp = on waitretry = 10
Re: spanning hosts
On 28 Mar 2002 at 18:01, Jens Rösner wrote: I came across a crash caused by a cookie two days ago. I disabled cookies and it worked. I'm hoping you had debug output on when it crashed, otherwise this is a different crash to the one I already know about. Can you confirm this, please? Yes, I had debug output on. Thanks for the confirmation. wget -nc -x -r -l0 -t10 -H -Dstory.de,audi -o example.log -k -d -R.gif,.exe,*tn*,*thumb*,*small* -F -i example.html Result with 1.8.1 and 1.7.1 with -nh: audistory.com: Only index.html audistory.de: Everything audi100-online: only the first page kolaschnik.de: only the first page Yes, that's how I thought it would behave. Any URLs specified on the command line or in a --include-file file are always downloaded regardless of the domain acceptance rules. Well, one page of a rejected URL is downloaded, not more. Whereas the only accepted domain audistory.de gets downloaded completely. Doesn't this differ from what you just said? Well I only said the URLs specified on the command line or by the --include-file option are always downloaded. I didn't intend this to be interpreted as also applying to URLs which Wget finds while examining the contents of the downloaded html files. At the moment, the domain acceptance/rejection checks are only performed when downloaded html files are examined for further URLs to be downloaded (for the --recursive and --page-requisites options), which is why it behaves as it does. Agreed! How about introducing wildcards like -Dbar.com behaves strictly: www.bar.com, www2.bar.com -D*bar.com behaves like now: www.bar.com, www2.bar.com, www.foobar.com -D*bar.com* gets www.bar.com, www2.bar.com, www.foobar.com, sex-bar.computer-dating.com That would leave current command lines operational and introduce many possibilities without (too much) fuss. Or have I overlooked anything here? It sounds like it should work okay. I'd prefer to let -Dbar.com also match fubar.com for compatibility's sake. If you wanted to match www.bar.com and www2.bar.com, but not www.fubar.com you could use -D.bar.com, but that wouldn't work if you wanted to match bar.com without the www (well, a leading . could be treated as a special case). It would be easiest and more consistent (currently) to use shell-globbing wildcards (as used for the file-acceptance rules) rather than grep/egrep-style wildcards.
Re: about wget and put
On 31 Mar 2002 at 14:23, ¶À«¾§ wrote: may I ask some question? do wget offer put function? (FTP put) No current version of wget offers this function. I need wget function, but reverse way, like put... can wget do it? or is there any tool offer this? There is a command-line tool called curl which can get and put by HTTP and FTP. There is another command-line program called lftp which will also do this. ps. I need put the newer or modified files, by automatically judge...like wget does... I don't think either program does that.
Re: wget parsing JavaScript
On 26 Mar 2002 at 19:33, Tony Lewis wrote: I wrote: wget is parsing the attributes within the script tag, i.e., <script src="url">. It does not examine the content between <script> and </script>. and Ian Abbott responded: I think it does, actually, but that is mostly harmless. You're right. What I meant was that it does not examine the JavaScript looking for URLs. It won't examine the file downloaded via <script src="ascript.js"> (unless the HTTP response claimed it had a MIME type of text/html for some reason!), but it will examine the contents between a <script> and a </script> tag. For example, a recursive retrieval on a page like this: <html> <body> <script> <a href="foo.html">foo</a> </script> </body> </html> will retrieve foo.html, regardless of the <script>...</script> tags.
Re: wget parsing JavaScript
On 26 Mar 2002 at 7:05, Tony Lewis wrote: Csaba Ráduly wrote: I see that wget handles SCRIPT with tag_find_urls, i.e. it tries to parse whatever it's inside. Why was this implemented ? JavaScript is most used to construct links programmatically. wget is likely to find bogus URLs until it can properly parse JavaScript. wget is parsing the attributes within the script tag, i.e., <script src="url">. It does not examine the content between <script> and </script>. I think it does, actually, but that is mostly harmless. I haven't heard of any cases where it has caused a problem (assuming the script is well-formed). It's normal good practice to hide the code in a HTML comment anyway, but perhaps that good practice is less common these days now that virtually every browser out there groks <SCRIPT></SCRIPT> and <NOSCRIPT></NOSCRIPT>. Wget's HTML parser doesn't yet have the hooks to allow different elements (such as SCRIPT and STYLE) to be processed differently to normal HTML. If it gets these hooks it could then go off and process the SCRIPT element differently. (The minimal processing for the SCRIPT element, if it is using an unsupported script language, would be to skip it.) If a future version of Wget were to handle JavaScript as an option (perhaps using the GPL'd SpiderMonkey), it would have to parse the default action of the script and also possibly exercise the various event handlers to gather more URLs. I guess this would fail on the more complicated scripts that expect some sort of intelligent being (or a suitably programmed robot) to fill in forms and/or press buttons in the correct sequence to progress to the next page!
Re: spanning hosts: 2 Problems
On 26 Mar 2002 at 19:01, Jens Rösner wrote: I am using wget to parse a local html file which has numerous links into the www. Now, I only want hosts that include certain strings like -H -Daudi,vw,online.de It's probably worth noting that the comparisons between the -D strings and the domains being followed (or not) are anchored at the ends of the strings, i.e. -Dfoo matches bar.foo but not foo.bar. Two things I don't like in the way wget 1.8.1 works on windows: The first page of even the rejected hosts gets saved. That sounds like a bug. This messes up my directory structure as I force directories (which is my default and normally useful). I am aware that wget has switched to breadth-first (as opposed to depth-first) retrieval. Now, with downloading from many (20+) different servers, this is a bit frustrating, as I will probably have the first completely downloaded site in a few days... Would that be less of a problem if the first problem (first page from rejected domains) was fixed? Is there any other way to work around this besides installing wget 1.6 (or even 1.5?) No, but note that if you pass several starting URLs to Wget, it will complete the first before moving on to the second. That also works for the URLs in the file specified by the --input-file parameter. However, if all the sites are interlinked, you would be no better off with this. The other alternative is to run wget several times in sequence with different starting URLs and restrictions, perhaps using the --timestamping or --no-clobber options to avoid downloading things more than once.
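The anchoring described above amounts to nothing more than a case-insensitive suffix comparison; roughly (a sketch, not Wget's actual matching code):

#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Return non-zero if DOMAIN ends with PATTERN (case-insensitively),
   i.e. the match is anchored at the end of the string, so "foo"
   matches "bar.foo" but not "foo.bar".  */
static int
domain_suffix_match (const char *pattern, const char *domain)
{
  size_t plen = strlen (pattern), dlen = strlen (domain);

  if (plen > dlen)
    return 0;
  return strcasecmp (domain + dlen - plen, pattern) == 0;
}

int
main (void)
{
  /* Prints "1 0". */
  printf ("%d %d\n",
          domain_suffix_match ("foo", "bar.foo"),
          domain_suffix_match ("foo", "foo.bar"));
  return 0;
}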
Re: OK, time to moderate this list
On 22 Mar 2002 at 4:08, Hrvoje Niksic wrote: The suggestion of having more than one admin is good, as long as there are people who volunteer to do it besides me. I'd volunteer too, but don't want to be the only person moderating the lists for the same reasons as yourself. (I'm also completely clueless about the process of moderating mailing lists at the moment!) I also have to check with the sunsite.dk people whether the ML manager, ezmlm, can handle this. If it only handles a single moderator account, perhaps a secure web-based email account could be set up for moderation purposes which the real moderators could log into on a regular basis.
Re: Wget and Symantec Web Security
On 19 Mar 2002 at 22:53, Löfstrand Thomas wrote: I use wget to get files from a FTP server. The proxy server is Symantecs web security 2.0 product for solaris which has a antivirus function. I have used wget with -d option to see what is going on, and it seems like the proxyserver returns the following response: X-PLEASE_WAIT. After reading the source code in http.c it seems like wget expects the answer from the proxy to be HTTP/ and a version number. Is there any easy way to bypass this response part or to make a little bit of coding so I can accept the X-PLEASE-WAIT String? Your proxy server has a broken HTTP implementation. Does this temporary patch to Wget 1.8.1 work around the problem? --- src/http.c.old Thu Mar 21 17:43:25 2002 +++ src/http.c Thu Mar 21 18:01:15 2002 @@ -949,6 +949,16 @@ if (hcount == 1) { const char *error; + + /* TEMPORARY PATCH */ + /* Check for broken Symantec Web Security proxy. */ + if (strncmp(hdr, "X-PLEASE_WAIT", 13) == 0) + { + hcount--; + goto done_header; + } + /* TEMPORARY PATCH */ + /* Parse the first line of server response. */ statcode = parse_http_status_line (hdr, &error); hs->statcode = statcode;
Re: wget1.8.1's patches for using the free Borland C++Builder compile r
On 12 Mar 2002 at 3:18, sr111 wrote: I have to modify some files in order to build win32 port of wget using the free Borland C++Builder compiler. Please refer to the attachment file for the details. I've modified Chin-yuan Kuo's patch for the current CVS. It builds fine with the free Borland C++Builder compiler. I also tried to build it with the Borland C++ Release 5.0 compiler but ran into problems compiling src/utils.c on the following lines 1499-1500: wt-wintime.HighPart = ft.dwHighDateTime; wt-wintime.LowPart = ft.dwLowDateTime; (Those errors were nothing to do with Chin-yuan Kuo's patch.) Chin-yuan's patch distances the support for Borland's compilers further away from the Release 5.0 (and earlier) compilers, but since the C++Builder compiler can be downloaded for free I don't think support for the older compilers is that much of an issue (apart from making a little more work for the MS-DOS porters). If there are no objections from the Win32 maintainers, I'll apply the updated patch to CVS tomorrow. Chin-yuan did not submit any ChangeLog entries, so here is my attempt at some: main ChangeLog entry: 2002-03-18 Chin-yuan Kuo [EMAIL PROTECTED] * configure.bat.in: Do not check %BORPATH% as C++Builder compiler does not use it. * windows/Makefile.src.bor: * windows/config.h.bor: Migrate to free C++Builder compiler. And here is the updated patch: Index: configure.bat.in === RCS file: /pack/anoncvs/wget/configure.bat.in,v retrieving revision 1.1 diff -u -r1.1 configure.bat.in --- configure.bat.in2002/03/13 19:47:26 1.1 +++ configure.bat.in2002/03/18 20:31:51 @@ -20,8 +20,7 @@ if .%1 == .--borland goto :borland if .%1 == .--msvc goto :msvc if .%1 == .--watcom goto :watcom -if not .%BORPATH% == . goto :borland -if not .%1 == . goto :usage +goto :usage :msvc copy windows\config.h.ms src\config.h nul @@ -58,5 +57,5 @@ goto :end :usage -echo Usage: Configure [--borland | --msvc | --watcom] +echo Usage: configure [--borland | --msvc | --watcom] :end Index: windows/Makefile.src.bor === RCS file: /pack/anoncvs/wget/windows/Makefile.src.bor,v retrieving revision 1.4 diff -u -r1.4 Makefile.src.bor --- windows/Makefile.src.bor2001/12/04 10:33:18 1.4 +++ windows/Makefile.src.bor2002/03/18 20:31:52 @@ -2,16 +2,16 @@ ## Makefile for use with watcom win95/winnt executable. CC=bcc32 -LINK=tlink32 +LINK=ilink32 LFLAGS= -CFLAGS=-DWINDOWS -DHAVE_CONFIG_H -I. -H -H=wget.csm -w- +CFLAGS=-DWINDOWS -DHAVE_CONFIG_H -I. 
-H -H=wget.csm -w- -O2 ## variables OBJS=cmpt.obj connect.obj fnmatch.obj ftp.obj ftp-basic.obj \ - ftp-ls.obj ftp-opie.obj getopt.obj headers.obj host.obj html.obj \ + ftp-ls.obj ftp-opie.obj getopt.obj headers.obj host.obj html-parse.obj html- url.obj \ http.obj init.obj log.obj main.obj gnu-md5.obj netrc.obj rbuf.obj \ - alloca.obj \ + safe-ctype.obj hash.obj progress.obj gen-md5.obj cookies.obj \ recur.obj res.obj retr.obj url.obj utils.obj version.obj mswindows.obj LIBDIR=$(MAKEDIR)\..\lib @@ -20,7 +20,9 @@ $(LINK) @| $(LFLAGS) -Tpe -ap -c + $(LIBDIR)\c0x32.obj+ -alloca.obj+ +cookies.obj+ +hash.obj+ +safe-ctype.obj+ version.obj+ utils.obj+ url.obj+ @@ -37,7 +39,8 @@ log.obj+ init.obj+ http.obj+ -html.obj+ +html-parse.obj+ +html-url.obj+ host.obj+ headers.obj+ getopt.obj+ Index: windows/config.h.bor === RCS file: /pack/anoncvs/wget/windows/config.h.bor,v retrieving revision 1.3 diff -u -r1.3 config.h.bor --- windows/config.h.bor2001/11/29 14:15:10 1.3 +++ windows/config.h.bor2002/03/18 20:31:52 @@ -19,6 +19,10 @@ #ifndef CONFIG_H #define CONFIG_H +#define HAVE_MEMMOVE +#define ftruncate chsize +#define inline __inline + /* Define if you have the alloca.h header file. */ #undef HAVE_ALLOCA_H @@ -33,7 +37,7 @@ #pragma alloca # else # ifndef alloca /* predefined by HP cc +Olibcalls */ -char *alloca (); +#include malloc.h # endif # endif # endif @@ -177,7 +181,7 @@ #define HAVE_BUILTIN_MD5 1 /* Define if you have the isatty function. */ -#undef HAVE_ISATTY +#define HAVE_ISATTY #endif /* CONFIG_H */
(Fwd) Proposed new --unfollowed-links option for wget
This seems more appropriate for the main Wget list. The wget-patches list is for patches! --- Forwarded message follows --- From: Tony Lewis [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: Proposed new --unfollowed-links option for wget Date sent: Thu, 7 Mar 2002 23:41:15 -0800 Last night I was roaming through Google looking for a program to let me grab chunks of a web site and found a reference to wget. After reading the manual, I downloaded and built it and found that it does almost everything I needed. There are two features that I need that are missing. One of them is getting a list of the links that were not followed by wget. (The other is the subject of another message.) I have skimmed a few GNU programs in the past and found the source for wget pretty easy to follow. I was able to implement this feature today by adding the following command line argument: -u, --unfollowed-links=FILE log unfollowed links to FILE. Having used the option on a couple of sites that I maintain, I have already found it very useful. For example: after running "wget --mirror -uexternal http://www.mysite.com", I have a list of all the external references made by my site in the file 'external'. Unfortunately, I made all my changes directly to the distribution sources before I stumbled across the long list of instructions for using CVS. Before I redo the changes following the CVS route, I'd like to know a little bit more about the process for getting a submission approved for inclusion in a future version (particularly in light of this change grabbing one of the seven -- by my count -- remaining single-letter command line arguments). Also, is there some sort of regression test suite that I should run? Tony Lewis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --- End of forwarded message ---
(Fwd) Processing of JavaScript
--- Forwarded message follows --- From: Tony Lewis [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject:Processing of JavaScript Date sent: Fri, 8 Mar 2002 00:04:43 -0800 Some web sites include URL references within JavaScript. Poorly designed sites (including one of my own, I must confess) build significant site navigation features in script. Has anyone thought about what it would take to have wget parse the JavaScript looking for urls? I have looked briefly (very briefly) at SpiderMonkey and I suspect it could be integrated with wget. Thoughts? Tony Lewis PS) I'm done for tonight! ;-) --- End of forwarded message ---
(Fwd) Automatic posting to forms
--- Forwarded message follows --- From: Tony Lewis [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: Automatic posting to forms Date sent: Thu, 7 Mar 2002 23:43:28 -0800 As promised in my earlier note, there is a second feature I'm looking for in wget. This feature is the ability to automatically post to forms. I'm thinking of something along the lines of a command line argument like: --auto-post=FILE where FILE would contain data such as:

form=/cgi-bin/auth.cgi
name=id value=tony
name=pw value=password

With this information, any time that wget encounters a form whose action is /cgi-bin/auth.cgi, it will enqueue the submission of the form using the values provided for the fields id and pw. Before I go too deep into making this change, I'd like some feedback. I know that I will need to change:
- get_urls_html to look for a FORM tag whose action attribute matches the auto-post file
- retrieve_tree to be able to POST as well as GET
- main and initialize to deal with the new command line argument
Is there anything else that seems obvious that I'm overlooking? Any cautions about the sections of code I'll be working with? Tony Lewis - To unsubscribe, e-mail: wget-patches- [EMAIL PROTECTED] For additional commands, e-mail: wget-patches- [EMAIL PROTECTED] --- End of forwarded message ---
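[Editor's sketch: a rough idea of how such a FILE could be read. The file layout and the function name are just the ones proposed in the message above; nothing like this exists in Wget today.]

  #include <stdio.h>
  #include <string.h>

  /* Print the form action and the field/value pairs found in the
     proposed --auto-post FILE, assuming the "form=" / "name=... value=..."
     layout shown in the message above.  */
  static void
  dump_auto_post_file (const char *path)
  {
    char line[1024], field[256], value[256];
    FILE *fp = fopen (path, "r");
    if (!fp)
      return;
    while (fgets (line, sizeof line, fp))
      {
        line[strcspn (line, "\r\n")] = '\0';   /* strip the line ending */
        if (strncmp (line, "form=", 5) == 0)
          printf ("form action: %s\n", line + 5);
        else if (sscanf (line, "name=%255s value=%255s", field, value) == 2)
          printf ("  %s = %s\n", field, value);
      }
    fclose (fp);
  }

A real implementation would of course store the entries in a structure that the form-matching code could consult rather than printing them.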
Re: reading HTML input-files (WITH ATTACHMNT!)
On 8 Mar 2002 at 10:50, Mathias Kratzer wrote: I admit that the lines in my original file contain a really stupid syntax error. As an absolute beginner with the Markup Languages I have just tried to learn from some hyperlink examples but obviously misunderstood their formal structure. Nevertheless, Wget 1.5.2 did recognize my URLs! Well, as you noted, the HTML parser was rewritten for Wget 1.7, so it is not too surprising that it would behave differently for erroneous input! So does Wget 1.7 after I've changed the lines to SGML format. However, I feel obliged to inform you that XML format didn't solve the problem. Ah yes, the XML (XHTML) form was not supported until Wget 1.8 or 1.8.1 (I can't remember which, and can't be arsed to find out at the moment!).
Re: reading HTML input-files (WITH ATTACHMNT!)
On 7 Mar 2002 at 17:50, Mathias Kratzer wrote: While calling Wget 1.5.2 by wget -F -O 69_4_522_Ref.res -i 69_4_522_Ref.mrq on the attached file 69_4_522_Ref.mrq has worked very well I am left with the error message No URLs found in 69_4_522_Ref.mrq whenever I try the same command using Wget 1.7. Even embedding the content of 69_4_522_Ref.mrq into a HTML4 frame (i.e. DOCTYPE-header, html-, head- and body-tags) did not help. Can you tell me what I am doing wrong? The file 69_4_522_Ref.mrq contains several lines of the form: <a href="url"</a> which looks pretty invalid to me. Perhaps you need to change them to: <a href="url"/> (XML format) or: <a href="url"></a> (SGML format)
Re: retr.c:253: calc_rate: Assertion `msecs >= 0' failed.
On 6 Mar 2002 at 12:43, Mats Palmgren wrote: I have a cron job that downloads Mozilla every night using wget. Last night I got: wget: retr.c:253: calc_rate: Assertion `msecs >= 0' failed. I think this can happen if the system time is reset backwards while wget is downloading stuff.
Re: wget info page
On 20 Feb 2002 at 12:54, Noel Koethe wrote: wget 1.8.1 is shipped with the files in doc/ wget.info wget.info-1 wget.info-2 wget.info-3 wget.info-4 They are build out of wget.texi if I remove them and makeinfo is installed. The files are removed when runing make realclean. I think they should/could also removed when runing make distclean, or am I missing an important point? Perhaps they are included in the distribution in case the system does not have the tools to rebuild them? However, the presence of wget.info* in the distribution does seem inconsistent with the absence of the wget.1 manpage file.
No clobber and .shtml files
Here is a patch for a potential feature change. I'm not sending it to the wget-patches list yet, as I'm not sure if it should be applied as is, or at all. The feature change is a minor amendment to the (bogus) test for whether or not an existing local copy of a file is text/html when the --noclobber option is used, based on its suffix. The current test assumes the local file is text/html if it has a suffix of "html" or "htm". The amendment made by this patch includes suffixes of the form "shtml", "phtml", etc. in the set of suffixes assumed to indicate text/html files. As it stands, the new test treats any "?html" suffix (where ? matches a single character) as indicating a text/html file. Perhaps this test should be tightened up to only allow a letter rather than any character in this position. I didn't bother testing for "?htm", as I've never seen it and can't think why anyone would want to use it. (However, I do recall seeing suffixes such as "sht" before now, i.e. "shtml" truncated to 3 characters, but perhaps that's going too far.) Any comments?

Index: src/http.c
===
RCS file: /pack/anoncvs/wget/src/http.c,v
retrieving revision 1.85
diff -u -r1.85 http.c
--- src/http.c 2002/02/19 05:18:43 1.85
+++ src/http.c 2002/02/20 19:25:34
@@ -1462,8 +1462,10 @@
       /*  Bogusness alert.  */
       /* If its suffix is "html" or "htm", assume text/html.  */
-      if (((suf = suffix (*hstat.local_file)) != NULL)
-          && (!strcmp (suf, "html") || !strcmp (suf, "htm")))
+      /* Also assume text/html if its suffix is "shtml", "phtml", etc.  */
+      if (((suf = suffix (*hstat.local_file)) != NULL) && *suf
+          && (!strcmp (suf, "html") || !strcmp (suf, "htm")
+              || !strcmp(suf+1, "html")))
        *dt |= TEXTHTML;

       FREE_MAYBE (dummy);
Re: wget bug?!
[The message I'm replying to was sent to [EMAIL PROTECTED]. I'm continuing the thread on [EMAIL PROTECTED] as there is no bug and I'm turning it into a discussion about features.] On 18 Feb 2002 at 15:14, TD - Sales International Holland B.V. wrote: I've tried -w 30 --waitretry=30 --wait=30 (I think this one is for multiple files and the time in between those though) None of these seem to make wget wanna wait for 30 secs before trying again. Like this I'm hammering the server. The --waitretry option will wait for 1 second for the first retry, then 2 seconds, 3 seconds, etc. up to the value specified. So you may consider the first few retry attempts to be hammering the server but it will gradually back off. It sounds like you want an option to specify the initial retry interval (currently fixed at 1 second), but Wget currently has no such option, nor an option to change the amount it increments by for each retry attempt (also currently fixed at 1 second). If such features were to be added, perhaps it could work something like this:

--waitretry=n     - same as --waitretry=n,1,1
--waitretry=n,m   - same as --waitretry=n,m,1
--waitretry=n,m,i - wait m seconds for the first retry, incrementing by i seconds for subsequent retries up to a maximum of n seconds

The disadvantage of doing it that way is that no-one will remember which order the numbers should appear, so an alternative is to leave --waitretry alone and supplement it with --waitretryfirst and --waitretryincr options.
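[Editor's note: purely for illustration, the proposed n,m,i behaviour amounts to something like the helper below. This is a sketch of the proposal only; the option does not exist and this is not Wget code.]

  /* Wait m seconds for the first retry and add i seconds for each
     subsequent retry, never exceeding n seconds.  retry_count >= 1.  */
  static long
  retry_wait (int retry_count, long n, long m, long i)
  {
    long wait = m + (long) (retry_count - 1) * i;
    return wait > n ? n : wait;
  }

The current behaviour corresponds to calling this with m = 1 and i = 1.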
Re: wget crash
On 14 Feb 2002 at 16:02, Steven Enderle wrote: Sorry for not including any version information. This is version 1.8.1, which I am using. Sorry for not reading your bug report properly. I should have realised that this was a different bug to the hundreds (it seems!) of other reports about assertion failures in progress.c.
Re: wget crash
On 14 Feb 2002 at 10:41, Steven Enderle wrote: assertion percentage = 100 failed: file progress.c, line 552 zsh: abort (core dumped) wget -m -c --tries=0 ftp://ftp.scene.org/pub/music/artists/nutcase/mp3/timeofourlives.mp3 hope this helps in any way. Thanks for the report. That's a known bug in Wget 1.8 that is fixed in Wget 1.8.1.
Re: wget 1.8.x proxies
On 12 Feb 2002 at 12:30, Holger Pfaff wrote: I'm having trouble using wget 1.8.[01] over a (squid24-) proxy to mirror a ftp-directory: # setenv ftp_proxy http://139.21.68.25: # wget181 -r -np -l0 ftp://ftp.funet.fi/pub/Linux/mirrors/redhat/redhat/linux/updates --12:06:58-- ftp://ftp.funet.fi/pub/Linux/mirrors/redhat/redhat/linux/updates = `ftp.funet.fi/pub/Linux/mirrors/redhat/redhat/linux/updates' Connecting to 139.21.68.25:... connected. Proxy request sent, awaiting response... 200 OK Length: unspecified [text/html] [ = ] 3,665 3.50M/s 12:06:58 (3.50 MB/s) - `ftp.funet.fi/pub/Linux/mirrors/redhat/redhat/linux/updates' saved [3665] I've never tried wget through an http-based ftp proxy. Are there any clues in the file it wrote (presumably a html-format directory listing)? Are there any more clues if you use the -d (--debug) option?
Re: wget 1.8.x proxies
On 12 Feb 2002 at 7:54, Winston Smith wrote: # wget181 -r -np -l0 ftp://ftp.funet.fi/pub/Linux/mirrors/redhat/redhat/linux/updates ummm... looks like the -l0 might be limiting your recursion level to 0 levels No. '-l0' is the same as '-l inf'.
Re: KB or kB
On 8 Feb 2002 at 4:26, Fred Holmes wrote: At 02:54 AM 2/8/2002, Hrvoje Niksic wrote: Wget currently uses KB as abbreviation for kilobyte. In a Debian bug report someone suggested that kB should be used because it is more correct. The reporter however failed to cite the reference for this, and a search of the web has proven inconclusive. Well, certainly among physicists, the k for kilo = x1000 is lower case. Consult any style manual for writing articles in scholarly physics journals. Of course, computer folks do as they please. g Not just amongst physicists, k is the standard prefix for kilo, at least when kilo means 10^3 (=1000). Think km = kilometer (or kilometre), kg = kilogram (or kilogramme), etc. This does not really apply to computer usage where typically kilo has been overloaded to mean 2^10 (=1024) because it happens to be close enough to its more correct meaning. That's why K is often used to mean 2^10 to avoid confusion with k. (But as has been pointed out, this confusion persists for M, G, T, etc.) I'd suggest either leaving them alone or adopting the IEC standards that Henrik referred to, i.e. KiB = kibibyte = 2^10 bytes, MiB = mebibyte = 2^20 bytes, etc. Of course, that would likely produce asserts in progress.c ;-)
Re: @ sign in username
On 4 Feb 2002 at 15:21, Christian Busch wrote: Hello, i have a question. On a ftp-site that we need to mirror, our login is wget -cm ftp://christian.busch%40brainjunction.de:**xx**@esd.intraware.com/ as you see I tried to encode the @ as %40 as described in the manual. This does not work, is there any way to encode the @ in the username ? No, but does the following work? wget -cm -e [EMAIL PROTECTED] ftp://esd.intraware.com/ FYI, there was no need to forward your the message to [EMAIL PROTECTED] unless you were submitting a bug report. All the traffic sent to [EMAIL PROTECTED] ends up on the [EMAIL PROTECTED] list anyway.
HTTP/1.1 (was Re: timestamping content-length --ignore-length)
On 1 Feb 2002 at 8:17, Daniel Stenberg wrote: You may count this mail as advocating for HTTP 1.1 support, yes! ;-) I did write down some minimal requirements for HTTP/1.1 support on a scrap of paper recently. It's probably still buried under the more recent strata of crap on my desk somewhere! I know chunked encoding support was one of the requirements, but I can't remember any others I wrote down. It was probably an incomplete list anyway! HTTP/1.1 support would also allow gzip and deflate encodings etc. to be added as configurable options later. Once HTTP/1.1 support was working reliably, it ought to be made the default, with command-line or .wgetrc options to fall back to sending HTTP/1.0 requests.
Re: Downloading all files by http:
On 31 Jan 2002 at 9:25, Fred Holmes wrote: wget -N http://www.karenware.com/progs/*.* fails with a not found whether the filespec is * or *.* The * syntax works just fine with ftp Is there a syntax that will get all files with http? You could try wget -m -l 1 -n http://www.karenware.com/progs/ but it will only do what you want if the web server sends back a HTML-format directory listing (complete with links to each file), rather than some other document.
Re: timestamping content-length --ignore-length
On 31 Jan 2002 at 8:41, Bruce BrackBill wrote: The problem is, that my web pages are served up by php and the content length is not defined. So as the manual states I use --ignore-length. But when wget retrieves an image it slows right down, possibly because it is ignoring the content-length. Maybe an option to ignore the content length of certain file types ( say text/html ) would be an option for upcoming releases of wget. The problem is that wget uses persistent connections by default if the server supports them. As you are using --ignore-length, wget must wait to see whether more data will arrive while the connection is open. The persistent connection is closed by the server after a timeout - as far as it is concerned, it has already completed the request and is waiting for a new request to re-use the same connection. This timeout is what is causing the delays you are seeing. You can tell wget not to allow persistent connections using the --no-http-keep-alive option, which should speed things up in your case. By the way, have you tried it without the --ignore-length option to see if it works? Perhaps the manual ought to mention the undesirability of using --ignore-length with persistent connections.
Re: timestamping content-length --ignore-length
On 31 Jan 2002 at 9:48, Bruce BrackBill wrote: Thanks for your responce Ian. When I use it without --ignore-length option it appears that wget SOMETIMES ignores the last_modified_date OR wget says to itself ( hey, I see the file is older than the local copy, but hey, since the server isn't sending me a content_length i'm just going to download it again anyway :-). According the the manual ( as I read it ) wget should ALWAYS reget the file if it has an empty content length ( even though this is undesirable behavior ). Sorry I ignored the timestamping part of your question. My answer only addressed the delays you were getting. It depends on the SOMETIMES. Can you provide a sample debug output log (-d) and point out where you think wget is not behaving like you want it to? Also, you haven't mentioned which version of wget you are using yet. Wget should behave exactly the same for --ignore-length as it does when there is no Content-Length header, and as far as I can see from the source code, it does. If no Content-Length header was received, or it was ignored then only the timestamps are compared. Although the manual says that the Content-Length is used as an additional check, it fails to mention that that only applies when the Content-Length header exists and the --ignore-length option has not been used. 2) In the php scripts I send out last_modified_date 3) php does not send content_length ( and I don't do it either in the script ) In that case, Wget's timestamping retrieval decision is based solely on the Last-Modified header, regardless of whether you use --ignore-length or not. A debug log would help confirm this.
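[Editor's note: to spell out the decision described above, here is a simplified model only -- the real logic lives in http_loop() and has more cases; the function below is not Wget code.]

  #include <time.h>

  /* Re-fetch when the remote copy is newer, and use the size comparison
     only when a Content-Length was received and --ignore-length was not
     given (remote_size is -1 otherwise).  */
  static int
  should_refetch (time_t remote_mtime, time_t local_mtime,
                  long remote_size, long local_size)
  {
    if (remote_mtime > local_mtime)
      return 1;
    if (remote_size != -1 && remote_size != local_size)
      return 1;
    return 0;
  }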
Re: Bug report: 1) Small error 2) Improvement to Manual
On 17 Jan 2002 at 2:15, Hrvoje Niksic wrote: Michael Jennings [EMAIL PROTECTED] writes: WGet returns an error message when the .wgetrc file is terminated with an MS-DOS end-of-file mark (Control-Z). MS-DOS is the command-line language for all versions of Windows, so ignoring the end-of-file mark would make sense. Ouch, I never thought of that. Wget opens files in binary mode and handles the line termination manually -- but I never thought to handle ^Z. Why not just open the wgetrc file in text mode using fopen(name, "r") instead of "rb"? Does that introduce other problems? In the Windows C compilers I've tried (Microsoft and Borland ones), "r" causes the file to be opened in text mode by default (there are ways to override that at compile time and/or run time), and this causes the ^Z to be treated as an EOF (there might be ways to override that too).
Re: Bug report: 1) Small error 2) Improvement to Manual
On 21 Jan 2002 at 14:56, Thomas Lussnig wrote: Why not just open the wgetrc file in text mode using fopen(name, "r") instead of "rb"? Does that introduce other problems? I think it has to do with comments, because the definition is that everything from '#' to the end of the line is ignored, and a line ends with '\n' or the end of the file, not with a special character like '\0'; that means to me that aborting the reading of a text file when a zero is found would be incorrect parsing. (N.B. the control-Z character would be '\032', not '\0'.) So maybe just mention in the documentation that the wgetrc file is considered to be a plain text file, whatever that means for the system Wget is running on. Maybe mention peculiarities of DOS/Windows, etc. In general, it is more portable to read or write native text files in text mode as it performs whatever local conversions are necessary to make reads and writes of text files appear like UNIX (i.e. each line of text terminated by a newline '\n'). In binary mode, what you get depends on the system (Mac text files have lines terminated by carriage return ('\r') for example, and some systems (VMS?) don't even have line termination characters as such.) In the case of Wget, log files are already written in text mode. I think wgetrc needs to be read in text mode and that's an easy change. In the case of the --input-file option, ideally the input file should be read in text mode unless the --force-html option is used, in which case it should be read in the same mode as when parsing other locally-stored HTML files. Wget stores retrieved files in binary mode but the mode used when reading those locally-stored files is less precise (not that it makes much difference for UNIX). It uses open() (not fopen()) and read() to read those files into memory (or uses mmap() to map them into memory space if supported). The DOS/Windows version of open() allows you to specify text or binary mode, defaulting to text mode, so it looks like the Windows version of Wget saves html files in binary mode and reads them back in in text mode! Well whatever - the HTML parser still seems to work okay on Windows, probably because HTML isn't that fussy about line-endings anyway! So to support --input-file portably (not the --force-html version), the get_urls_file() function in url.c should probably call a new function read_file_text() (or read_text_file()) instead of read_file() as it does at the moment. For UNIX-type systems, that could just fall back to calling read_file(). The local HTML file parsing stuff should probably be left well alone, but possibly add some #ifdef code for Windows to open the file in binary mode, though there may be differences between compilers for that.
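[Editor's note: to make the text-mode point concrete, a small standalone sketch of the DOS/Windows C runtime behaviour described above; the function is hypothetical and not part of Wget.]

  #include <stdio.h>

  /* Count the lines of a configuration file.  Opened with "r" (text mode
     on DOS/Windows), the runtime translates CRLF to '\n' and treats a
     trailing ^Z (0x1A) as end-of-file, so the caller never sees the stray
     control character; opened with "rb" it would.  */
  static int
  count_lines (const char *name)
  {
    FILE *fp = fopen (name, "r");
    int c, lines = 0;
    if (!fp)
      return -1;
    while ((c = getc (fp)) != EOF)
      if (c == '\n')
        ++lines;
    fclose (fp);
    return lines;
  }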
Re: Passwords and cookies
On 17 Jan 2002 at 18:17, Hrvoje Niksic wrote: Ian Abbott [EMAIL PROTECTED] writes: I'm also a little worried about the (time_t *)&cookie->expiry_time cast, as cookie->expiry_time is of type unsigned long. Is a time_t guaranteed to be the same size as an unsigned long? It's not, but I have a hard time imagining an architecture where time_t will be *larger* than unsigned long. I received an email from Csaba Ráduly which I hope he won't mind me quoting here: On 17 Jan 2002 at 12:45, [EMAIL PROTECTED] wrote: Very few may care, but IBM's C/C++ compilers v 3.6.5 typedef time_t as ... double ! Shouldn't cookie->expiry_time be declared as time_t ?
Re: Passwords and cookies
On 16 Jan 2002 at 17:50, Hrvoje Niksic wrote: Wget's strptime implementation comes from an older version of glibc. Perhaps we should simply sync it with the latest one from glibc, which is obviously capable of handling it? That sounds like a good plan.
Re: Passwords and cookies
On 16 Jan 2002 at 17:45, Hrvoje Niksic wrote: Aside from google, ~0UL is Wget's default value for the expiry time, meaning the cookie is non-permanent and valid throughout the session. Since Wget sets the value, Wget should be able to print it in DEBUG mode. Do you think this patch would fix the printing problem:

Index: src/cookies.c
===
RCS file: /pack/anoncvs/wget/src/cookies.c,v
retrieving revision 1.18
diff -u -r1.18 cookies.c
--- src/cookies.c 2001/12/10 02:29:11 1.18
+++ src/cookies.c 2002/01/16 16:43:21
@@ -241,7 +241,9 @@
           cookie->domain, cookie->port, cookie->path,
           cookie->permanent ? "permanent" : "nonpermanent",
           cookie->secure,
-          asctime (localtime ((time_t *)&cookie->expiry_time)),
+          (cookie->expiry_time != ~0UL ?
+           asctime (localtime ((time_t *)&cookie->expiry_time))
+           : "UNKNOWN"),
           cookie->attr, cookie->value));
 }

Yes, except for any other values of cookie->expiry_time that would cause localtime() to return a NULL pointer (in the case of Windows, anything before 1970). Perhaps the return value of localtime() should be checked before passing it to asctime() as in the modified version of your patch I have attached below. I'm also a little worried about the (time_t *)&cookie->expiry_time cast, as cookie->expiry_time is of type unsigned long. Is a time_t guaranteed to be the same size as an unsigned long?

Index: src/cookies.c
===
RCS file: /pack/anoncvs/wget/src/cookies.c,v
retrieving revision 1.18
diff -u -r1.18 cookies.c
--- src/cookies.c 2001/12/10 02:29:11 1.18
+++ src/cookies.c 2002/01/17 11:29:00
@@ -184,6 +184,9 @@
   struct cookie *chain_head;
   char *hostport;
   char *chain_key;
+#ifdef DEBUG
+  struct tm *local_expiry;
+#endif

   if (!cookies_hash_table)
     /* If the hash table is not initialized, do so now, because we'll
@@ -241,7 +244,10 @@
           cookie->domain, cookie->port, cookie->path,
           cookie->permanent ? "permanent" : "nonpermanent",
           cookie->secure,
-          asctime (localtime ((time_t *)&cookie->expiry_time)),
+          (cookie->expiry_time != ~0UL
+           && NULL != (local_expiry = localtime ((time_t *)&cookie->expiry_time))
+           ? asctime (local_expiry)
+           : "UNKNOWN"),
           cookie->attr, cookie->value));
 }
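[Editor's note: for illustration, a standalone fragment showing the failure mode being guarded against; this is only a sketch of the Windows CRT behaviour described above, not Wget code.]

  #include <stdio.h>
  #include <time.h>

  int
  main (void)
  {
    time_t t = (time_t) -1;           /* a pre-1970 value on Windows */
    struct tm *tm = localtime (&t);   /* Windows CRTs return NULL here */
    if (tm != NULL)
      fputs (asctime (tm), stdout);
    else
      puts ("UNKNOWN");               /* never pass NULL on to asctime() */
    return 0;
  }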
RE: Mapping URLs to filenames
On 16 Jan 2002 at 8:02, David Robinson (AU) wrote: In the meantime, however, '?' is problematic for Win32 users. It stops WGET from working properly whenever it is found within a URL. Can we fix it please. My proposal for using escape sequences in filenames for problem characters is up for discussion at the moment, but I'm not sure if they really need to be reversible (except that it helps to reduce the chances of different URLs being saved to the same filename). Would it be sufficient to map all illegal characters to '@'? For Windows, the code already changes '%' to '@' and it could just as easily change '*', '?', etc. to '@' as well.
Re: Passwords and cookies
On 15 Jan 2002 at 14:48, Brent Morgan wrote: Thanks to everyone for looking at this problem. I am not a developer and at my wits end with this problem. I did determine with a different cookie required site that it is still not working. Could you change line 1017 of cmpt.c to read as follows: get_number (0, 2038); (i.e. change 2036 to 2038). Then recompile. That might be enough to stop the wget from crashing with the -d option. If debugging now works, can you supply some debug log output for your Set-Cookie problem? I will keep my eye for future windows compilations and keep trying. That relies on having decent information to debug the problem.
A strange bit of HTML
I came across this extract from a table on a website: <td ALIGN=CENTER VALIGN=CENTER WIDTH=120 HEIGHT=120><a href="66B27885.htm" "msover1('Pic1','thumbnails/MO66B27885.jpg');" onMouseOut="msout1('Pic1','thumbnails/66B27885.jpg');"><img SRC="thumbnails/66B27885.jpg" NAME="Pic1" BORDER=0></a></td> Note the string beginning "msover1(", which seems to be an attribute value without a name, so that makes it illegal HTML. I haven't traced what Wget is actually doing when it encounters this, but it doesn't treat 66B27885.htm as a URL to be downloaded. I can't call this a bug, but is Wget doing the right thing by ignoring the href altogether?
Re: Passwords and cookies
On 15 Jan 2002 at 0:27, Hrvoje Niksic wrote: Brent Morgan [EMAIL PROTECTED] writes: The -d debug option crashes wget just after it reads the input file. Huh? Ouch! Wget on Windows is much less stable than I imagined. Can you run it under a debugger and see what causes the crash? I had a go at building wget 1.8.1 myself on Windows 2000 with VC 6.0 and also got the crash when using the -d option, so I upgraded to VC 6.0 SP2 and it did the same thing. I've narrowed it down to the following line in cookies.c: asctime (localtime ((time_t *)&cookie->expiry_time)), which is part of a DEBUGP macro call from function store_cookie. Specifically, it was failing on the asctime call, rather than the localtime call, but that's as far as I got. A casual glance at the C runtime library source supplied with the compiler revealed no obvious problem, but I'll try and investigate this problem a bit more.
Mapping URLs to filenames
This is an initial proposal for naming the files and directories that Wget creates, based on the URLs of the retrieved documents. At the moment there are many complaints about Wget failing to save documents which have '?' in their URLs when running under Windows, for example. In general, the set of illegal characters in file-names depends on the operating system and the file-system in use. Wget can be compiled for different operating systems, but doesn't know which file-system is being used - you may get the oddball who wants to save files to a vfat file-system from Linux for example! Therefore, there should be some way to override or augment the set of illegal filename characters using a wgetrc command, for example. File-names used within the internals of Wget need to be converted to an external form which deals with illegal characters or illegal sequences of characters in the file-name. The internal filename consists of directory separators ('/'), illegal characters, a nominated 'escape' character and other (legal) characters. Illegal characters in the internal file-name can be mapped to an escape sequence in the external file-name, consisting of the escape character followed by two hex digits (it is assumed that both the escape character and the hex digits are legal file-name characters for the operating system and file-system in use!). Escape characters in the internal file-name can be mapped to an escape sequence in the same way. The directory separator character ('/') in the internal file-name is usually mapped to the directory hierarchy on the file-system, but if the internal file-name contains two or more consecutive directory separator characters, some of these will need to be escaped to avoid trying to create directories with null names. (An alternate solution is to create a directory whose name consists solely of a single escape character.) The external file-names are easily reversible back to the internal form when necessary. The obvious candidate for the escape character is the '%' character, although the escape mechanism for file-names is logically distinct from the escape mechanism for HTTP. The current version of Wget for Windows remaps all '%' characters to '@', so perhaps '@' is a better candidate for the escape character for Windows. (I'm not sure why Wget does this, as '%' seems to be a legal file-name character for Windows and MS-DOS. Perhaps it is for usability reasons due to the command shell's variable interpolation of '%name%' sequences.) The escape character can be made operating system dependent, and perhaps could be overridden with a wgetrc command. That's my initial proposal anyway. I'm not sure about things such as how UTF-8 should be handled, or if that's an issue at all.
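[Editor's note: a minimal sketch of the escaping proposal above. The escape character, the illegal-character set and the function name are all assumptions for illustration; this is not existing Wget code, and it ignores the directory-separator handling discussed in the proposal.]

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Map each illegal character (and the escape character itself) to the
     escape character followed by two hex digits.  */
  static char *
  escape_filename (const char *in, const char *illegal, char esc)
  {
    /* Worst case: every input character expands to three bytes.  */
    char *out = malloc (3 * strlen (in) + 1);
    char *p = out;
    if (!out)
      return NULL;
    for (; *in; ++in)
      {
        if (*in == esc || strchr (illegal, *in))
          p += sprintf (p, "%c%02X", esc, (unsigned char) *in);
        else
          *p++ = *in;
      }
    *p = '\0';
    return out;
  }

For example, escape_filename ("foo?bar*baz", "?*", '@') would produce "foo@3Fbar@2Abaz", and the mapping is reversible because the escape character never appears unescaped in the output.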
Re: 2 Gb limitation
On 10 Jan 2002 at 17:09, Matt Butt wrote: I've just tried to download a 3Gb+ file (over a network using HTTP) with WGet and it died at exactly 2Gb. Can this limitation be removed? In principle, changes could be made to allow wget to be configured for large file support, by using the appropriate data types (i.e. 'off_t' instead of 'long'). The logging code would be more complicated as there is no portable way to handle the data type in a printf-style function, so these would have to be converted to strings by a bespoke routine and the converted strings passed to the printf-style function. This would also slow down the operation of wget a little bit. A version of wget configured for large file support would also be slower in general than a version not configured for large file support - at least on a 32-bit machine. Large file support should probably be added to the TODO list at least. Quite a few people use wget to download .iso images of CD-ROMs at the moment; in the future, those same people are likely to want to use wget to download DVD-ROM images!
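[Editor's note: a rough illustration of the kind of bespoke conversion routine mentioned above. The function name is made up and Wget has no such routine; the sketch assumes non-negative values only.]

  #include <sys/types.h>

  /* Format the non-negative large-file offset N into BUF (of size
     BUFSIZE) and return a pointer to the first digit, avoiding any
     printf-style length modifier for the off_t type.  */
  static char *
  off_t_to_string (off_t n, char *buf, int bufsize)
  {
    char *p = buf + bufsize;
    *--p = '\0';
    do
      {
        *--p = '0' + (int) (n % 10);
        n /= 10;
      }
    while (n > 0 && p > buf);
    return p;
  }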
Re: Using -pk, getting wrong behavior for frameset pages...Suggestions?
On 11 Jan 2002 at 10:51, Picot Chappell wrote: Thanks for your response. I tried the same command, using your URL, and it worked fine. So I took a look at the site I was retrieving for the failed test. It's a ssl site (didn't think about it before) and I noticed 2 things. The Frame source pages were not downloaded (they were for www.mev.co.uk) and the links were converted to full URLs. ie. FRAME src=menulayer.cgi. became FRAME src=https://www.someframed.page/menulayer.cgi; ... So the content was still reachable, but not really local (this is the original problem). I tried it without the --convert-links, and the frame source remained defined as menulayer.cgi but menulayer.cgi was not downloaded. Do you think this might be an issue with framesets and ssl sites? or an issue with framesets and cgi source files? Do you have SSL support compiled in? Also it is possible that the .cgi script on the server is checking HTTP request headers and cookies, doesn't like what it sees and is returning an error. It is sometimes useful to lie to the server about the HTTP user agent using the -U option, e.g.: -U Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0) or include something similar in the wgetrc file: useragent = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0) Some log entries would be useful, particularly with the -d option. You can mask any sensitive bits of the log if you want.
Re: Simplest logfile ?
On 8 Jan 2002 at 20:31, Mike wrote: What I'm looking for is something like the way FTP_Lite operates, Can I nominate a single log file in the wgetrc for use by all the wget processes that spawn off from my bash ? There is the -a FILE (--append-output=FILE) option to append to a logfile. A combination of -b, -nv and -a FILE should do more or less what you want. It may be possible for the log file to become mangled if more than one wget process writes to the log file at the same time, but I don't think that will be a problem with -nv logging unless the lines are very long (due to downloading from a humungous URL, say).
Re: Asseertion failed in wget
On 7 Jan 2002 at 11:52, Jan Starzynski wrote: for GNU Wget 1.8 I get the following assertion failed message: get: progress.c:673: create_image: Zusicherung »p - bp->buffer <= bp->width« nicht erfüllt. (snip) In the changelogs of 1.8.1 I could not find a hint that this has been fixed until now. It _has_ been fixed in 1.8.1, but the ChangeLog entry only mentions the bug that was fixed, not its symptoms. FWIW, here is the ChangeLog entry (complete with typo:-) 2001-12-09 Hrvoje Niksic [EMAIL PROTECTED] * progress.c (create_image): Fix ETA padding when hours are prined.
Re: wget does not treat urls starting with // correctly
On 4 Jan 2002 at 12:22, Bastiaan Stougie wrote: wget -P $LOCALDIR -m -np -nH -p --cut-dirs=2 http://host/dir1/dir2/ This works fine, except that wget does not follow all the urls. It skips urls like: <A HREF="//host/dir1/dir2/file">text</A> Here is a proposed patch to fix that. src/ChangeLog entry:

2002-01-07 Ian Abbott [EMAIL PROTECTED]

  * url.c (uri_merge_1): Deal with net path relative URL (one that starts with //).

And the actual patch:

Index: src/url.c
===
RCS file: /pack/anoncvs/wget/src/url.c,v
retrieving revision 1.67
diff -u -r1.67 url.c
--- src/url.c 2001/12/14 15:45:59 1.67
+++ src/url.c 2002/01/07 15:30:41
@@ -1575,6 +1575,35 @@
       memcpy (constr + baselength, link, linklength);
       constr[baselength + linklength] = '\0';
     }
+  else if (linklength > 1 && *link == '/' && *(link + 1) == '/')
+    {
+      /* LINK begins with "//" and so is a net path: we need to
+         replace everything after (and including) the double slash
+         with LINK.
+
+         So, if BASE is "http://oldhost/whatever/foo/bar", and LINK
+         is "//newhost/qux/xyzzy", our result should be
+         "http://newhost/qux/xyzzy".  */
+      int span;
+      const char *slash;
+      const char *start_insert;
+      /* Look for first slash. */
+      slash = memchr (base, '/', end - base);
+      /* If found slash and it is a double slash, then replace
+         from this point,
+         else default to replacing from the beginning. */
+      if (slash && *(slash + 1) == '/')
+        start_insert = slash;
+      else
+        start_insert = base;
+
+      span = start_insert - base;
+      constr = (char *)xmalloc (span + linklength + 1);
+      if (span)
+        memcpy (constr, base, span);
+      memcpy (constr + span, link, linklength);
+      constr[span + linklength] = '\0';
+    }
   else if (*link == '/')
     {
       /* LINK is an absolute path: we need to replace everything
Re: [no subject]
On 3 Jan 2002 at 13:58, Henric Blomgren wrote: Wget-bug: GNU Wget 1.8 [...] [root@MAGI .temporary]# wget: progress.c:673: create_image: Assertion `p - bp->buffer <= bp->width' failed. Please use Wget 1.8.1. That bug has already been fixed!
Re: Wget 1.8.1-pre2 Problem with -i, -r and -l
On 18 Dec 2001 at 23:13, Hrvoje Niksic wrote: Ian Abbott [EMAIL PROTECTED] writes: If I have a website http://somesite/ with three files on it: index.html, a.html and b.html, such that index.html links only to a.html and a.html links only to b.html then the following command will retrieve all three files: wget -r -l 1 http://somesite/index.html http://somesite/a.html Does it? For me this command retrieves only `index.html' and `a.html', and that's a bug. `-i list' makes no different. Well that's how it behaved for me, but actually I was using pre2+cvs (src/CVS/Entries at [1]). Another difference was that when the URLs were specified on the command-line, a.html was downloaded twice. I repeated the test with make distclean, ./configure --with-ssl, make and it behaved the same. With your latest CVS updates (see [2]) the -i option now behaves correctly - i.e. it downloads all three files. However, the command which specified index.html and a.html on the command-line still downloads a.html twice. [1] Here is the src/CVS/Entries I used for the behavior I originally observed: /alloca.c/1.1.1.1/Thu Dec 2 07:42:27 1999// /ansi2knr.c/1.1.1.1/Thu Dec 2 07:42:26 1999// /fnmatch.c/1.2/Sun May 27 19:34:56 2001// /getopt.c/1.1.1.1/Thu Dec 2 07:42:26 1999// /getopt.h/1.1.1.1/Thu Dec 2 07:42:26 1999// /init.h/1.2/Sun May 27 19:35:04 2001// /rbuf.h/1.4/Sun May 27 19:35:09 2001// /safe-ctype.c/1.1/Fri Mar 30 22:36:59 2001// /safe-ctype.h/1.2/Fri Apr 27 05:03:08 2001// D/ChangeLog-branches /gnu-md5.c/1.1/Sun Nov 18 04:36:20 2001// /gnu-md5.h/1.1/Sun Nov 18 04:36:20 2001// /hash.c/1.14/Tue Nov 20 11:47:32 2001// /headers.c/1.6/Tue Nov 20 11:47:32 2001// /html-parse.c/1.9/Tue Nov 20 11:47:32 2001// /ftp.h/1.11/Thu Nov 22 10:36:15 2001// /Makefile.in/1.17/Mon Nov 26 10:46:17 2001// /recur.h/1.4/Mon Nov 26 10:46:21 2001// /retr.h/1.10/Mon Nov 26 18:11:36 2001// /connect.c/1.11/Tue Nov 27 15:37:00 2001// /mswindows.h/1.5/Thu Nov 29 15:57:38 2001// /connect.h/1.6/Fri Nov 30 10:28:03 2001// /cookies.h/1.4/Fri Nov 30 10:28:03 2001// /fnmatch.h/1.3/Fri Nov 30 10:28:03 2001// /ftp-opie.c/1.6/Fri Nov 30 10:28:03 2001// /gen-md5.c/1.2/Thu Nov 29 18:48:42 2001// /gen-md5.h/1.2/Thu Nov 29 18:55:52 2001// /hash.h/1.5/Fri Nov 30 10:28:03 2001// /headers.h/1.4/Fri Nov 30 10:28:03 2001// /html-parse.h/1.3/Fri Nov 30 10:28:03 2001// /netrc.c/1.10/Fri Nov 30 10:28:03 2001// /netrc.h/1.3/Fri Nov 30 10:28:03 2001// /options.h/1.24/Fri Nov 30 10:28:03 2001// /res.h/1.3/Fri Nov 30 10:28:04 2001// /cmpt.c/1.10/Fri Nov 30 13:11:47 2001// /sysdep.h/1.19/Fri Nov 30 10:28:05 2001// /ftp.c/1.52/Mon Dec 3 19:13:15 2001// /rbuf.c/1.6/Wed Dec 5 11:16:05 2001// /gen_sslfunc.h/1.6/Thu Dec 6 10:23:13 2001// /progress.h/1.4/Thu Dec 6 10:23:14 2001// /snprintf.c/1.6/Wed Dec 5 11:16:09 2001// /url.h/1.22/Thu Dec 6 10:23:14 2001// /config.h.in/1.20/Mon Dec 10 11:30:41 2001// /cookies.c/1.18/Mon Dec 10 11:30:41 2001// /ftp-basic.c/1.15/Mon Dec 10 11:30:41 2001// /log.c/1.11/Mon Dec 10 11:30:42 2001// /main.c/1.68/Mon Dec 10 11:30:42 2001// /mswindows.c/1.8/Mon Dec 10 11:30:42 2001// /progress.c/1.23/Mon Dec 10 11:30:42 2001// /wget.h/1.31/Mon Dec 10 11:30:43 2001// /ftp-ls.c/1.22/Tue Dec 11 11:37:06 2001// /host.c/1.32/Tue Dec 11 11:37:06 2001// /host.h/1.8/Tue Dec 11 11:37:06 2001// /html-url.c/1.22/Thu Dec 13 10:47:32 2001// /res.c/1.6/Thu Dec 13 10:47:32 2001// /init.c/1.44/Mon Dec 17 10:52:57 2001// /url.c/1.67/Mon Dec 17 10:52:57 2001// /gen_sslfunc.c/1.15/Tue Dec 18 11:32:42 2001// /http.c/1.82/Mon Dec 17 19:56:57 2001// /retr.c/1.49/Tue Dec 
18 11:32:42 2001// /utils.c/1.43/Tue Dec 18 11:32:42 2001// /utils.h/1.17/Tue Dec 18 11:32:42 2001// /version.c/1.26/Tue Dec 18 11:32:42 2001// /ChangeLog/1.333/Tue Dec 18 18:59:29 2001// /recur.c/1.38/Result of merge// N.B. Although the entry for recur.c says Result of merge, it is in fact identical to -r1.38 on the server. [2] Here is the updated src/CVS/Entries after your fixes: /alloca.c/1.1.1.1/Thu Dec 2 07:42:27 1999// /ansi2knr.c/1.1.1.1/Thu Dec 2 07:42:26 1999// /fnmatch.c/1.2/Sun May 27 19:34:56 2001// /getopt.c/1.1.1.1/Thu Dec 2 07:42:26 1999// /getopt.h/1.1.1.1/Thu Dec 2 07:42:26 1999// /init.h/1.2/Sun May 27 19:35:04 2001// /rbuf.h/1.4/Sun May 27 19:35:09 2001// /safe-ctype.c/1.1/Fri Mar 30 22:36:59 2001// /safe-ctype.h/1.2/Fri Apr 27 05:03:08 2001// D/ChangeLog-branches /gnu-md5.c/1.1/Sun Nov 18 04:36:20 2001// /gnu-md5.h/1.1/Sun Nov 18 04:36:20 2001// /hash.c/1.14/Tue Nov 20 11:47:32 2001// /headers.c/1.6/Tue Nov 20 11:47:32 2001// /ftp.h/1.11/Thu Nov 22 10:36:15 2001// /Makefile.in/1.17/Mon Nov 26 10:46:17 2001// /recur.h/1.4/Mon Nov 26 10:46:21 2001// /retr.h/1.10/Mon Nov 26 18:11:36 2001// /connect.c/1.11/Tue Nov 27 15:37:00 2001// /mswindows.h/1.5/Thu Nov 29 15:57:38 2001// /connect.h/1.6/Fri Nov 30 10:28:03 2001// /cookies.h/1.4/Fri Nov 30 10:28:03 2001// /fnmatch.h/1.3/Fri Nov 30 10:28:03 2001// /ftp-opie.c/1.6/Fri Nov 30 10:28:03 2001
Re: Error while compiling Wget 1.8.1-pre2+cvs.
On 19 Dec 2001 at 17:40, Alexey Aphanasyev wrote: Hrvoje Niksic wrote: The `gnu-md5.o' object is missing. Can you show us the output from `configure'? Yes, sure. Please find it attached bellow. Have you tried running make distclean before ./configure? It is possible that some of your cached configuration results have become stale.
Wget 1.8+CVS not passing referer for recursive retrieval
Although retrieve_tree() stores and retrieves referring URLs in the URL queue, it does not pass them to retrieve_url(). This seems to have got lost during the transition from depth-first to breadth-first retrieval. This means that HTTP requests for URLs being retrieved at depth greater than 0 have the Referer set to that set by the --referer option or nothing at all, and not necessarily the URL of the referring page. src/ChangeLog entry:

2001-12-18 Ian Abbott [EMAIL PROTECTED]

  * recur.c (retrieve_tree): Pass on referring URL when retrieving recursed URL.

Index: src/recur.c
===
RCS file: /pack/anoncvs/wget/src/recur.c,v
retrieving revision 1.37
diff -u -r1.37 recur.c
--- src/recur.c 2001/12/13 19:18:31 1.37
+++ src/recur.c 2001/12/18 13:28:58
@@ -237,7 +237,7 @@
           int oldrec = opt.recursive;
           opt.recursive = 0;
-          status = retrieve_url (url, &file, &redirected, NULL, &dt);
+          status = retrieve_url (url, &file, &redirected, referer, &dt);
           opt.recursive = oldrec;
           if (file && status == RETROK
Wget 1.8.1-pre2 Problem with -i, -r and -l
I don't have time to look at this problem today, but I thought I'd mention it now to defer the 1.8.1 release. If I have a website http://somesite/ with three files on it: index.html, a.html and b.html, such that index.html links only to a.html and a.html links only to b.html then the following command will retrieve all three files: wget -r -l 1 http://somesite/index.html http://somesite/a.html However, if I then create a file 'list' containing the lines: http://somesite/index.html http://somesite/a.html and issue the command: wget -r -l 1 -i list then only index.html and a.html are retrieved. I think wget should also retrieve b.html, which is linked to by a.html, i.e. treat the URLs in the file as though they were specified on the command line.
Re: A small bug
On 14 Dec 2001 at 14:49, Peng GUAN wrote: Maybe a bug in file fnmatch.c, line 54: ( n==string || (flags & FNM_PATHNAME) && n[-1] == '/')) the n[-1] should be change to *(n-1). I like the easy ones. Those are equivalent in C. As to which of the two looks the nicest is a matter of aesthetics and also depends on the style of the surrounding source code. At least both of the above look nicer than (-1)[n] which is also equivalent to the above, but its usage is reserved for obfuscated C coding competitions!
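[Editor's note: a tiny standalone illustration of that equivalence, purely for the record.]

  #include <assert.h>

  int
  main (void)
  {
    const char *s = "abc";
    const char *n = s + 1;
    /* In C, x[i] is defined as *(x + i), so all three expressions below
       read the same character ('a').  */
    assert (n[-1] == *(n - 1));
    assert (n[-1] == (-1)[n]);
    return 0;
  }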
Re: Is wget --timestamping URL working on Windows 2000?
On 11 Dec 2001 at 18:40, [EMAIL PROTECTED] wrote: It seems to me that if an output_document is specified, it is being clobbered at the very beginning (unless always_rest is true). Later in http_loop stat() comes up with zero length. Hence there's always a size mismatch when --output-document is specified. That doesn't sound good to me... But it's as documented in the man page. The option is meant for concatenating several pages into one big file, and you can't meaningfully compare timestamps or file sizes in that case.
Re: log errors
On 11 Dec 2001 at 16:09, Hrvoje Niksic wrote: Summer Breeze [EMAIL PROTECTED] writes: Here is a sample entry: 66.28.29.44 - - [08/Dec/2001:18:21:20 -0500] "GET /index4.html%0A HTTP/1.0" 403 280 "-" "Wget/1.6" /index4.html%0A looks like a page is trying to link to /index4.html, but the link contains a trailing newline. If that is the case, you may be able to track down the referring page if that is also logged. Another possibility is that someone is running a (UNIX) command like this:

$ wget 'http://motherbird.com/index4.html
> '

(The '$' and '>' in the above are just shell prompts, not part of the command.) I just tried that myself and saw that Wget was trying to retrieve "http://motherbird.com/index4.html%0A" as in your log file and got an ERROR 403: Forbidden back.
Re: Make -p work with framed pages.
On 1 Dec 2001 at 4:04, Hrvoje Niksic wrote: As a TODO entry summed up: * -p should probably go _two_ more hops on FRAMESET pages. More generally, I think it probably needs to be made to work for nested framesets too.
Re: windows patch and problem
On 29 Nov 2001 at 12:48, Herold Heiko wrote: --12:27:26-- http://www.cnn.com/ (try: 3) = `www.cnn.com/index.html' Found www.cnn.com in host_name_addresses_map (008D01B0) Releasing 008D01B0 (new refcount 1). Retrying. (ecc.) Same with other hosts Could somebody please confirm if this is a problem with my build ? No, it happens on my Linux build too. Something broke.
Re: wget1.7.1: Compilation Error (please Cc'ed to me :-)
On 29 Nov 2001 at 13:14, Daniel Stenberg wrote: On Thu, 29 Nov 2001, Maciej W. Rozycki wrote: On Wed, 28 Nov 2001, Ian Abbott wrote: However, the Linux man page for bcopy(3) do not say the strings can overlap Presumably the man page is incorrect Yes, I think so. Well, can we actually guarantee that bcopy() will work on all platforms where memmove() is not present? HAVE_BCOPY? I wouldn't be so bold to say that. I'd vote for a separate implemenation. But that's just me. That's the easiest thing to do. It's only used at the moment for removing duplicate outgoing cookies. I don't know how often you get duplicate cookies, and the current mechanism for removing them isn't all that efficient when there are multiple duplicates to be removed anyway!
Re: wget1.7.1: Compilation Error (please Cc'ed to me :-)
On 29 Nov 2001 at 14:40, Hrvoje Niksic wrote: Ian, can you clarify what you meant by BSD man pages? Which BSD? NetBSD: http://www.tac.eu.org/cgi-bin/man-cgi?bcopy+3 OpenBSD: http://www.openbsd.org/cgi-bin/man.cgi?query=bcopy&sektion=3 FreeBSD: http://www.freebsd.org/cgi/man.cgi?query=bcopy&sektion=3 Those are all pretty much identical and say that the strings can overlap and that a bcopy function appeared in BSD4.2. SunOS 4.1.3: http://www.freebsd.org/cgi/man.cgi?query=bcopy&sektion=3&manpath=SunOS+4.1.3 That one aliases SunOS 4.1.3's bstring(3) man page which describes a group of related functions (including bcopy). It also says the strings can overlap.
Re: wget1.7.1: Compilation Error (please Cc'ed to me :-)
On 28 Nov 2001 at 18:08, Hrvoje Niksic wrote: Daniel Stenberg [EMAIL PROTECTED] writes: On Wed, 28 Nov 2001, zefiro wrote: ld: Undefined symbol _memmove Do you have any suggestion ? SunOS 4 is known to not have memmove. May I suggest adding the following (or similiar) to a relevant wget source file: [...] Thanks for the suggestion and the code example. Two points, though: * Isn't it weird that the undefined symbol is _memmove, not memmove? It looks as if a header file is translating the symbol, thinking that _memmove exists. Not really. UNIX C compilers of old prefix C external symbols with '_'. GCC doesn't do that unless targetted for a system that uses the prefix in its standard system library symbols. * As a BSD offshoot, SunOS almost certainly has bcopy. Could we make use of it? I seem to remember reading that BSD bcopy is supposed to handle overlapping blocks, but I cannot find a confirmation right now. If that were the case, we could simply use this: #ifndef HAVE_MEMMOVE # define memmove(to, from, len) bcopy(from, to, len) #endif That ought to work as the SunOS and BSD man pages say that the strings can overlap. However, the Linux man page for bcopy(3) do not say the strings can overlap and in fact suggest that it be replaced with memcpy in new programs! Linux has memmove so that does not matter, but perhaps rolling our own memmove as Daniel suggested would be the safest option. Another difference between bcopy() and memmove() is that bcopy() returns void whereas memmove returns a pointer, but in the one place in the Wget source where memmove() is called, the return value is not used.
Re: HAVE_RANDOM ?
On 27 Nov 2001, at 15:16, Hrvoje Niksic wrote: So, does anyone know about the portability of rand()? It's in the ANSI/ISO C spec (ISO 9899). It's always been in UNIX (or at least it's been in there since UNIX 7th Edition), and I should think it's always been in the MS-DOS compilers, but I don't have one handy at the moment. It tends not to be very random in some implementations, but should be good enough to implement a random wait.
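[Editor's note: for what it's worth, a random wait only needs something along these lines, all of which is plain ISO C; an illustrative sketch only, not necessarily how Wget implements it.]

  #include <stdlib.h>
  #include <time.h>

  /* Seed the ISO C generator once at program start...  */
  static void
  init_random (void)
  {
    srand ((unsigned) time (NULL));
  }

  /* ...then derive waits of 0 to 2 * base_wait seconds from rand().  */
  static double
  random_wait (double base_wait)
  {
    return 2.0 * base_wait * ((double) rand () / RAND_MAX);
  }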
wget-1.8-dev Segmentation fault when retrieving from file
I got a segmentation fault when retrieving URLs from a file.

2001-11-27 Ian Abbott [EMAIL PROTECTED]

  * retr.c (retrieve_from_file): Initialize `new_file' to NULL to prevent seg fault.

Index: src/retr.c
===
RCS file: /pack/anoncvs/wget/src/retr.c,v
retrieving revision 1.41
diff -u -r1.41 retr.c
--- src/retr.c 2001/11/26 20:07:13 1.41
+++ src/retr.c 2001/11/27 18:31:12
@@ -538,7 +538,7 @@
   for (cur_url = url_list; cur_url; cur_url = cur_url->next, ++*count)
     {
-      char *filename = NULL, *new_file;
+      char *filename = NULL, *new_file = NULL;
       int dt;

       if (cur_url->ignore_when_downloading)
Re: Does the -Q quota command line argument work?
On 27 Nov 2001 at 13:07, John Masinter wrote: It seems that wget will download an entire large file regardless of what I specify for the quota. For example I am trying to download only the first 100K of a 800K file. I specify this: wget -Q 100K http://url-goes-here It then proceeds to download the entire 800K file. I've also tried using the --quota=100K form as well as -Q 10 and nothing seems to work. Did I misinterpret the purpose of this argument? Yes, as it says in the manual, the quota will never affect downloading a single file. You may wish to try out the new --range option in wget 1.8-dev (available via anonymous CVS), or wait until wget 1.8 comes out.