Re: Wikipedia page
Good idea Hrvoje.

Oliver

Hrvoje Niksic wrote:
> "Oliver Schulze L." <[EMAIL PROTECTED]> writes:
>> I think that having a link to an email address is not that useful,
>> because people can just write to that email address since it's a
>> mailing list.
> Good point. An even better link might be to the gmane archive, where
> you can read the list, but which also allows posting.

-- Oliver Schulze L. <[EMAIL PROTECTED]>
No more Libtool (long)
Thanks to the effort of Mauro Tortonesi and the prior work of Bruno Haible, Wget has been modified to no longer use Libtool for linking in external libraries. If you are interested in why that might be a cause for celebration, read on.

A bit of history: Libtool was integrated into Wget by Dan Harkless, despite protests (see http://tinyurl.com/98zkt), to ensure portable linking to external libraries. Linking with a system library, such as librt or libpthread, is as easy as using -lLIBNAME. However, linking to third-party libraries installed in /usr/local/lib or elsewhere is harder because: a) you have to find the location of the library, and b) you have to produce an executable with runtime path information to find the library when it is run (the system's dynamic linker cannot be expected to know about non-standard library locations).

The b) part is really tricky because the compiler and linker flags vary from system to system, and it is hard or impossible to get access to a large number of different systems to test on. For example, on Linux you would use -Wl,-rpath /usr/local/lib, on Solaris you would use -R/usr/local/lib, on AIX you might use -Wl,-blibpath /usr/local/lib:/usr/lib:/lib, and so on. Of course, the "-Wl," part also differs between compilers. And the GNU linker may be used by GCC on some of the systems, which means you have to use its flags, not the system ones. And so on -- you get the idea.

Libtool, normally used for *building* shared libraries, can also be used to help link them in, because it contains code that handles the above runpath conundrum. It supports a unified interface where make can simply use -R/usr/local/lib and depend on libtool to convert that to the incantation appropriate for the system linker. configure.in was made to detect OpenSSL in this way; however, what started as a simple use of libtool turned into 200 lines of *hard* configure.in code.
Despite the apparent improvement over simply not specifying the runpath, and arguably over trying to duplicate libtool's rpath logic in configure.in, Libtool brought many painful disadvantages, which I will proceed to list, in no particular order:

* It made the configure script much larger and slower, exercising many weird and unnecessary checks, such as: how to run one of ~20 supported FORTRAN compilers, how to parse the output of `nm', how to run the C++ preprocessor, where to find `ar', `ranlib', and `strip', how to produce PIC, how to tell C++ not to use RTTI and exceptions (!), and so on.

* It is unclear what would happen if some of the checks Libtool thinks are important (the nm one comes to mind) failed on a platform on which the rest of Wget builds just fine. The experience with a Libtool version that caused Wget to fail to build when there was no C++ compiler on the platform suggests the worst.

* Such use of Libtool is complete overkill. While Libtool may be the appropriate solution for building shared libraries (although there are opposing views on that), it was certainly not designed with this use in mind, which is amply proven by the amount of documentation devoted to the issue -- none.

* The merge of Wget's configure and libtool was far from clean, simply because such use was not envisioned and is therefore not documented. It involved digging into Autoconf internals, such as unsetting cache-related variables, temporarily changing CC to "$SHELL ./libtool $CC" and then reverting it, hackery to LDFLAGS and LIBS, and more.

* It was completely specific to OpenSSL's libssl and libcrypto, and not reusable for other external libraries. Adding a *new* external library would have required rethinking the entire scheme, and possibly rewriting that very tricky code.
* Libtool created unnecessary cruft, such as the .libs directories, and unnecessary restrictions, such as the inability to use `make CC=some-other-compiler' without rerunning configure, even though the other compiler would work just fine with the Makefile variables currently available -- for example, it could be another version of gcc. (This had to do with the "tags" feature of Libtool, which the documentation didn't explain sufficiently to turn it off.)

* The complex and fragile Libtool code base required frequent updates. Some versions of Wget didn't compile on otherwise unexceptional operating systems simply because of Libtool bugs. While it can be argued that all software requires updates in one form or another, Libtool has required much more hand-holding than the other software we use to build Wget, for example Autoconf.

* IT DIDN'T WORK, despite all the invested effort. After Wget 1.10 was released, this list received reports of OpenSSL libraries not being detected on some operating systems, apparently because Libtool insisted on creating executables in the .libs directory (where the Autoconf test system doesn't find them). Of course, Libtool doesn't do that on Linux, nor on Solaris.
Re: wget bug report
<[EMAIL PROTECTED]> writes:

> Sorry for the crosspost, but the wget Web site is a little confusing
> on the point of where to send bug reports/patches.

Sorry about that. In this case, either address is fine, and we don't mind the crosspost.

> After taking a look at it, I implemented the following change to
> http.c and tried again. It works for me, but I don't know what other
> implications my change might have.

It's exactly the correct change. A similar fix has already been integrated into the CVS (in fact Subversion) code base. Thanks for the report and the patch.
Re: Bug handling session cookies
"Mark Street" <[EMAIL PROTECTED]> writes: > Many thanks for the explanation and the patch. Yes, this patch > successfully resolves the problem for my particular test case. Thanks for testing it. It has been applied to the code and will be in Wget 1.10.1 and later.
Re: Wikipedia page
"Oliver Schulze L." <[EMAIL PROTECTED]> writes: > I think that having a link to an email address is not that usefull, > because people can just write to that email address because its a > mailling list. Good point. An even better link might be to the gmane archive, where you can read the list, but which also allowed posting.
Re: Removing thousand separators from file size output
"Tony Lewis" <[EMAIL PROTECTED]> writes: > Hrvoje Niksic wrote: > >> In fact, I know of no application that accepts numbers as Wget prints > them. > > Microsoft Calculator does. Sorry, I forgot to qualify that as "(Unix) command-line application" or something to that effect. I know that many GUI applications, such as Excel, accept numbers in a variety of formats, including (depending on locale and possibly number format customizations) that one.
Re: Bug handling session cookies
Hrvoje, Many thanks for the explanation and the patch. Yes, this patch successfully resolves the problem for my particular test case. Best regards, Mark Street.
Re: Removing thousand separators from file size output
Leonid <[EMAIL PROTECTED]> writes:

> Those guys who find numbers like 11782023180 easy to read and can
> tell for a fraction of a second that it was 11Gb

I'm not such a person; Wget would in fact print:

Length: 11782023180 (11.0G)
Re: Wikipedia page
Added :)

I think that having a link to an email address is not that useful, because people can just write to that email address since it's a mailing list. It's just an idea; if you want to, you can revert the changes.

Thanks
Oliver

Hrvoje Niksic wrote:
> "Oliver Schulze L." <[EMAIL PROTECTED]> writes:
>> Looks really nice. Maybe it needs a link to instructions on how to
>> subscribe to the mailing list.
> You can always add it. :-) But we already have a link to the home
> page where the information resides. A link to subscription details
> probably doesn't belong in a Wikipedia article.

-- Oliver Schulze L. <[EMAIL PROTECTED]>
RE: Removing thousand separators from file size output
Hrvoje Niksic wrote:

> In fact, I know of no application that accepts numbers as Wget
> prints them.

Microsoft Calculator does.

Tony
RE: Removing thousand separators from file size output
Hrvoje,

> What do you think?

To add a new (oh!) option in .wgetrc and call it decimal_separator. Those guys who find numbers like 11782023180 easy to read, and can tell within a fraction of a second that it was 11Gb downloaded, not 1.1Gb, will use

decimal_separator = ""

I personally would specify

decimal_separator = ","

Germans may prefer

decimal_separator = "."

Leonid
Re: ChangeLog-branches
Alain Bench <[EMAIL PROTECTED]> writes:

> MHO: They are ununderstandable, unusable, unclean, and big. They may
> give a false bad impression of source/project misorganization. We
> want to drop them, wipe any proof of their existence from any
> archives and mirrors, then honestly deny they ever existed. No need
> to kill witnesses though: Who would believe them?

The pesky Subversion software allows their restoration... I *knew* we should have stuck to CVS! :-)

(They're gone.)
Re: Removing thousand separators from file size output
Alain Bench <[EMAIL PROTECTED]> writes:

> On Thursday, June 23, 2005 at 3:16:28 PM +0200, Hrvoje Niksic wrote:
>> Since Wget 1.10 also prints sizes in kilobytes/megabytes/etc., I am
>> thinking of removing the thousand separators from size display.
>
> IMHO thousand (or myriad) separators are necessary. This size
> display is primarily intended for humans, not for other apps.

Primarily yes -- which is why Wget 1.10 also shows the size in units. But it is also convenient as input for other applications, which is very hard with the thousand separators.

> If separators constitute a difficulty for other apps, then it's
> these other apps' problem. Or sed's task (s/,//g).

It's not the other apps' problem. In many applications (e.g. programming languages, but also programmable calculators) "," is a separator between function arguments and cannot be used inside a number. In fact, I know of no application that accepts numbers as Wget prints them. (sed is not readily available when I paste Wget's output into another program such as bc or calc.)

> Humans can have the habit of looking at the exact unit size, or the
> rounded kilo/mega/tera size, or both. It would be a regression to
> reduce readability of the legacy exact byte count,

The way I see it, with the unit sizes present, omitting the thousand separators merely removes redundancy, not useful information. More importantly, I know of no other command-line program that prints sizes with thousand separators the way Wget does, with no way to get ordinary parsable numbers. If users were so used to separators, surely they would request them in other programs, such as `ls', `du', or `df'?

> I don't really understand nor follow your reasons against
> localization. User's cultural preferences should be respected.

You can make a case that the correct character and layout should be used for digit grouping when it is deployed, but I don't see how you can argue that grouping *must* be used in all applications!
The appearance of grouped digits can be and is described by the locale, but no locale mandates grouping to be used for display of all numbers. As for localization, I'm not against it. The argument was that, where possible, I prefer the output of applications to remain parsable. For example, I consider the ISO 8601 date format a clear advantage over the asctime() format. The same goes for the display of integers.
Re: Removing thousand separators from file size output
On Thursday, June 23, 2005 at 3:16:28 PM +0200, Hrvoje Niksic wrote:

> Since Wget 1.10 also prints sizes in kilobytes/megabytes/etc., I am
> thinking of removing the thousand separators from size display.

IMHO thousand (or myriad) separators are necessary. This size display is primarily intended for humans, not for other apps. If separators constitute a difficulty for other apps, then it's these other apps' problem. Or sed's task (s/,//g). Humans can have the habit of looking at the exact unit size, or the rounded kilo/mega/tera size, or both. It would be a regression to reduce readability of the legacy exact byte count, just because we have a newly added, more human-readable but rounded count.

> The separators are interpunction which introduces clutter, especially
> with complex size output also containing the "remaining" size next to
> the whole size.

True: the more info, the more confusion. But that's the contrary of a valid reason to reduce the readability of that info. And IMHO removing thousand separators reduces readability.

> replace the "," character with the character mandated by the locale

This seems naturally desirable. I don't really understand nor follow your reasons against localization. Users' cultural preferences should be respected. OTOH this is not so important nor urgent, compared to the cons of thousand-separator removal.

Bye! Alain.

-- When you want to reply to a mailing list, please avoid doing so from a digest. This often builds incorrect references and breaks threads.
Re: ChangeLog-branches
Hello Hrvoje,

On Thursday, June 23, 2005 at 9:00:44 PM +0200, Hrvoje Niksic wrote:

> the ChangeLog-branches directories distributed with Wget are
> desirable or necessary?

MHO: They are ununderstandable, unusable, unclean, and big. They may give a false bad impression of source/project misorganization. We want to drop them, wipe any proof of their existence from any archives and mirrors, then honestly deny they ever existed. No need to kill witnesses though: Who would believe them?

Bye! Alain.

-- Microsoft Outlook Express users concerned about readability: For much better viewing of quotes in your messages, check out the little freeware program OE-QuoteFix by Dominik Jain at http://flash.to/oblivion/. It'll change your life. :-) Now exists also for Outlook.
Re: Getting the list of the files to download before downloading them
On 6/21/05, Isaac Grover <[EMAIL PROTECTED]> wrote:

>>> I wonder if someone on the list could come up with a sed one-liner?
>>> Or a snippet of perl perhaps. It should be trivial to take a
>>> directory of html files, extract html tags that bracket each URL
>>> that mention a PDF file, and write a pseudo-HTML file that contains
>>> only the PDF links for wget.
>
> I don't know sed, and it wouldn't be hard to do in perl I suppose,
> but this is more or less what I use:
>
> #!/bin/sh
>
> wget http://www.example.com/links/
> grep "http://" index.html > index.txt
> cat index.txt | awk 'BEGIN { FS="\"" } { print $2 }' > url_list.txt
>
> Then if you wanted to only grab the PDF files, do:
>
> grep "\.pdf" url_list.txt > new_url_list.txt
> wget -i new_url_list.txt
>
> It is just after midnight here, so it may not work exactly as
> advertised, but cut-n-paste usually doesn't lie, so it should work
> okay.

Thanks, Isaac, but as far as I understand your script, it does not work with wget's recursion.

Paul
Re: Bug handling session cookies
"Mark Street" <[EMAIL PROTECTED]> writes: > I'm not sure why this [catering for paths without a leading /] is > done in the code. rfc1808 declared that the leading / is not really part of path, but merely a "separator", presumably to be consistent with its treatment of ;params, ?queries, and #fragments. The author of the code found it appealing to disregard common sense and implement rfc1808 semantics. In most cases the user shouldn't notice the difference, but it has lead to all kinds of implementation problems with code that assumes that URL paths naturally begin with /. Because of that it will be changed later. > Note that the forward slash is stripped from "prefix", hence never > matches "full_path". I'm not sure why this is done in the code. Because PREFIX is the path declared by the cookie, which always begins with /, and FULL_PATH is the URL path coming from the URL parsing code, which doesn't begin with a /. To match them, one must indeed strip the leading / off PREFIX. But paths without a slash still caused subtle problems. For example, cookies without a path attribute still had to be stored with the correct cookie-path (with a leading slash). To account for this, the invocation of cookie_handle_set_cookie was modified to prepend the / before the path. This lead to path_match unexpectedly receiving two /-prefixed paths and being unable to match them. The attached patch fixes the problem by: * Making sure that path consistently gets prepended in all entry points to cookie code; * Removing the special logic from path_match. With that change your test case seems to work, and so do all the other tests I could think of. Please let me know if it works for you, and thanks for the detailed bug report. 2005-06-24 Hrvoje Niksic <[EMAIL PROTECTED]> * http.c (gethttp): Don't prepend / here. * cookies.c (cookie_handle_set_cookie): Prepend / to PATH. (cookie_header): Ditto. 
Index: src/http.c
===================================================================
--- src/http.c	(revision 1794)
+++ src/http.c	(working copy)
@@ -1706,7 +1706,6 @@
   /* Handle (possibly multiple instances of) the Set-Cookie header. */
   if (opt.cookies)
     {
-      char *pth = NULL;
       int scpos;
       const char *scbeg, *scend;
       /* The jar should have been created by now. */
@@ -1717,15 +1716,8 @@
            ++scpos)
         {
           char *set_cookie;
           BOUNDED_TO_ALLOCA (scbeg, scend, set_cookie);
-          if (pth == NULL)
-            {
-              /* u->path doesn't begin with /, which cookies.c expects. */
-              pth = (char *) alloca (1 + strlen (u->path) + 1);
-              pth[0] = '/';
-              strcpy (pth + 1, u->path);
-            }
-          cookie_handle_set_cookie (wget_cookie_jar, u->host, u->port, pth,
-                                    set_cookie);
+          cookie_handle_set_cookie (wget_cookie_jar, u->host, u->port,
+                                    u->path, set_cookie);
         }
     }
Index: src/cookies.c
===================================================================
--- src/cookies.c	(revision 1794)
+++ src/cookies.c	(working copy)
@@ -822,6 +822,17 @@
 {
   return path_matches (path, cookie_path) != 0;
 }
+
+/* Prepend '/' to string S.  S is copied to fresh stack-allocated
+   space and its value is modified to point to the new location. */
+
+#define PREPEND_SLASH(s) do {                                   \
+  char *PS_newstr = (char *) alloca (1 + strlen (s) + 1);       \
+  *PS_newstr = '/';                                             \
+  strcpy (PS_newstr + 1, s);                                    \
+  s = PS_newstr;                                                \
+} while (0)
+
 
 /* Process the HTTP `Set-Cookie' header.  This results in storing the
    cookie or discarding a matching one, or ignoring it completely, all
@@ -835,6 +846,11 @@
   struct cookie *cookie;
   cookies_now = time (NULL);
 
+  /* Wget's paths don't begin with '/' (blame rfc1808), but cookie
+     usage assumes /-prefixed paths.  Until the rest of Wget is fixed,
+     simply prepend slash to PATH. */
+  PREPEND_SLASH (path);
+
   cookie = parse_set_cookies (set_cookie, update_cookie_field, false);
   if (!cookie)
     goto out;
@@ -977,17 +993,8 @@
 static int
 path_matches (const char *full_path, const char *prefix)
 {
-  int len;
+  int len = strlen (prefix);
 
-  if (*prefix != '/')
-    /* Wget's HTTP paths do not begin with '/' (the URL code treats it
-       as a mere separator, inspired by rfc1808), but the '/' is
-       assumed when matching against the cookie stuff. */
-    return 0;
-
-  ++prefix;
-  len = strlen (prefix);
-
   if (0 != strncmp (full_path, prefix, len))
     /* FULL_PATH doesn't begin with PREFIX. */
     return 0;
@@ -1149,6 +1156,7 @@
   int count, i, ocnt;
   char *result;
   int result_size, pos;
+  PREPEND_SLASH (path);  /* see cookie_handle_set_cookie */
 
   /* First, find the cooki
Bug handling session cookies
Hello folks,

I'm running wget v1.10 compiled from source (tested on HP-UX and Linux). I am having problems handling session cookies. The idea is to request a web page which returns an ID number in a session cookie. All subsequent requests to the site must contain this session cookie. I'm using a command line as follows:

wget --no-proxy --save-cookies cookies.txt --keep-session-cookies http://ttms:9900/testdb-bin/login -O -

The headers returned from the webserver are as follows:

---request begin---
GET /testdb-bin/login HTTP/1.0
User-Agent: Wget/1.10
Accept: */*
Host: ttms:9900
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Date: Fri, 24 Jun 2005 09:22:38 GMT
Server: Apache/2.0.51 (Unix) PHP/4.3.3
Set-Cookie: SessionID=1119604958; path=/testdb-bin
Connection: close
Content-Type: text/html; charset=ISO-8859-1
---response end---

However, the cookies.txt file is empty...

$ cat cookies.txt
# HTTP cookie file.
# Generated by Wget on 2005-06-24 10:22:38.
# Edit at your own risk.
$

I've looked at the source code; in cookies.c I've added debug output to print the contents of full_path and prefix in the path_matches() function. The output is as follows:

path_matches() full_path: /testdb-bin/login, prefix: /testdb-bin [ on function entry, i.e. before the ++prefix statement ]
path_matches() calling strncmp("/testdb-bin/login", "testdb-bin", 10) = -69

Note that the forward slash is stripped from "prefix", hence it never matches "full_path". I'm not sure why this is done in the code. Is there a problem here? Or am I doing something wrong? The path returned in the cookie from the webserver seems valid. It's generated by the Perl CGI module's cookie method and seems consistent with the CGI man page.

For now, I've hacked the path_matches() function to ensure that the slash prefixes are always consistent...
/* MNS hack for fixing cookie leading slashes */
if (*prefix == '/' && *full_path != '/')
  prefix++;
if (*prefix != '/' && *full_path == '/')
  full_path++;
/* MNS end of hack */
// ++prefix;  MNS: was original code

If I try the same test with something like www.google.com, the cookie file gets created successfully - although this isn't a session cookie, of course.

Cheers,
Mark.