Re: Strange character in file length
On Tuesday, March 7, 2006 at 8:56:15 +0100, Hrvoje Niksic wrote:

> Alain Bench [EMAIL PROTECTED] writes:
>> unusable nonsense
> totally weird.

So far Google didn't help me much. Or rather it discouraged me with "Borland supports only the C locale"-like statements. But I didn't find any official doc. Does anybody have info on Borland's own libc?

>> Either there is a magic option I missed, or I'd recommend to treat Borland as C locale (forcing comma separator and grouping by 3).
> I suggest we do the latter

The alternative seemingly would be: change "3;0" to "\003\000", and transcode the separator from GetACP() to GetConsoleOutputCP(). Would it work in all cases? Which transcoding function? And all this in Borland-specific code...

>> against 1.10.x?
> It's always a better idea to patch the trunk code

Sure: I installed Subversion and began learning it, just to put my hands on the said trunk (I found no tar.gz snapshots?). But then the conflicting types for `uintptr_t' stopped me.

Bye! Alain.
-- 
When you post a new message, beginning a new topic, use the mail or post or new message functions. When you reply or followup, use the reply or followup functions. Do not do the one for the other, this breaks or hijacks threads.
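The fallback recommended in this message (comma separator, grouping by 3, regardless of what the libc reports) could look roughly like the sketch below. The name group_by_three and its signature are hypothetical and are not taken from Wget's source; it only illustrates the fixed "C locale" grouping.

```c
#include <stdio.h>

/* Hypothetical helper, not Wget's actual code: write n into buf with a
   "," between every group of three digits.  Returns buf, or NULL if
   buf is too small. */
static char *
group_by_three (unsigned long long n, char *buf, size_t bufsize)
{
  char digits[24];              /* enough for a 64-bit value */
  int len = snprintf (digits, sizeof digits, "%llu", n);
  char *p = buf;
  int i;

  /* len digits, (len - 1) / 3 separators, one terminating NUL. */
  if (len <= 0 || (size_t) (len + (len - 1) / 3 + 1) > bufsize)
    return NULL;

  for (i = 0; i < len; i++)
    {
      /* A separator goes wherever the digits still to come form
         whole groups of three. */
      if (i > 0 && (len - i) % 3 == 0)
        *p++ = ',';
      *p++ = digits[i];
    }
  *p = '\0';
  return buf;
}
```

With a 32-byte buffer, group_by_three(1234567, buf, sizeof buf) fills buf with "1,234,567". Nothing here consults localeconv(), which is the point of the fallback.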
Re: Strange character in file length
On Thursday, March 2, 2006 at 22:28:28 +0100, Hrvoje Niksic wrote:

> you can get a free compiler from here: http://www.borland.com/downloads/download_cbuilder.html

Nice tip, thank you! Bad news: Borland 5.5.1 seems to do locales in its own way, not at all as I explained here about msvcrt.dll. It seems that:

 · setlocale() always returns a composite string with newlines (as if the categories were not identical). For example setlocale(LC_ALL, ".852") returns:

| LC_MONETARY=French_France.852
| LC_TIME=French_France.852
| LC_NUMERIC=French_France.852
| LC_COLLATE=French_France.852
| LC_CTYPE=French_France.852

 · setlocale() yields French_France.850, not following chcp.

 · It doesn't know about ".OCP" nor ".ACP", but still accepts "C", ".1252" and such.

 · Whatever the setlocale charset, localeconv() only outputs CP-1252.

 · localeconv() grouping gives an ASCII string "3;0" (or "3;2;0" for Indian). Wget of course groups by... 51 digits (the ASCII code of "3").

Seems a little bit like unusable nonsense to me. Either there is a magic option I missed, or I'd recommend to treat Borland as C locale (forcing comma separator and grouping by 3).

The test code works well on MinGW. Wget itself doesn't like the unixish ./configure and make procedure under Msys 1.0.10, but I found the "configure.bat --mingw" way to use MinGW 3.1.0 directly. Wget 1.10.2 compiles this way, and seemingly grouping, charset, and decimal point work well with the attached patch.

BTW I had a problem compiling straight subversion trunk:

| F:\wget-r2129\src>mingw32-make.exe
| gcc -DWINDOWS -DHAVE_CONFIG_H -O3 -Wall -I. -c -o cmpt.o cmpt.c
| In file included from wget.h:89,
|                  from cmpt.c:43:
| sysdep.h:199: warning: redefinition of `uint32_t'
| c:/MinGW/include/stdint.h:32: warning: `uint32_t' previously declared here
| sysdep.h:215: conflicting types for `uintptr_t'
| c:/MinGW/include/stdint.h:61: previous declaration of `uintptr_t'
| mingw32-make.exe: *** [cmpt.o] Error 1

Bye! Alain.
-- 
How To Ask Questions The Smart Way <URL:http://www.catb.org/~esr/faqs/smart-questions.html>

Attachment: wget-1.10.2.win32-setlocale.1.patch.gz
Description: application/gunzip
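To make the "groups by 51 digits" remark concrete: localeconv()->grouping is normally a string of small binary values ("\003" for groups of three), so a textual "3;0" is read as a first group of '3' == 51 digits. A minimal sketch of the conversion follows, assuming the Borland string really is a semicolon-separated decimal list ending in 0; the helper name grouping_from_text is made up for illustration.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: convert a textual grouping such as "3;2;0" into
   the binary form expected from localeconv()->grouping ("\003\002").
   Conversion stops at the first 0, at garbage, or when out is full.
   Returns the number of group sizes stored; out is NUL-terminated. */
static size_t
grouping_from_text (const char *text, char *out, size_t outsize)
{
  size_t n = 0;

  if (outsize == 0)
    return 0;
  while (n + 1 < outsize)
    {
      char *end;
      long value = strtol (text, &end, 10);

      if (end == text || value <= 0 || value >= CHAR_MAX)
        break;                  /* "0", garbage, or end of string */
      out[n++] = (char) value;
      if (*end != ';')
        break;
      text = end + 1;
    }
  out[n] = '\0';
  return n;
}
```

Given "3;0" this stores the single byte 3 followed by the terminating NUL, which by the usual C convention means "keep repeating groups of three".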
Re: Strange character in file length
On Thursday, March 2, 2006 at 7:51:43 +0100, Hrvoje Niksic wrote:

> Then the code could look like this:

Seems good to me. I can help testing, if someone compiles.

Bye! Alain.
-- 
Give your computer's unused idle processor cycles to a scientific goal: The [EMAIL PROTECTED] project at <URL:http://folding.stanford.edu/>.
Re: Strange character in file length
On Saturday, February 25, 2006 at 21:06:19 +0100, Hrvoje Niksic wrote:

> Is the current charset of the console ever really different than the default OEM charset?

They are identical by default. But the first can be changed in each console window, while the latter is fixed on a given Windows install.

> How does one change the console charset, anyway?

Through the chcp command in a cmd.exe session, or through a call to SetConsoleCP() or SetConsoleOutputCP() in an app.

> the setlocale invocation should look like this:

Hum... You dropped the fallback to ANSI when GetConsoleOutputCP() returns 0. That's fine, if it's considered useless. But it could lead to a setlocale(LC_ALL, ".0"), with unknown behaviour. Hopefully it then fails, returning NULL, as it does in my setup. But I'm not sure it does that in all setups.

> Wget is calling setlocale(LC_ALL, "") only if HAVE_NLS is defined, which is typically not the case on Windows, as HAVE_NLS implies existence of gettext, textdomain, and bindtextdomain.

Also utils.c:get_grouping_data() does setlocale(LC_NUMERIC, "") temporarily. On some platforms, including Windows, setting LC_NUMERIC alone is not guaranteed to have the desired effect. For example, on Windows the call with ".850" succeeds, but selects the default ANSI locale (or whatever was set by LC_ALL). In fact it seems to me that all manipulations of a single category, outside of LC_ALL, are calling for problems on this or that platform. Especially when the charsets are incompatible between categories (up to segfaults, when cooperating with some buggy Glibc versions).

Now, if the main setlocale(LC_ALL) is always called, I believe that get_grouping_data() can be greatly simplified, dropping the "#ifdef LC_NUMERIC" setlocale(LC_NUMERIC), and just calling localeconv() if it exists.

> Testers able to compile Wget and reproduce this problem would be much appreciated.

Beware: The default console font "Terminal", having only an OEM/DOS script, is unable to correctly follow chcp commands. A smarter font like Lucida Console is better suited for testing.

Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so with Lotus Notes 5. This lacks necessary references and breaks threads.
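The simplification proposed above (no temporary setlocale(LC_NUMERIC), only localeconv()) might be sketched as follows. This is not Wget's actual get_grouping_data(); the function name, its signature, and the fallback to "," with groups of three when the locale defines no separator are assumptions made for illustration.

```c
#include <locale.h>

/* Sketch only, assuming setlocale(LC_ALL, ...) was already called once
   at program start.  Stores the thousands separator and the grouping
   string of the current locale, falling back to "," (or "." when ","
   is the decimal point) and groups of three when none is defined.
   The pointers may refer to localeconv()'s static data, so they are
   only valid until the next setlocale() or localeconv() call. */
static void
get_grouping_data_sketch (const char **sep, const char **grouping)
{
  struct lconv *lc = localeconv ();

  *sep = lc->thousands_sep;
  *grouping = lc->grouping;
  if (**sep == '\0')
    {
      /* Avoid clashing with a "," decimal point. */
      *sep = (*lc->decimal_point != ',') ? "," : ".";
      *grouping = "\003";
    }
}
```

In the plain "C" locale, where thousands_sep is empty, this yields "," and "\003"; a real implementation would probably copy the strings instead of keeping pointers into static data.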
Re: Strange character in file length
On Wednesday, March 1, 2006 at 16:13:17 +0100, Hrvoje Niksic wrote:

> Alain Bench [EMAIL PROTECTED] writes:
>> fallback to ANSI when GetConsoleOutputCP() returns 0.
> I didn't know it could return 0.

I don't know exactly how, but it can. Apparently a graphic frontend starting a text mode command without a console can arrange for GetConsoleOutputCP() to return 0 to the text command. The command should then output ANSI text, probably not for direct display, but for processing by the graphic app. The only example I heard about is The Bat!™ mailer calling GnuPG as crypto tool.

> Would "" be appropriate in that case?

Yes: setlocale(LC_ALL, "") should always select the ANSI charset, suitable for graphic mode apps. Finally GetACP() is not needed, as "" does implicitly the same.

Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so with Hushmail. This lacks necessary references and breaks threads.
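The choice of setlocale() argument discussed here can be isolated from the Windows calls. In the sketch below, codepage stands for whatever GetConsoleOutputCP() returned; the helper name locale_param_for_codepage is hypothetical, and the ".OCP" retry after a NULL return from setlocale() is left to the caller.

```c
#include <stdio.h>

/* Hypothetical helper: build the setlocale() argument for a console
   code page.  A code page of 0 (no console attached) maps to "", the
   default ANSI locale; anything else maps to ".<codepage>".
   Returns NULL if buf is too small. */
static const char *
locale_param_for_codepage (unsigned int codepage, char *buf, size_t bufsize)
{
  int len;

  if (codepage == 0)
    return "";
  len = snprintf (buf, bufsize, ".%u", codepage);
  if (len < 0 || (size_t) len >= bufsize)
    return NULL;
  return buf;
}
```

A caller would pass the result to setlocale(LC_ALL, ...), so that a code page of 0 never turns into the ".0" string worried about earlier in the thread.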
Re: Strange character in file length
Hi Hrvoje,

On Tuesday, February 21, 2006 at 21:35:24 +0100, Hrvoje Niksic wrote:

> Valery Kondakoff [EMAIL PROTECTED] writes:
>> wrong ANSI/OEM character encoding
> What are the steps a Windows console program needs to do to perform this conversion correctly?

Call setlocale(LC_ALL, ".OCP"), which will select the default OEM charset of the current Windows language. OCP means OEM Code Page, and console apps by default need to use this OEM charset: probably CP-852 for you, CP-850 for me, and so on. Here this setlocale ".OCP" returns "French_France.850".

Another, possibly better way, able to follow the current charset of the console (not only the default): call GetConsoleOutputCP(), get for example 850, build a string ".850" with the dot, and call setlocale(LC_ALL, ".850"). Problem: not every combination of language, country, and charset is possible. So deal with errors (setlocale returns NULL), and fall back to ".OCP".

Finally, if GetConsoleOutputCP() fails, returning 0, call GetACP() instead, as a fallback. This might eventually suit graphic frontends, which would need an ANSI codepage output.

I don't have what's needed to compile wget on Windows, otherwise I would have done a patch. MinGW32 and MSYS can't build wget, right? Anyway I attach a demo program:

| C:\home\ab>chcp
| Page de codes active : 850          # French console default
|
| C:\home\ab>win32-console-locale.exe
| locale=French_France.850
| codepage=850
| thousands_sep=" " (code FF)         # no-break space in CP-850
|
| C:\home\ab>chcp 28591               # that's Latin-1 code page
| Page de codes activeá: 28591
|
| C:\home\ab>win32-console-locale.exe
| locale=French_France.28591
| codepage=28591
| thousands_sep="á" (code A0)         # no-break space in Latin-1

Bye! Alain.
-- 
When you post a new message, beginning a new topic, use the mail or post or new message functions. When you reply or followup, use the reply or followup functions. Do not do the one for the other, this breaks or hijacks threads.
#include <stdio.h>
#include <locale.h>
#include <windows.h>

void Set_the_locale_for_the_fine_win32_console (void)
{
  char *locale;
  int codepage;
  char param[42];

  codepage = GetConsoleOutputCP();
  if (codepage) {
    sprintf(param, ".%d", codepage);
    locale = setlocale(LC_ALL, param);      /* use current console OEM charset */
    if (locale == NULL) {
      locale = setlocale(LC_ALL, ".OCP");   /* use system default OEM charset */
    }
  } else {
    locale = setlocale(LC_ALL, "");         /* use ANSI charset (for graphic apps) */
  }
  /* setlocale() may still have failed: never hand NULL to %s */
  printf("locale=%s\ncodepage=%d\n", locale ? locale : "(null)",
         codepage ? codepage : (int)GetACP());
}

int main (void)
{
  struct lconv *lconv;

  Set_the_locale_for_the_fine_win32_console();
  lconv = localeconv();
  /* print the separator itself and the code of its first byte */
  printf("thousands_sep=\"%s\" (code %02X)\n", lconv->thousands_sep,
         (unsigned char)lconv->thousands_sep[0]);
  return 0;
}
Re: Removing thousand separators from file size output
On Saturday, July 2, 2005 at 12:38:24 PM +0200, Hrvoje Niksic wrote:

> print numbers according to the locale.

Much thanks, Hrvoje!

> [full size] doesn't use the separators

Copy/pastability won over readability: Fine. You exposed the problem, heard others' arguments, and took a decision. Is it permitted to say that, even if I lost this battle, I very much like the way you deal with wget development? :-)

Bye! Alain.
-- 
When you post a new message, beginning a new topic, use the mail or post or new message functions. When you reply or followup, use the reply or followup functions. Do not do the one for the other, this breaks or hijacks threads.
Re: Removing thousand separators from file size output
Hello Tony,

On Friday, June 24, 2005 at 11:57:22 AM -0700, Tony Lewis wrote:

> Hrvoje Niksic wrote:
>> application that accepts numbers as Wget prints them.
> Microsoft Calculator does.

Not here. This seems to be locale dependent, requiring exact localized input. Here MS Calculator accepts a pasted "123 456 789,01" as the correct 123456789.01, but when wget's English "123,456,789.01" is pasted it fails, interpreting this as 123.456789 and beeping.

Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so with Lotus Notes 5. This lacks necessary references and breaks threads.
Re: Removing thousand separators from file size output
On Friday, June 24, 2005 at 6:45:44 PM +0200, Hrvoje Niksic wrote:

> input for other applications, which is very hard with the thousand separators.

Pasting is very hard, parsing is not. An app running wget can easily parse its output, whatever it is. If not directly, then through a wrapper. The problem is only with side-apps where the user must copy/paste. How frequently is that used? Removing separators will break existing apps parsing wget's output. Such apps exist?

> Alain Bench [EMAIL PROTECTED] writes:
>> Humans can have habit to look at exact unit size, or rounded kilo/mega/tera size, or both.
> omitting the thousand separators merely removes redundancy, not useful information.

That's true only if you assume the user analyses the /unit-size/ and /kmt-size/ as a whole, as a unique info. But that's not always the case. One may well look only at /unit-size/. Without seps, this user is forced to count digits, or to look additionally at /kmt-size/, and do some brainwork to find the corresponding order of magnitude. For this user, sep removal removes readability.

> If the users were so used to separators, they would surely request them in other programs, such as `ls', `du', or `df'?

Those 3 commands print numbers in right-aligned columns: the ergonomic need for seps is a little lower. And the "ls -l" filename truncation on 80 wide terms might be seen as a bigger annoyance: 3 seps added in size would mean 3 chars less in filename. And legacy behaviour *MUST* absolutely be retained for such old, widely used, and frequently machine-parsed commands. But anyway I would personally love to see separators here too.

> [localization] You can make a case that the correct character and layout should be used for digit grouping when it is deployed, but I don't see how you can argue that grouping *must* be used in all applications!

I agree. There are cases where localized grouping and even grouping alone are useless or harmful: each time the only or primary destination of a number is another app. But when the intended reader is human, localized grouping *should* be used. Unless a bigger unavoidable danger interferes. That's my humble opinion, but I believe it's also some more general ergonomic principle. I am able to buy the "small advantage over code complexity ratio" argument you once explained. But I somewhat regret having to buy it. BTW my "locale thousands_sep" gives a non-breaking space, and "locale decimal_point" gives a "," comma.

> As for localization, I'm not against it. The argument was that, where possible, I prefer the output of applications to remain parsable.

So we disagree only on the balance. I'd say output to humans should be localized as much as possible, unless this creates a really serious problem for the secondary, machine-parsing usage. Where incompatible, human and machine output may be separated. Say on option, or simultaneously like GnuPG --status-fd: the human reads stdout/err, while the machine parses another fd. That's material for the present debate, not my wish for wget.

> I consider the ISO 8601 date format a clear advantage over the asctime() format.

;-) Good example: I *hate* having to read 8601 dates. Nearly as much as having to read those other dates, localized or not, with month/day ambiguity. MHO only, here: I know some people love 8601.

Bye! Alain.
-- 
« if you believe subversive history books, I've got a bridge to sell you. »
Re: ChangeLog-branches
Hello Hrvoje,

On Thursday, June 23, 2005 at 9:00:44 PM +0200, Hrvoje Niksic wrote:

> the ChangeLog-branches directories distributed with Wget are desirable or necessary?

MHO: They are ununderstandable, unusable, unclean, and big. They may give a false bad impression of source/project misorganization. We want to drop them, wipe any proof of their existence from any archives and mirrors, then honestly deny they ever existed. No need to kill witnesses though: Who would believe them?

Bye! Alain.
-- 
Microsoft Outlook Express users concerned about readability: For much better viewing quotes in your messages, check the little freeware program OE-QuoteFix by Dominik Jain on <URL:http://flash.to/oblivion/>. It'll change your life. :-) Now exists also for Outlook.
Re: Removing thousand separators from file size output
On Thursday, June 23, 2005 at 3:16:28 PM +0200, Hrvoje Niksic wrote:

> Since Wget 1.10 also prints sizes in kilobytes/megabytes/etc., I am thinking of removing the thousand separators from size display.

IMHO thousand (or myriad) separators are necessary. This size display is primarily intended for humans, not for other apps. If separators constitute a difficulty for other apps, then it's these other apps' problem. Or sed's task (s/,//g). Humans can have habit to look at exact unit size, or rounded kilo/mega/tera size, or both. It would be a regression to reduce the readability of the legacy exact byte count, just because we have a newly added, more human-readable but rounded count.

> The separators are interpunction which introduces clutter, especially with complex size output also containing the remaining size next to the whole size.

True: The more info, the more confusion. But that's the contrary of a valid reason to reduce the readability of that info. And IMHO removing thousand separators reduces readability.

> replace the , character with the character mandated by the locale

This seems naturally desirable. I don't really understand nor follow your reasons against localization. User's cultural preferences should be respected. OTOS this is neither so important nor so urgent, compared to the cons of thousand separators removal.

Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so from a digest. This often builds incorrect references and breaks threads.
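For an app that would rather not shell out to sed, the s/,//g step mentioned above is a few lines of C. A minimal sketch; the helper name strip_separator is made up, and it assumes a single-byte separator (a multi-byte localized separator would need a string match instead).

```c
/* Remove every occurrence of the single-byte separator sep from s,
   in place, like sed 's/,//g' does for ",".  Returns s. */
static char *
strip_separator (char *s, char sep)
{
  char *src = s;
  char *dst = s;

  while (*src != '\0')
    {
      if (*src != sep)
        *dst++ = *src;          /* keep everything but the separator */
      src++;
    }
  *dst = '\0';
  return s;
}
```

Applied to a writable copy of "123,456,789" with ',' as separator, this leaves "123456789", ready for strtol() or similar.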
Re: Character encoding
Hello Georg,

On Friday, April 1, 2005 at 12:01:15 PM +0200, Georg Bauhaus wrote:

> The apostrophy might have been typed as an accent (acute) really

Most probably it is the RIGHT SINGLE QUOTATION MARK U+2019, "’", encoded in UTF-8, then wrongly seen as being CP-1252. It would look like "â€™" (a circumflex, euro symbol, trademark sign), and once transliterated to Latin-1 like "âEUR(tm)".

Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so from a digest. This often builds incorrect references and breaks threads.
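The three bytes behind that misreading can be checked with a small encoder. The sketch below handles only code points up to U+FFFF, and the function name utf8_encode_bmp is invented for this illustration. For U+2019 it produces E2 80 99, which CP-1252 displays as a circumflex, euro sign, trademark sign.

```c
#include <stddef.h>

/* Encode a code point below U+10000 (surrogates excluded) as UTF-8.
   Returns the number of bytes written to out (1 to 3), or 0 if the
   code point is outside the supported range. */
static size_t
utf8_encode_bmp (unsigned int cp, unsigned char out[3])
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000 && (cp < 0xD800 || cp > 0xDFFF))
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  return 0;
}
```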
Re: Gmane
Hello Hrvoje, wishing you all well!

On Saturday, February 19, 2005 at 6:20:52 PM +0100, Hrvoje Niksic wrote:

> I propose to make this list available via gmane, www.gmane.com. It buys us good archiving, as well as NNTP access. Is there anyone who would object to that?

There are pros and cons. Wider audience and potential contributors. But greater exposure to spam, both for the list and its members. And that infamous gmane message-id overwriting, that breaks our threads.

BTW Hrvoje, do you want a gzipped mbox of missed wget-patches posts? Please give me date boundaries, I'll be happy to help.

Bye! Alain.
-- 
Microsoft Outlook Express users concerned about readability: For much better viewing quotes in your messages, check the little freeware program OE-QuoteFix by Dominik Jain on <URL:http://flash.to/oblivion/>. It'll change your life. :-) Now exists also for Outlook.
Re: utf-8 encoded html documents
Hello George,

On Tuesday, February 1, 2005 at 7:49:55 AM -0800, George Prekas wrote:

> I am using wget 1.9.1 under Windows XP and I have noticed that it is completely incapable of handling utf-8 encoded html documents.

I am not aware of any problem with UTF-8 pages: they just work fine. What error do you get, and what correct result would you want? What sort of handling are you talking about? If it's transcoding from UTF-8 to whatever charset you use, that's probably not Wget's job. Good browsers should be able to read a UTF-8 file (unless the header is malformed). Otherwise there are recoding tools.

> check it out for your self here: http://www.tsiamoulisschools.gr

I only get: does not exist (Authoritative answer).

Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so from a digest. This often builds incorrect references and breaks threads.
Re: Suggestion, --range
Hello Robert,

On Thursday, September 30, 2004 at 6:36:43 PM +0200, Robert Thomson wrote:

> It would be really advantageous if wget had a --range command line argument, that would download a range of bytes of a file, if the server supports it.

You could try the feature patch posted by Rodrigo S. Wanderley last year on the wget mailing list. The guy did the work, and nobody gave feedback :-\. See [EMAIL PROTECTED].

Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so from a digest. This often builds incorrect references and breaks threads.
Re: timeout on closing connection
On Saturday, November 29, 2003 at 4:15:19 PM +0100, Hrvoje Niksic wrote:

> Alain Bench [EMAIL PROTECTED] writes:
>> I sometimes seem to be stuck in an overly long (like more than 1 hour) timeout on closing connection
> during the kernel close() call? Did you confirm that with trace?

No, but I'll try strace next time it happens. I don't really understand what's going on. It's only that I saw Wget still running long after hangup, and netstat showed odd things such as a connection still in closing state (FIN wait 1 or such). Killing Wget with ^C cleaned the netstat. Nothing more precise, and of course now that I'd want to analyse things, it doesn't happen anymore... ;-)

Bye! Alain.
timeout on closing connection
Hello, Wget 1.9.1: I sometimes seem to be stuck in an overly long (like more than 1 hour) timeout on closing connection, when server went down or modem hangup during a read or just before close. I use Wget's default timeout (0, 0, 900), or sometimes --timeout=30 (30, 30, 30), and understand it's for name resolution, initial connect, and read. But what about close? Bye!Alain. -- Give your computer's unused idle processor cycles to a scientific goal: The [EMAIL PROTECTED] project at URL:http://genomeathome.stanford.edu/.
Re: keep alive connections
On Tuesday, November 11, 2003 at 2:41:31 PM +0100, Hrvoje Niksic wrote:

> Alain Bench [EMAIL PROTECTED] writes:
>> with --timestamping: Each HEAD and each possible GET uses a new connection.
> I think the difference is that Wget closes the connection when it decides not to read the request body.

OK, I wasn't aware of the spurious HEAD bodies problem. But Wget also closes the connection between a GET (with body) and the HEAD for the next file.

> But maybe it would actually be a better idea to read (and discard) the body than to close the connection and reopen it.

Hum... Would it be possible to close/reopen only if, and as soon as, the first byte of a spurious body comes? I guess it could be difficult to deal cleanly with the next file in limit cases...

>> | Keep-Alive: timeout=15, max=5
>> Without --timestamping Wget keeps "Reusing fd 3." and closing it only once every 6 files (first + 5 more).
> This might be due to redirections.

No redirections involved: that closure is normal, due to the max=5 the server responds to the first request. At the second GET it's max=4, and it gets decremented each time. Finally at the 6th request there are no more Connection: nor Keep-Alive: fields. The /etc/apache/httpd.conf says:

| # KeepAlive: The number of Keep-Alive persistent requests to accept
| # per connection. Set to 0 to deactivate Keep-Alive support
| KeepAlive 5
|
| # KeepAliveTimeout: Number of seconds to wait for the next request
| KeepAliveTimeout 15

Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so from a digest. This often builds incorrect references and breaks threads.
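The countdown described above is easy to observe by pulling the two numbers out of each response. A rough sketch of such a parser follows; parse_keep_alive is a hypothetical name, it expects only the header value (the part after "Keep-Alive:"), and it says nothing about how Wget itself handles the header.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical parser for a Keep-Alive header value such as
   "timeout=15, max=5".  Stores both numbers and returns 1 when both
   parameters are present, 0 otherwise. */
static int
parse_keep_alive (const char *value, int *timeout, int *max)
{
  const char *t = strstr (value, "timeout=");
  const char *m = strstr (value, "max=");

  if (t == NULL || m == NULL)
    return 0;
  if (sscanf (t, "timeout=%d", timeout) != 1)
    return 0;
  if (sscanf (m, "max=%d", max) != 1)
    return 0;
  return 1;
}
```

On the server in this thread, successive responses on one connection would give max=5, 4, 3, and so on, until the response that carries no Keep-Alive field at all, after which a new connection is needed.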
Re: keep alive connections
Hello Hrvoje,

On Friday, November 7, 2003 at 11:50:53 PM +0100, Hrvoje Niksic wrote:

> Wget uses the `Keep-Alive' request header to request persistent connections, and understands both the HTTP/1.0 `Keep-Alive' and the HTTP/1.1 `Connection: keep-alive' response header.

This doesn't seem to work together with --timestamping: Each HEAD and each possible GET uses a new connection. The server keeps responding:

| HTTP/1.0 200 OK
| [...]
| Connection: Keep-Alive
| Keep-Alive: timeout=15, max=5

But Wget 1.9 does each time:

| Created socket 3.
| [snip request/response]
| Registered fd 3 for persistent reuse.
| Closing fd 3
| Invalidating fd 3 from further reuse.
| Remote file is newer, retrieving.
| Created socket 3.
| [and so on]

Tcpdump confirms the TCP session is FIN closed by Wget. Without --timestamping Wget keeps "Reusing fd 3." and closing it only once every 6 files (first + 5 more). At this moment the FIN would in any case be initiated by the server if not by Wget.

Test made on an old Apache 1.1.3, but it seems the same with other servers.

BTW, it's nice to see you back and active, Hrvoje! :-)

Bye! Alain.
-- 
Mutt 1.5.5.1 is released.
Re: Another space problem
Hello Matt,

On Sunday, July 14, 2002 at 1:51:28 PM +1200, Matt wrote:

> The actual command in the script is: wget [...] $1 However, sometimes the directories have spaces in them.

That's not a wget issue, just a basic script programming one: you must quote the parameter inside the script too, as "$1".

> From: Matt matt[EMAIL PROTECTED]
> Reply-To: mattsarah@n!o!s!p!a!m!email.message.co_nz
> You may need to remove n!o!s!p!a!m! to reply

Interesting: made this way, it's _only_ an annoyance for repliers, not at all for spammers. Is it really what you intended to do?

HTH, and bye! Alain.