Re: Strange character in file length

2006-03-09 Thread Alain Bench
 On Tuesday, March 7, 2006 at 8:56:15 +0100, Hrvoje Niksic wrote:

 Alain Bench [EMAIL PROTECTED] writes:
 unusable nonsense
 totally weird.

So far Google didn't help me much. Or rather discouraged me with
"Borland supports only C locale"-like statements. But I didn't find any
official doc. Does anybody have info on Borland's own libc?


 Either there is a magic option I missed, or I'd recommend to treat
 Borland as C locale (forcing comma separator and grouping by 3).
 I suggest we do the latter

The alternative seemingly would be: Change "3;0" to "\003\000", and
transcode the separator from GetACP() to GetConsoleOutputCP(). Would it
work in all cases? Which transcoding function? And all this in
Borland-specific code...


 against 1.10.x? It's always a better idea to patch the trunk code

Sure: I installed Subversion, and began learning it, just to put my
hands on the said trunk (found no tar.gz snapshots?). But then the
"conflicting types for `uintptr_t'" error stopped me.


Bye! Alain.
-- 
When you post a new message, beginning a new topic, use the mail or
post or new message functions.
When you reply or followup, use the reply or followup functions.
Do not do the one for the other, this breaks or hijacks threads.


Re: Strange character in file length

2006-03-06 Thread Alain Bench
 On Thursday, March 2, 2006 at 22:28:28 +0100, Hrvoje Niksic wrote:

 you can get a free compiler from here:
 http://www.borland.com/downloads/download_cbuilder.html

Nice tip, thank you!


Bad news: Borland 5.5.1 seems to do locales in its own way. Not at
all as I explained here, about msvcrt.dll. Seems that:

 · setlocale() always returns a composite string with newlines (as if
categories were not identical). Example: setlocale(LC_ALL, ".852")
returns:

| LC_MONETARY=French_France.852
| LC_TIME=French_France.852
| LC_NUMERIC=French_France.852
| LC_COLLATE=French_France.852
| LC_CTYPE=French_France.852

 · setlocale() does French_France.850, not following chcp.

 · It doesn't know about ".OCP" nor ".ACP", but still accepts "C",
".1252" and such.

 · Whatever setlocale charset, localeconv() only outputs CP-1252.

 · localeconv() grouping gives an Ascii string "3;0" (or "3;2;0" for
Indian). Wget of course groups by... 51 digits (Ascii code of "3").

Seems a little bit like unusable nonsense to me. Either there is a
magic option I missed, or I'd recommend to treat Borland as C locale
(forcing comma separator and grouping by 3).
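
For illustration, the mismatch described above can be bridged with a small conversion. This is only a sketch: the helper name grouping_from_text() is hypothetical, and it handles just the textual forms reported in this message ("3;0", "3;2;0"), turning them into the byte form that localeconv() normally returns in its grouping member ("\003", "\003\002").

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert a textual grouping such as "3;0" or
   "3;2;0" into byte form ("\003", "\003\002").  The text must end with
   a "0" element, which becomes the terminating NUL byte.
   Returns the number of group sizes written (not counting the NUL),
   or -1 on malformed input or if `out` is too small. */
int grouping_from_text(const char *text, char *out, size_t outsize)
{
    size_t n = 0;
    const char *p = text;

    while (n < outsize) {
        char *end;
        long v = strtol(p, &end, 10);

        if (end == p || v < 0 || v >= CHAR_MAX)
            return -1;              /* not a number, or out of range */
        out[n] = (char)v;
        if (v == 0)                 /* "0" element closes the list */
            return (*end == '\0') ? (int)n : -1;
        n++;
        if (*end != ';')
            return -1;              /* missing separator or final "0" */
        p = end + 1;
    }
    return -1;                      /* out of room */
}
```

With "3;0" as input the buffer receives the single byte 3 followed by NUL, so a caller that reads grouping[0] as a number gets 3 instead of 51.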


The test code works well on MinGW. Wget itself doesn't like the
unixish ./configure and make procedure under Msys 1.0.10, but I found
the "configure.bat --mingw" way to use MinGW 3.1.0 directly. Wget 1.10.2
compiles that way, and grouping, charset, and decimal point seemingly
work well with the attached patch.

BTW I had a problem compiling straight subversion trunk:

| F:\wget-r2129\src>mingw32-make.exe
| gcc -DWINDOWS -DHAVE_CONFIG_H -O3 -Wall -I.   -c -o cmpt.o cmpt.c
| In file included from wget.h:89,
|  from cmpt.c:43:
| sysdep.h:199: warning: redefinition of `uint32_t'
| c:/MinGW/include/stdint.h:32: warning: `uint32_t' previously declared here
| sysdep.h:215: conflicting types for `uintptr_t'
| c:/MinGW/include/stdint.h:61: previous declaration of `uintptr_t'
| mingw32-make.exe: *** [cmpt.o] Error 1


Bye! Alain.
-- 
How To Ask Questions The Smart Way
URL:http://www.catb.org/~esr/faqs/smart-questions.html


wget-1.10.2.win32-setlocale.1.patch.gz
Description: application/gunzip


Re: Strange character in file length

2006-03-02 Thread Alain Bench
 On Thursday, March 2, 2006 at 7:51:43 +0100, Hrvoje Niksic wrote:

 Then the code could look like this:

Seems good to me. I can help testing, if someone compiles.


Bye! Alain.
-- 
Give your computer's unused idle processor cycles to a scientific goal:
The Folding@home project at URL:http://folding.stanford.edu/.


Re: Strange character in file length

2006-03-01 Thread Alain Bench
 On Saturday, February 25, 2006 at 21:06:19 +0100, Hrvoje Niksic wrote:

 Is the current charset of the console ever really different than the
 default OEM charset?

They are identical by default. But the first can be changed in each
console window, while the latter is fixed on a given Windows install.


 How does one change the console charset, anyway?

Thru chcp command in a cmd.exe session, or thru a call to
SetConsoleCP() or SetConsoleOutputCP() in an app.


 the setlocale invocation should look like this:

Hum... You dropped the fallback to ANSI when GetConsoleOutputCP()
returns 0. That's fine, if it's considered useless. But it could lead to
a setlocale(LC_ALL, ".0"), with unknown behaviour. Hopefully it then
fails, returning NULL, as it does in my setup. But I'm not sure it does
that in all setups.
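
A guard against the ".0" case could look like the sketch below. The helper name codepage_locale_arg() is hypothetical; the code page value is assumed to come from GetConsoleOutputCP(), which is deliberately not called here so that the fragment compiles on any platform. Mapping 0 to "" is one possible choice of fallback, not something the code under discussion is known to do.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build the setlocale() argument for a Windows
   code page number.  A value of 0 (no usable console code page) maps
   to "" instead of the meaningless ".0". */
const char *codepage_locale_arg(unsigned int codepage,
                                char *buf, size_t bufsize)
{
    if (codepage == 0)
        return "";
    snprintf(buf, bufsize, ".%u", codepage);
    return buf;
}
```

The caller would pass the result to setlocale(LC_ALL, ...) and still check for a NULL return, since not every code page is accepted.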


 Wget is calling setlocale(LC_ALL, "") only if HAVE_NLS is defined,
 which is typically not the case on Windows, as HAVE_NLS implies
 existence of gettext, textdomain, and bindtextdomain.

Also utils.c:get_grouping_data() does setlocale(LC_NUMERIC, "")
temporarily. On some platforms, including Windows, setting LC_NUMERIC
alone is not guaranteed to have the desired effect. For example, on
Windows the call with ".850" succeeds, but selects the default ANSI
locale (or whatever was set by LC_ALL).

In fact it seems to me that all manipulations of a category alone,
outside of LC_ALL, are calling for problems on this or that platform.
Especially when the charsets are incompatible between categories (up to
segfaults, when cooperating with some buggy Glibc versions).

Now, if the main setlocale(LC_ALL) is always called, I believe that
get_grouping_data() can be greatly simplified, dropping the
#ifdef LC_NUMERIC setlocale(LC_NUMERIC). And just calling localeconv()
if it exists.
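
A minimal sketch of that simplification follows. It is not Wget's actual get_grouping_data(); the function name and the fallback values (comma, groups of 3, as for the C locale) are assumptions made for illustration. It relies only on localeconv() and expects the program to have called setlocale(LC_ALL, ...) once at startup.

```c
#include <locale.h>

/* Sketch only: fetch the thousands separator and grouping from the
   current locale without touching LC_NUMERIC.  If the locale defines
   no separator (as the "C" locale does), fall back to a comma and
   groups of three digits. */
void grouping_data_sketch(const char **sep, const char **grouping)
{
    struct lconv *lc = localeconv();

    *sep = lc->thousands_sep;
    *grouping = lc->grouping;

    if (**sep == '\0') {
        *sep = ",";
        *grouping = "\003";
    }
}
```

The returned pointers refer to storage owned by the C library, so they should be used or copied before the next setlocale() or localeconv() call.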


 Testers able to compile Wget and reproduce this problem would be much
 appreciated.

Beware: The default console font "Terminal", having only an OEM/DOS
script, is unable to correctly follow chcp commands. A smarter font
like "Lucida Console" is better suited for testing.


Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so with
Lotus Notes 5. This lacks necessary references and breaks threads.


Re: Strange character in file length

2006-03-01 Thread Alain Bench
 On Wednesday, March 1, 2006 at 16:13:17 +0100, Hrvoje Niksic wrote:

 Alain Bench [EMAIL PROTECTED] writes:
 fallback to ANSI when GetConsoleOutputCP() returns 0.
 I didn't know it could return 0.

I don't know exactly how, but it can. Apparently a graphic frontend
starting a text mode command without a console can arrange
GetConsoleOutputCP() to return 0 to the text command. The command should
then output ANSI text, probably not for direct display, but for
processing by the graphic app. The only example I heard about is
The Bat!™ mailer calling GnuPG as crypto tool.


 Would "" be appropriate in that case?

Yes: setlocale(LC_ALL, "") should always select the ANSI charset,
suitable for graphic mode apps. Finally GetACP() is not needed, as ""
does implicitly the same.


Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so with
Hushmail. This lacks necessary references and breaks threads.


Re: Strange character in file length

2006-02-25 Thread Alain Bench
Hi Hrvoje,

 On Tuesday, February 21, 2006 at 21:35:24 +0100, Hrvoje Niksic wrote:

 Valery Kondakoff [EMAIL PROTECTED] writes:
 wrong ANSI/OEM character encoding
 What are the steps a Windows console program needs to do to perform
 this conversion correctly?

Call setlocale(LC_ALL, ".OCP") which will select the default OEM
charset of the current Windows language. OCP means OEM Code Page, and
console apps by default need to use this OEM charset: Probably CP-852
for you, CP-850 for me, and so on. Here this setlocale ".OCP" returns
"French_France.850".

Another possibly better way, able to follow the current charset of
the console (not only the default): Call GetConsoleOutputCP(), get
for example 850, build a string ".850" with the dot, and call
setlocale(LC_ALL, ".850"). Problem: Not every combination of language,
country, and charset is possible. So deal with errors (setlocale returns
NULL), and fall back to ".OCP".

Finally if GetConsoleOutputCP() fails returning 0, call GetACP()
instead, as a fallback. This might eventually suit graphic frontends,
which would need an ANSI codepage output.


I don't have what's needed to compile wget on Windows, otherwise I
would have done a patch. MinGW32 and MSYS can't build wget, right?
Anyway I attach a demo program:

| C:\home\ab>chcp
| Page de codes active : 850      # French console default
|
| C:\home\ab>win32-console-locale.exe
| locale=French_France.850
| codepage=850
| thousands_sep=" " (code FF)   # no-break space in CP-850
|
| C:\home\ab>chcp 28591           # that's Latin-1 code page
| Page de codes activeá: 28591
|
| C:\home\ab>win32-console-locale.exe
| locale=French_France.28591
| codepage=28591
| thousands_sep="á" (code A0)   # no-break space in Latin-1


Bye! Alain.

#include <stdio.h>
#include <locale.h>
#include <windows.h>

/* Select a locale whose charset matches the Win32 console. */
static void Set_the_locale_for_the_fine_win32_console (void) {
  char *locale;
  unsigned int codepage;
  char param[42];

  codepage = GetConsoleOutputCP();
  if (codepage) {
    sprintf(param, ".%u", codepage);
    locale = setlocale(LC_ALL, param);    /* use current console OEM charset */
    if (locale == NULL) {
      locale = setlocale(LC_ALL, ".OCP"); /* use system default OEM charset */
    }
  }
  else {
    locale = setlocale(LC_ALL, "");       /* use ANSI charset (for graphic apps) */
  }

  /* setlocale() may have failed: never pass NULL to %s */
  printf("locale=%s\ncodepage=%u\n", locale ? locale : "(null)",
         codepage ? codepage : GetACP());
}

int main (void) {
  struct lconv *lconv;

  Set_the_locale_for_the_fine_win32_console();

  lconv = localeconv();
  printf("thousands_sep=\"%s\" (code %02X)\n",
         lconv->thousands_sep,
         (unsigned char)lconv->thousands_sep[0]);
  return 0;
}


Re: Removing thousand separators from file size output

2005-07-03 Thread Alain Bench
 On Saturday, July 2, 2005 at 12:38:24 PM +0200, Hrvoje Niksic wrote:

 print numbers according to the locale.

Much thanks, Hrvoje!


 [full size] doesn't use the separators

Copy/pastability won over readability: Fine. You exposed the
problem, heard others' arguments, and took a decision. Is it permitted to
say that, even if I lost this battle, I very much like the way you deal
with wget development? :-)


Bye! Alain.


Re: Removing thousand separators from file size output

2005-07-02 Thread Alain Bench
Hello Tony,

 On Friday, June 24, 2005 at 11:57:22 AM -0700, Tony Lewis wrote:

 Hrvoje Niksic wrote:
 application that accepts numbers as Wget prints them.
 Microsoft Calculator does.

Not here. This seems to be locale dependent, requiring exact
localized input. Here MS Calculator accepts pasted "123 456 789,01" as
correct 123456789.01, but when pasted wget's English "123,456,789.01" it
fails, interpreting this as 123.456789 and beeping.


Bye! Alain.


Re: Removing thousand separators from file size output

2005-06-25 Thread Alain Bench
 On Friday, June 24, 2005 at 6:45:44 PM +0200, Hrvoje Niksic wrote:

 input for other applications, which is very hard with the thousand
 separators.

Pasting is very hard, parsing is not. An app running wget can easily
parse its output, whatever it is. If not directly then thru a wrapper.
The problem is only with side-apps where user must copy/paste. How
frequently is that used?

Removing separators will break existing apps parsing wget's output.
Do such apps exist?


 Alain Bench [EMAIL PROTECTED] writes:
 Humans can have habit to look at exact unit size, or rounded
 kilo/mega/tera size, or both.
 omitting the thousand separators merely removes redundancy, not useful
 information.

That's true only if you assume the user analyses the /unit-size/ and
/kmt-size/ as a whole, as a unique info. But that's not always the case.
One may well look only at /unit-size/. Without seps, this user is forced
to count digits, or to look additionally to /kmt-size/, and do some
brainwork to find corresponding order of magnitude. For this user, sep
removal removes readability.


 If the users were so used to separators, they would surely request
 them in other programs, such as `ls', `du', or `df'?

Those 3 commands print numbers in right-aligned columns: The
ergonomic need for seps is a little lower. And the "ls -l" filename
truncation on 80 wide terms might be seen as a bigger annoyance: 3 seps
added in size would mean 3 chars less in filename. And legacy behaviour
*MUST* absolutely be retained for such old, widely used, and frequently
machine-parsed commands.

But anyway I would personally love to see separators here too.


[localization]
 You can make a case that the correct character and layout should be
 used for digit grouping when it is deployed, but I don't see how you
 can argue that grouping *must* be used in all applications!

I agree. There are cases where localized grouping and even grouping
alone are useless or harmful: Each time the only or primary destination
of a number is another app.

But when the intended reader is human, localized grouping *should*
be used. Unless a bigger unavoidable danger interferes. That's my humble
opinion, but I believe it's also some more general ergonomic principle.

I am able to buy the small advantage over code complexity ratio
argument you once explained. But I somewhat regret having to buy it.

BTW my "locale thousands_sep" gives a " " non-breaking space, and
"locale decimal_point" gives a "," comma.


 As for localization, I'm not against it. The argument was that, where
 possible, I prefer the output of applications to remain parsable.

So we disagree only on the balance. I'd say output to humans should
be localized as much as possible, unless this creates a really serious
problem for the machine parsing secondary usage.

Where incompatible, human and machine output may be separated. Say
via an option, or simultaneously like GnuPG --status-fd: Human reads
stdout/err, while machine parses another fd. That's material for the
present debate, not my wish for wget.


 I consider the ISO 8601 date format a clear advantage over the
 asctime() format.

;-) Good example: I *hate* having to read 8601 dates. Nearly as much
as having to read those other dates, localized or not, with month/day
ambiguity. MHO only, here: I know some people love 8601.


Bye! Alain.
-- 
« if you believe subversive history books, I've got a bridge to sell you. »


Re: ChangeLog-branches

2005-06-24 Thread Alain Bench
Hello Hrvoje,

 On Thursday, June 23, 2005 at 9:00:44 PM +0200, Hrvoje Niksic wrote:

 the ChangeLog-branches directories distributed with Wget are desirable
 or necessary?

MHO: They are ununderstandable, unusable, unclean, and big. They may
give a false bad impression of source/project misorganization. We want
to drop them, wipe any proof of their existence from any archives and
mirrors, then honestly deny they ever existed. No need to kill witnesses
though: Who would believe them?


Bye! Alain.
-- 
Microsoft Outlook Express users concerned about readability: For much
better viewing quotes in your messages, check the little freeware
program OE-QuoteFix by Dominik Jain on URL:http://flash.to/oblivion/.
It'll change your life. :-) Now exists also for Outlook.


Re: Removing thousand separators from file size output

2005-06-24 Thread Alain Bench
 On Thursday, June 23, 2005 at 3:16:28 PM +0200, Hrvoje Niksic wrote:

 Since Wget 1.10 also prints sizes in kilobytes/megabytes/etc., I am
 thinking of removing the thousand separators from size display.

IMHO thousand (or myriad) separators are necessary.

This size display is primarily intended for humans, not for other
apps. If separators constitute a difficulty for other apps, then it's
these other apps' problem. Or sed's task (s/,//g).
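
For an app that post-processes the output without sed, the same clean-up is a few lines of C. This is a sketch under one assumption: the separator is a single byte, as with the comma. The function name strip_separator() is made up for the example.

```c
#include <stddef.h>

/* Remove every occurrence of the byte `sep` from `s`, in place.
   Returns the new length of the string. */
size_t strip_separator(char *s, char sep)
{
    char *dst = s;
    const char *src;

    for (src = s; *src != '\0'; src++) {
        if (*src != sep)
            *dst++ = *src;      /* keep everything but the separator */
    }
    *dst = '\0';
    return (size_t)(dst - s);
}
```

Applied to "123,456,789" with ',' as separator, the buffer ends up holding "123456789".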

Humans can have habit to look at exact unit size, or rounded
kilo/mega/tera size, or both. It would be a regression to reduce
readability of legacy exact bytes count, just because we have a new
added more human-readable but rounded count.


 The separators are interpunction which introduces clutter, especially
 with complex size output also containing the remaining size next to
 the whole size.

True: The more info, the more confusion. But that's the contrary of
a valid reason to reduce readability of those infos. And IMHO removing
thousand separators reduces readability.


 replace the "," character with the character mandated by the locale

This seems naturally desirable. I don't really understand nor follow
your reasons against localization. User's cultural preferences should be
respected.

OTOS this is not so important nor urgent, compared to thousand
separators removal cons.


Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so from a
digest. This often builds incorrect references and breaks threads.


Re: Character encoding

2005-04-06 Thread Alain Bench
Hello Georg,

 On Friday, April 1, 2005 at 12:01:15 PM +0200, Georg Bauhaus wrote:

 The apostrophe might have been typed as an accent (acute) really

Most probably the RIGHT SINGLE QUOTATION MARK U+2019, "’", encoded
in UTF-8, then wrongly seen as being CP-1252. It would look like "â€™"
(a circumflex, euro symbol, trademark sign), and once transliterated to
Latin-1 like "âEUR(tm)".
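
The three bytes behind this can be checked with a small encoder. The sketch below is a generic UTF-8 encoder written for illustration (the name utf8_encode() is not from any library mentioned in this thread). For U+2019 it yields E2 80 99, and in CP-1252 those byte values are a circumflex, the euro symbol, and the trademark sign.

```c
#include <stddef.h>

/* Encode the code point `cp` as UTF-8 into `out` (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is above U+10FFFF.
   Surrogate code points are not rejected in this sketch. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```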


Bye! Alain.


Re: Gmane

2005-02-20 Thread Alain Bench
Hello Hrvoje, wishing you all well!

 On Saturday, February 19, 2005 at 6:20:52 PM +0100, Hrvoje Niksic wrote:

 I propose to make this list available via gmane, www.gmane.com. It
 buys us good archiving, as well as NNTP access. Is there anyone who
 would object to that?

There are pros and cons. Wider audience and potential contributors.
But greater exposure to spam, both of the list and its members. And
that infamous gmane message-id overwriting, that breaks our threads.


BTW Hrvoje, do you want a gzipped mbox of missed wget-patches posts?
Please give me date boundaries, I'll be happy to help.


Bye! Alain.


Re: utf-8 encoded html documents

2005-02-05 Thread Alain Bench
Hello George,

 On Tuesday, February 1, 2005 at 7:49:55 AM -0800, George Prekas wrote:

 I am using wget 1.9.1 under Windows XP and I have noticed that it is
 completely incapable of handling utf-8 encoded html documents.

I am not aware of any problem with UTF-8 pages: They just work fine.
What error do you get, and what correct result would you want? What sort
of handling are you talking about?

If it's transcoding from UTF-8 to whatever charset you use, that's
probably not Wget's job. Good browsers should be able to read UTF-8
files (unless the header is malformed). Otherwise there are recoding
tools.


 check it out for your self here: http://www.tsiamoulisschools.gr

I only get: "does not exist (Authoritative answer)".


Bye! Alain.


Re: Suggestion, --range

2004-10-01 Thread Alain Bench
Hello Robert,

 On Thursday, September 30, 2004 at 6:36:43 PM +0200, Robert Thomson wrote:

 It would be really advantageous if wget had a --range command line
 argument, that would download a range of bytes of a file, if the
 server supports it.

You could try the feature patch posted by Rodrigo S. Wanderley last
year on the wget mailing list. The guy did the work, and nobody gave
feedback :-\. See [EMAIL PROTECTED].


Bye! Alain.


Re: timeout on closing connection

2003-12-01 Thread Alain Bench
 On Saturday, November 29, 2003 at 4:15:19 PM +0100, Hrvoje Niksic wrote:

 Alain Bench [EMAIL PROTECTED] writes:
 I sometimes seem to be stuck in an overly long (like more than 1
 hour) timeout on closing connection
 during the kernel close() call? Did you confirm that with trace?

No, but I'll try strace next time it happens. I don't really
understand what's going on. It's only I saw Wget still running long
after hangup, and netstat showed odd things as a connection still in
closing state (FIN wait 1 or such). Killing Wget by ^C cleaned the
netstat. Nothing more precise, and of course now I'd want to analyse
things, it doesn't happen anymore... ;-)


Bye! Alain.


timeout on closing connection

2003-11-29 Thread Alain Bench
Hello,

Wget 1.9.1: I sometimes seem to be stuck in an overly long (like
more than 1 hour) timeout on closing connection, when server went down
or modem hangup during a read or just before close. I use Wget's default
timeout (0, 0, 900), or sometimes --timeout=30 (30, 30, 30), and
understand it's for name resolution, initial connect, and read. But what
about close?


Bye! Alain.
-- 
Give your computer's unused idle processor cycles to a scientific goal:
The Genome@home project at URL:http://genomeathome.stanford.edu/.


Re: keep alive connections

2003-11-12 Thread Alain Bench
 On Tuesday, November 11, 2003 at 2:41:31 PM +0100, Hrvoje Niksic wrote:

 Alain Bench [EMAIL PROTECTED] writes:
 with --timestamping: Each HEAD and each possible GET uses a new
 connection.
 I think the difference is that Wget closes the connection when it
 decides not to read the request body.

OK, wasn't aware of the spurious HEAD bodies problem. But Wget also
closes the connection between a GET (with body) and the HEAD for the
next file.


 But maybe it would actually be a better idea to read (and discard) the
 body than to close the connection and reopen it.

Hum... Would it be possible to close/reopen only if, and as soon as,
first byte of spurious body comes? I guess it could be difficult to deal
cleanly with next file in limit cases...


| Keep-Alive: timeout=15, max=5
 Without --timestamping Wget keeps "Reusing fd 3." and closing it only
 once every 6 files (first + 5 more).
 This might be due to redirections.

No redirections involved: That closure is normal, due to the "max=5"
the server sends in response to the first request. At the second GET
it's "max=4", and it gets decremented each time. Finally at the 6th
request there are no more "Connection:" nor "Keep-Alive:" fields. The
/etc/apache/httpd.conf says:

| # KeepAlive: The number of Keep-Alive persistent requests to accept
| # per connection. Set to 0 to deactivate Keep-Alive support
| KeepAlive 5
|
| # KeepAliveTimeout: Number of seconds to wait for the next request
| KeepAliveTimeout 15


Bye! Alain.


Re: keep alive connections

2003-11-11 Thread Alain Bench
Hello Hrvoje,

 On Friday, November 7, 2003 at 11:50:53 PM +0100, Hrvoje Niksic wrote:

 Wget uses the `Keep-Alive' request header to request persistent
 connections, and understands both the HTTP/1.0 `Keep-Alive' and the
 HTTP/1.1 `Connection: keep-alive' response header.

This doesn't seem to work together with --timestamping: Each HEAD
and each possible GET uses a new connection. The server keeps
responding:

| HTTP/1.0 200 OK
| [...]
| Connection: Keep-Alive
| Keep-Alive: timeout=15, max=5

But Wget 1.9 does each time:

| Created socket 3.
| [snip request/response]
| Registered fd 3 for persistent reuse.
| Closing fd 3
| Invalidating fd 3 from further reuse.
| Remote file is newer, retrieving.
| Created socket 3.
| [and so on]

Tcpdump confirms the TCP session is FIN closed by Wget.

Without --timestamping Wget keeps "Reusing fd 3." and closing it
only once every 6 files (first + 5 more). At this moment the FIN would
in any case be initiated by the server if not by Wget. Test made on an
old Apache 1.1.3, but it seems the same with other servers.


BTW, it's nice to see you back and active, Hrvoje! :-)


Bye! Alain.
-- 
Mutt 1.5.5.1 is released.


Re: Another space problem

2002-07-16 Thread Alain Bench

Hello Matt,

 On Sunday, July 14, 2002 at 1:51:28 PM +1200, Matt wrote:

 The actual command in the script is: wget [...] $1
 However, sometimes the directories have spaces in them.

That's not a wget issue, just a basic script programming one: You
must quote the parameter also inside the script, as "$1".


 From: Matt matt[EMAIL PROTECTED]
 Reply-To: mattsarah@n!o!s!p!a!m!email.message.co_nz

 You may need to remove n!o!s!p!a!m! to reply

Interesting: made this way, it's _only_ an annoyance for repliers,
not at all for spammers. Is it really what you intended to do?


HTH, and bye!   Alain.