Hi Andries,
as I already mentioned, changing the default behavior of wget is not a good
idea.
But I started a wget2 branch that produces wget and wget2 executables.
wget2's default behavior is to keep filenames as they are.
I am not sure how it compiles and works on Windows (Cygwin could work).
If you dare to check it out: any feedback is highly welcome.
Regards, Tim
On Thursday 06 August 2015 23:40:45 Andries E. Brouwer wrote:
Today I again downloaded a large tree with wget and got only unusable
filenames. Fortunately I have the utility wgetfix that repairs the
consequences of this bug (see
http://www.win.tue.nl/~aeb/linux/misc/wget.html ), but nevertheless this
wget bug should be fixed.
(Maybe it has been fixed already? I looked at this in detail last year,
and there was some correspondence but I think nothing happened.
Have not looked at the latest sources.)
What happens is that wget under certain circumstances escapes
certain bytes in a filename. I think that this was always a mistake,
but it did not occur very much and was defensible: filenames with
embedded control characters are a pain.
Today the situation is just the opposite: when copying from a remote
UTF-8 system to a local UTF-8 system, correct and normal filenames
are escaped into illegal filenames that cannot be used.
These are worse than a pain: one cannot do much other than discard them.
What can the user do?
If she is on Windows, she is told to switch to Linux:
I can't help Windows users, but Wget is a power-user tool.
And a Windows power-user should be able to start a virtual
machine with Linux running to use tools like Wget.
If she is on Linux, the easiest is to discard all that was downloaded
and start over again, this time with the option
--restrict-file-names=nocontrol
If the user knows about wgetfix, that is an alternative.
One can also use curl instead of wget.
See also
http://savannah.gnu.org/bugs/?37564
http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
http://stackoverflow.com/questions/27054765/wget-japanese-characters
http://askubuntu.com/questions/233882/how-to-download-link-with-unicode-using-wget
http://www.win.tue.nl/~aeb/linux/misc/wget.html
Below I suggested an easy fix, and discussed some details.
Andries
On Wed, Apr 23, 2014 at 01:57:15PM +0200, Andries E. Brouwer wrote:
On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote:
On Tue, Apr 22, 2014 at 10:57 PM, Andries E. Brouwer wrote:
If I ask wget to download the wikipedia page
http://he.wikipedia.org/wiki/ש._שפרה
then I hope for a resulting file ש._שפרה.
Instead, wget gives me ש._שפר\327%94, where the \327
is an unpronounceable byte that cannot be typed.
(This is a UTF-8 system and the filename
that wget produces is not valid UTF-8.)
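To make the mangling concrete: the last letter of that name, ה (U+05D4), is the two bytes 0xD7 0x94 in UTF-8, and \327 is just 0xD7 written in octal. The sketch below is a simplified model, not wget's actual code. It assumes that bytes 0x00-0x1F and 0x7F-0x9F count as "control" bytes and are written as %XX, which is enough to reproduce the result above: the second byte of the letter is escaped, the first is left alone, and the sequence is no longer valid UTF-8. The names is_control_byte and escape_control_bytes are made up for this illustration.

```c
#include <stddef.h>

/* Simplified model (an assumption, not wget's source): bytes 0x00-0x1F
   and 0x7F-0x9F are treated as control bytes. */
static int is_control_byte(unsigned char c)
{
    return c < 0x20 || (c >= 0x7F && c <= 0x9F);
}

/* Copy src to dst, writing every control byte as %XX (uppercase hex).
   dst must have room for 3 * strlen(src) + 1 bytes.
   Returns the number of bytes written, not counting the final NUL. */
size_t escape_control_bytes(const char *src, char *dst)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        if (is_control_byte(c)) {
            dst[n++] = '%';
            dst[n++] = hex[c >> 4];
            dst[n++] = hex[c & 0x0F];
        } else {
            dst[n++] = (char)c;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Feeding it the two bytes "\xD7\x94" gives the four bytes 0xD7 '%' '9' '4': the lead byte of the UTF-8 sequence is kept and its continuation byte is replaced by ASCII text.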
Maybe it would be better if wget by default used the original filename.
This name mangling is a vestige of old times, it seems to me.
This is a commonly reported grievance and as you correctly mention a
vestige of old times. With UTF-8 supported filesystems, Wget should
simply write the correct characters.
I sincerely hope this issue is resolved as fast as possible, but I
know not how to. Those who understand i18n should work on this.
It is very easy to resolve the issue, but I don't know how backwards
compatible the wget developers want to be.
The easiest solution is to change the line (in init.c:defaults())
opt.restrict_files_ctrl = true;
into
opt.restrict_files_ctrl = false;
That is what I would like to see:
the default should be to preserve the name as-is,
and there should be options escape_control or so
to force the current default behaviour.
There are also more complicated solutions.
One can ask for LC_CTYPE or LANG or some such thing,
and try to find out whether the current system is UTF-8,
and only in that case set restrict_files_ctrl to false.
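A minimal sketch of that locale check, assuming a POSIX system where nl_langinfo(CODESET) is available (so it says nothing about Windows). The helper names codeset_is_utf8 and default_restrict_files_ctrl are hypothetical and do not exist in the wget sources; the second one only returns the value that opt.restrict_files_ctrl would be given.

```c
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if the codeset name means UTF-8 ("UTF-8", "utf8", ...).
   The comparison ignores case and hyphens. */
bool codeset_is_utf8(const char *codeset)
{
    const char *want = "utf8";

    if (codeset == NULL)
        return false;
    while (*codeset != '\0') {
        if (*codeset == '-') {          /* skip hyphens */
            codeset++;
            continue;
        }
        if (*want == '\0' || tolower((unsigned char)*codeset) != *want)
            return false;
        codeset++;
        want++;
    }
    return *want == '\0';
}

/* Hypothetical helper: escape control bytes by default only when the
   locale's character set is not UTF-8.  Note that it calls setlocale()
   and therefore changes the process's LC_CTYPE. */
bool default_restrict_files_ctrl(void)
{
    setlocale(LC_CTYPE, "");            /* honours LC_ALL, LC_CTYPE, LANG */
    return !codeset_is_utf8(nl_langinfo(CODESET));
}
```

Asking for the codeset after setlocale() avoids parsing LANG or LC_CTYPE by hand, since the C library already knows the precedence of LC_ALL, LC_CTYPE and LANG.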
I don't know anything about the Windows environment.
Andries
[Discussion:
There is a flag --restrict-file-names. The manual page says
By default, Wget escapes the characters that are not valid or safe
as part of file names on your operating system, as well as control
characters that are typically unprintable.
Presently this is false: On a UTF-8 system Wget by default introduces
illegal characters. The option nocontrol is needed to preserve the
correct name.
The flag is handled in init.c:cmd_spec_restrict_file_names()
where opt.restrict_files_{os,case,ctrl,nonascii} are set.
Of interest is the restrict_files_ctrl flag.
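For readers who do not have init.c at hand, here is a rough sketch of how a comma-separated value such as "windows,nocontrol" can be mapped onto four such flags. It is a simplified stand-in, not the wget source: the struct, the enum names and parse_restrict_file_names are invented for this example, and only the keywords unix, windows, nocontrol, ascii, lowercase and uppercase from the manual page are handled.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for opt.restrict_files_{os,case,ctrl,nonascii}. */
enum os_mode   { MODE_UNIX, MODE_WINDOWS };
enum case_mode { CASE_KEEP, CASE_LOWER, CASE_UPPER };

struct restrict_opts {
    enum os_mode   os;
    enum case_mode letter_case;
    bool           ctrl;       /* true: escape control bytes */
    bool           nonascii;   /* true: escape non-ASCII bytes */
};

/* Does the token tok[0..len) equal the NUL-terminated word? */
static bool token_is(const char *tok, size_t len, const char *word)
{
    return strlen(word) == len && memcmp(tok, word, len) == 0;
}

/* Apply a comma-separated list to *o.  On an unknown or empty token,
   return false and leave *o untouched. */
bool parse_restrict_file_names(const char *val, struct restrict_opts *o)
{
    struct restrict_opts tmp = *o;

    for (;;) {
        const char *end = strchr(val, ',');
        size_t len = end ? (size_t)(end - val) : strlen(val);

        if (token_is(val, len, "unix"))
            tmp.os = MODE_UNIX;
        else if (token_is(val, len, "windows"))
            tmp.os = MODE_WINDOWS;
        else if (token_is(val, len, "lowercase"))
            tmp.letter_case = CASE_LOWER;
        else if (token_is(val, len, "uppercase"))
            tmp.letter_case = CASE_UPPER;
        else if (token_is(val, len, "nocontrol"))
            tmp.ctrl = false;
        else if (token_is(val, len, "ascii"))
            tmp.nonascii = true;
        else
            return false;

        if (end == NULL)
            break;
        val = end + 1;
    }
    *o = tmp;
    return true;
}
```

In this model "nocontrol" is the only keyword that touches the ctrl flag, and it can only switch it off, so the default chosen at startup decides what users get when they pass nothing.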
Today init.c does by default:
#if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
opt.restrict_files_os = restrict_windows;
#else
opt.restrict_files_os = restrict_unix;
#endif
opt.restrict_files_ctrl = true;
opt.restrict_files_nonascii = false;