Re: [Bug-wget] [PATCH 2/2] Rewrite the --rejected-log test using the new framework.

2015-08-07 Thread Giuseppe Scrivano
Jookia 166...@gmail.com writes:

  * tests/Test--rejected-log.px: Remove old test.
  * testenv/Test--rejected-log.py: Create new test.
 ---
  testenv/Makefile.am           |   1 +
  testenv/Test--rejected-log.py | 104 +++
  tests/Makefile.am             |   1 -
  tests/Test--rejected-log.px   | 138 --
  4 files changed, 105 insertions(+), 139 deletions(-)
  create mode 100755 testenv/Test--rejected-log.py
  delete mode 100755 tests/Test--rejected-log.px

Fast :-)  Thanks for the additional test.

Regards,
Giuseppe



Re: [Bug-wget] bad filenames (again)

2015-08-07 Thread Tim Ruehsen
Hi Andries,

as I already mentioned, changing the default behavior of wget is not a good 
idea.

But I started a wget2 branch that produces wget and wget2 executables.
wget2's default behavior is to keep filenames as they are.

I am not sure how it compiles and works on Windows (Cygwin could work).
If you dare to check it out: any feedback is highly welcome.

Regards, Tim

On Thursday 06 August 2015 23:40:45 Andries E. Brouwer wrote:
 Today I again downloaded a large tree with wget and got only unusable
 filenames. Fortunately I have the utility wgetfix that repairs the
 consequences of this bug (see
 http://www.win.tue.nl/~aeb/linux/misc/wget.html ), but nevertheless this
 wget bug should be fixed.
 
 (Maybe it has been fixed already? I looked at this in detail last year,
 and there was some correspondence but I think nothing happened.
 Have not looked at the latest sources.)
 
 What happens is that wget under certain circumstances escapes
 certain bytes in a filename. I think that this was always a mistake,
 but it did not occur very much and was defensible: filenames with
 embedded control characters are a pain.
 
 Today the situation is just the opposite: when copying from a remote
 UTF-8 system to a local UTF-8 system, correct and normal filenames
 are escaped into illegal filenames that cannot be used. These are
 worse than a pain: one can do little else than discard them.
 
 What can the user do?
 
 If she is on Windows, she is told to switch to Linux:
  I can't help Windows users, but Wget is a power-user tool.
  And a Windows power-user should be able to start a virtual
  machine with Linux running to use tools like Wget.
 
 If she is on Linux, the easiest is to discard all that was downloaded
 and start over again, this time with the option
 --restrict-file-names=nocontrol
 
 If the user knows about wgetfix, that is an alternative.
 
 One can also use curl instead of wget.
 
 See also
 
 http://savannah.gnu.org/bugs/?37564
 http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
 http://stackoverflow.com/questions/27054765/wget-japanese-characters
 http://askubuntu.com/questions/233882/how-to-download-link-with-unicode-using-wget
 http://www.win.tue.nl/~aeb/linux/misc/wget.html
 
 Below I suggested an easy fix, and discussed some details.
 
 Andries
 
 On Wed, Apr 23, 2014 at 01:57:15PM +0200, Andries E. Brouwer wrote:
  On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote:
   On Tue, Apr 22, 2014 at 10:57 PM, Andries E. Brouwer wrote:
   If I ask wget to download the wikipedia page
   
   http://he.wikipedia.org/wiki/ש._שפרה
   
   then I hope for a resulting file ש._שפרה.
   Instead, wget gives me ש._שפר\327%94, where the \327
   is an unpronounceable byte that cannot be typed.
   (This is a UTF-8 system, and the filename
   that wget produces is not valid UTF-8.)
   
   Maybe it would be better if wget by default used the original filename.
   This name mangling is a vestige of old times, it seems to me.
   
   This is a commonly reported grievance and as you correctly mention a
   vestige of old times. With UTF-8 supported filesystems, Wget should
   simply write the correct characters.
   
   I sincerely hope this issue is resolved as fast as possible, but I
   know not how to. Those who understand i18n should work on this.
  
  It is very easy to resolve the issue, but I don't know how backwards
  compatible the wget developers want to be.
  
  The easiest solution is to change the line (in init.c:defaults())
  
  opt.restrict_files_ctrl = true;
  
  into
  
  opt.restrict_files_ctrl = false;
  
  That is what I would like to see:
  the default should be to preserve the name as-is,
  and there should be an option, escape_control or so,
  to force the current default behaviour.
  
  There are also more complicated solutions.
  One can ask for LC_CTYPE or LANG or some such thing,
  and try to find out whether the current system is UTF-8,
  and only in that case set restrict_files_ctrl to false.
  
  I don't know anything about the Windows environment.
  
  Andries
  
  
  [Discussion:
  
  There is a flag --restrict-file-names. The manual page says

    By default, Wget escapes the characters that are not valid or safe
    as part of file names on your operating system, as well as control
    characters that are typically unprintable.
  
  Presently this is false: On a UTF-8 system Wget by default introduces
  illegal characters. The option nocontrol is needed to preserve the
  correct name.
  
  The flag is handled in init.c:cmd_spec_restrict_file_names()
  where opt.restrict_files_{os,case,ctrl,nonascii} are set.
  Of interest is the restrict_files_ctrl flag.
  Today init.c does by default:
  
  #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
    opt.restrict_files_os = restrict_windows;
  #else
    opt.restrict_files_os = restrict_unix;
  #endif
    opt.restrict_files_ctrl = true;
    opt.restrict_files_nonascii = false;
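
The locale-based variant suggested in the quoted message (look at LC_CTYPE or LANG, and turn off control-byte escaping only on a UTF-8 system) might look roughly like the sketch below. It is not taken from the wget sources: struct sketch_options, codeset_is_utf8(), locale_is_utf8() and set_ctrl_default() are hypothetical names used for illustration, and the sketch assumes a POSIX system that provides nl_langinfo(CODESET).

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical stand-in for the single wget option discussed above. */
struct sketch_options {
    bool restrict_files_ctrl;
};

/* True if a codeset name such as "UTF-8" or "utf8" denotes UTF-8. */
static bool codeset_is_utf8(const char *codeset)
{
    return codeset != NULL
        && (strcasecmp(codeset, "UTF-8") == 0
            || strcasecmp(codeset, "UTF8") == 0);
}

/* Ask the C library which codeset the user's LC_CTYPE locale uses. */
static bool locale_is_utf8(void)
{
    /* Adopt LC_ALL / LC_CTYPE / LANG from the environment. */
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}

/* Escape control bytes by default only when the locale is not UTF-8. */
static void set_ctrl_default(struct sketch_options *opt, bool utf8)
{
    assert(opt != NULL);
    opt->restrict_files_ctrl = !utf8;
}
```

A caller would write set_ctrl_default(&opt, locale_is_utf8()) where the defaults are set up. Using nl_langinfo() rather than parsing the LANG string by hand leaves the precedence of LC_ALL, LC_CTYPE and LANG to the C library.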

Re: [Bug-wget] bad filenames (again)

2015-08-07 Thread Tim Ruehsen
On Friday 07 August 2015 16:38:01 Andries E. Brouwer wrote:
 On Fri, Aug 07, 2015 at 04:14:45PM +0200, Tim Ruehsen wrote:
  Hi Andries,
  
  as I already mentioned, changing the default behavior of wget is not a
  good idea.
  
  But I started a wget2 branch that produces wget and wget2 executables.
  wget2's default behavior is to keep filenames as they are.
  
  I am not sure how it compiles and works on Windows (Cygwin could work).
  If you dare to check it out: any feedback is highly welcome.
  
  Regards, Tim
 
 Hi Tim,
 
 I disagree. This is just a bug.
 Nobody wants illegal filenames.
 Even removing them is not entirely trivial since the filenames
 produced by wget are not legal character sequences, so cannot be typed.

Hi Andries,

obviously I got it wrong.

If it's a bug, let's just fix it (without breaking compatibility).

I don't have the time to read *all* the old emails right now.
But as far as I understand escaping occurs within legal UTF-8 sequences - and 
you are right when saying this is a bug when we have a UTF-8 locale.

The solution would be something like

if locale is UTF-8
  do not escape valid UTF-8 sequences
else
  keep wget's current behavior
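
A minimal C sketch of the pseudocode above, under stated assumptions: the "current behavior" branch is a simplified stand-in that escapes bytes 0x00-0x1F and 0x80-0x9F as %XX, chosen only because it reproduces the \327%94 example from earlier in the thread. The helper names utf8_seq_len() and escape_name() are hypothetical and do not come from the wget sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static bool is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Length (1..4) of the valid UTF-8 sequence at s, or 0 if invalid. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char b0;

    if (avail == 0) return 0;
    b0 = s[0];
    if (b0 < 0x80) return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF)
        return (avail >= 2 && is_cont(s[1])) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {
        if (avail < 3 || !is_cont(s[1]) || !is_cont(s[2])) return 0;
        if (b0 == 0xE0 && s[1] < 0xA0) return 0;   /* overlong form */
        if (b0 == 0xED && s[1] > 0x9F) return 0;   /* surrogate range */
        return 3;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {
        if (avail < 4 || !is_cont(s[1]) || !is_cont(s[2]) || !is_cont(s[3]))
            return 0;
        if (b0 == 0xF0 && s[1] < 0x90) return 0;   /* overlong form */
        if (b0 == 0xF4 && s[1] > 0x8F) return 0;   /* beyond U+10FFFF */
        return 4;
    }
    return 0;   /* stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF */
}

/* Assumed escaping rule: C0 control bytes and bytes 0x80..0x9F. */
static bool is_ctrl_byte(unsigned char b)
{
    return b < 0x20 || (b >= 0x80 && b <= 0x9F);
}

/* Write the escaped name into out; return 0 on success, -1 if out is too small. */
static int escape_name(const char *in, bool utf8_locale, char *out, size_t outsz)
{
    const unsigned char *s = (const unsigned char *) in;
    size_t n, i = 0, o = 0;

    assert(in != NULL && out != NULL);
    n = strlen(in);
    while (i < n) {
        size_t len = utf8_locale ? utf8_seq_len(s + i, n - i) : 0;

        if (len > 1) {
            /* Valid multibyte sequence in a UTF-8 locale: copy untouched. */
            if (o + len >= outsz) return -1;
            memcpy(out + o, s + i, len);
            o += len;
            i += len;
        } else if (is_ctrl_byte(s[i])) {
            /* Otherwise fall back to escaping control bytes as %XX. */
            if (o + 3 >= outsz) return -1;
            snprintf(out + o, 4, "%%%02X", (unsigned) s[i]);
            o += 3;
            i++;
        } else {
            if (o + 1 >= outsz) return -1;
            out[o++] = (char) s[i++];
        }
    }
    out[o] = '\0';
    return 0;
}
```

With utf8_locale set, the two bytes 0xD7 0x94 (the letter ה) pass through unchanged; without it, the second byte comes out as %94, as in the reported filename. Note that this sketch also lets the C1 controls U+0080..U+009F through when they are validly encoded; whether those should still be escaped is a separate decision.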

If URLs (and thus filenames) are not in UTF-8, Wget will convert them to UTF-8
before the above procedure (I guess that is what wget does anyway, though I am
not 100% sure).

Would you agree?

If you provide a patch for this, we would appreciate it.

 I am a Linux man, no Windows computers here. So, I am happy to do
 stuff on Linux, but cannot test on Windows.

Sorry, won't bother you again regarding Windows ;-)

Tim




Re: [Bug-wget] bad filenames (again)

2015-08-07 Thread Andries E. Brouwer
On Fri, Aug 07, 2015 at 04:14:45PM +0200, Tim Ruehsen wrote:
 Hi Andries,
 
 as I already mentioned, changing the default behavior of wget is not a good 
 idea.
 
 But I started a wget2 branch that produces wget and wget2 executables.
 wget2's default behavior is to keep filenames as they are.
 
 I am not sure how it compiles and works on Windows (Cygwin could work).
 If you dare to check it out: any feedback is highly welcome.
 
 Regards, Tim

Hi Tim,

I disagree. This is just a bug.
Nobody wants illegal filenames.
Even removing them is not entirely trivial since the filenames
produced by wget are not legal character sequences, so cannot be typed.

So, I think this should be fixed, for example with my one-liner fix,
but I am quite happy to do something more complicated if that is
what people prefer.

I am a Linux man, no Windows computers here. So, I am happy to do
stuff on Linux, but cannot test on Windows.

Andries