[bug #60494] Percent character in filename gets escaped twice

2021-05-22 Thread Tim Ruehsen
Follow-up Comment #8, bug #60494 (project wget):

So what KodeCharlie says is correct (regarding b)).
I would reword it like: We have arbitrary user input which has nothing to do
what RFCs say. And this is the hard part as we have to 'guess' what the user
meant.
Once the input is 'normalized' (unescaped, charset translated (into utf-8),
protcoll extended, ...), the rest is straight forward following the RFCs.

@PetrPisar Regarding the filename: it is also user input. And the problem is
that the wget author(s) made some decisions in the past on how to treat user
input. There is no black and white here and any decision has it's pros and
cons.
I think that part of the problem is that URLs on web sites are often printed
in their escaped form. And wget users explicitly wanted to use copy
(from web site to console).

Then the next aspect is: we don't want to change a long-standing (default)
behavior. This breaks (production) scripts and command lines. What we can
possibly do is to add a new '--strict-input' option that skips 'guessing' and
instead assumes a 100% valid URL. BTW, this is a good idea for wget2 ;-)

I agree that "wget [option]... [URL]..." is not 100% correct in terms of RFCs.
But wget is also a user tool, and normal users don't have the RFCs in mind
when they think about URLs.



___

Reply to this item at:

  

___
  Message sent via Savannah
  https://savannah.gnu.org/




[bug #60494] Percent character in filename gets escaped twice

2021-05-16 Thread Kode Charlie
Follow-up Comment #7, bug #60494 (project wget):

My $0.02:  there's (a) the specification, and then there's (b) the reality of
what's out there in the wild.  Inline docs for *wget* make it clear that at
least part of the logic is concerned with (b).

An inefficient but possibly easy fix to this bug might simply be to make a
deep copy of the *struct url* each time *url_escape()* is called. This issue,
after all, has grown out of an optimization (ie, in-place string-edits) that
isn't working quite as expected.


___

Reply to this item at:

  

___
  Message sent via Savannah
  https://savannah.gnu.org/




[bug #60494] Percent character in filename gets escaped twice

2021-05-16 Thread Petr Pisar
Follow-up Comment #6, bug #60494 (project wget):

You cannot state a question like that because a random string is ambiguous by
it's nature.

According to the specification
 there is nothing
as an unescaped URI. URI is always escaped by the definition.

Look at the original report: You have a file name
"qtwebengine-everywhere-src-5.15.2-%231904652.patch.gz". It's a file name. Not
an URI. If you construct a URL for the file name using an
"https://mirrors.slackware.com/slackware/slackware64-current/source/l/qt5/patches/;
base URL, then you need to escape the file name string and then append it it
after a path delimiter of the base URL. I.e. you convert the file name to
"qtwebengine-everywhere-src-5.15.2-%25231904652.patch.gz" and then append it
to the base resulting into
"https://mirrors.slackware.com/slackware/slackware64-current/source/l/qt5/patches/qtwebengine-everywhere-src-5.15.2-%25231904652.patch.gz;
URL. This URL is passed to to wget command. Thus wget should not escape it
again. It could validate and report an error. But not escape it.

I will quote the specification here:

   Under normal circumstances, the only time when octets within a URI
   are percent-encoded is during the process of producing the URI from
   its component parts.  This is when an implementation determines which
   of the reserved characters are to be used as subcomponent delimiters
   and which can be safely used as data.  Once produced, a URI is always
   in its percent-encoded form.

Please, pay attention to the last sentence.

Of course wget could state that its argument is a byte stream without any
other constrains. But a manual of wget(1) reads something different, it states
it's a URL:

SYNOPSIS
   wget [option]... [URL]...

Hence wget should not attempt any escaping.

___

Reply to this item at:

  

___
  Message sent via Savannah
  https://savannah.gnu.org/




[bug #60494] Percent character in filename gets escaped twice

2021-05-16 Thread Tim Ruehsen
Follow-up Comment #5, bug #60494 (project wget):

The main Q is how does Wget know if the input URL is already escaped or not ?

Did you read this comment ?
https://gitlab.com/gnuwget/wget/-/blob/master/src/url.c#L329

___

Reply to this item at:

  

___
  Message sent via Savannah
  https://savannah.gnu.org/




[bug #60494] Percent character in filename gets escaped twice

2021-05-15 Thread Kode Charlie
Follow-up Comment #4, bug #60494 (project wget):

Here's one proposed change (see patch below).  It likely is not comprehensive
enough.  

The broader issue seems to be that there are circumstances in *./src/retr.c*
when *retrieve_url()* calls *url_parse()*, which already was called in
*./src/main.c*.  This can result in the *file* property in *struct url* being
modified in-place more than once (triggering errors as the one originally
reported in this issue).

For now, I'm going to defer making these changes (including suggested patch
below) to those more familiar with the *wget* repo than I.


diff --git a/src/http.c b/src/http.c
index d25762fa..a9fa237f 100644
--- a/src/http.c
+++ b/src/http.c
@@ -4280,8 +4280,10 @@ http_loop (const struct url *u, struct url
*original_url, char **newloc,
 }
   else if (!opt.content_disposition)
 {
-  hstat.local_file =
-url_file_name (opt.trustservernames ? u : original_url, NULL);
+  if (opt.trustservernames)
+hstat.local_file = url_file_name (u, NULL);
+  else
+hstat.local_file = original_url->file;
   got_name = true;
 }


___

Reply to this item at:

  

___
  Message sent via Savannah
  https://savannah.gnu.org/




[bug #60494] Percent character in filename gets escaped twice

2021-05-15 Thread Kode Charlie
Follow-up Comment #3, bug #60494 (project wget):

I have an explanation but as yet, no fix.

In *./src/url.c*, the routine *url_unescape()* is called twice for the same
path component (ie, the basename).  Since *url_unescape_1()* modifies the path
component in-place, the effect of processing:

qtwebengine-everywhere-src-5.15.2-%25231904652.patch.gz

twice is that after the first pass, the substring *%2523* is converted to
*%23*, and then after the second pass to *#*.

Hence, the resulting filename:

qtwebengine-everywhere-src-5.15.2-#1904652.patch.gz


I'm looking into a fix.

___

Reply to this item at:

  

___
  Message sent via Savannah
  https://savannah.gnu.org/




[bug #60494] Percent character in filename gets escaped twice

2021-05-15 Thread Kode Charlie
Follow-up Comment #2, bug #60494 (project wget):

Correction: based on debug-logs, quoted logic does seem ok. I am looking into
other logic _before_ convert_fname() (which calls _iconv()_) is called.

[comment #1 comment #1:]
> system: macOS Big Sur v. 11.2.3
> wget: master branch
> 
> Offending logic from *./src/url.c* is shown below. I believe the issue is in
*iconv()*.
> 
> 
>   if (iconv (cd, (ICONV_CONST char **) , , , ) ==
0
>   && iconv (cd, NULL, NULL, , ) == 0)
> {
>   *(converted_fname + len - outlen - done) = '\0';
>   iconv_close (cd);
>   DEBUGP (("Converted file name '%s' (%s) -> '%s' (%s)\n",
>orig_fname, from_encoding, converted_fname,
to_encoding));
>   xfree (orig_fname);
>   return converted_fname;
> }
> 

___

Reply to this item at:

  

___
  Message sent via Savannah
  https://savannah.gnu.org/




[bug #60494] Percent character in filename gets escaped twice

2021-05-15 Thread Kode Charlie
Follow-up Comment #1, bug #60494 (project wget):

system: macOS Big Sur v. 11.2.3
wget: master branch

Offending logic from *./src/url.c* is shown below. I believe the issue is in
*iconv()*.


  if (iconv (cd, (ICONV_CONST char **) , , , ) == 0
  && iconv (cd, NULL, NULL, , ) == 0)
{
  *(converted_fname + len - outlen - done) = '\0';
  iconv_close (cd);
  DEBUGP (("Converted file name '%s' (%s) -> '%s' (%s)\n",
   orig_fname, from_encoding, converted_fname, to_encoding));
  xfree (orig_fname);
  return converted_fname;
}


___

Reply to this item at:

  

___
  Message sent via Savannah
  https://savannah.gnu.org/




[bug #60494] Percent character in filename gets escaped twice

2021-05-01 Thread anonymous
URL:
  

 Summary: Percent character in filename gets escaped twice
 Project: GNU Wget
Submitted by: None
Submitted on: Sat 01 May 2021 04:53:06 PM UTC
Category: Program Logic
Severity: 3 - Normal
Priority: 5 - Normal
  Status: None
 Privacy: Public
 Assigned to: None
 Originator Name: Ed Grochowski
Originator Email: 
 Open/Closed: Open
 Release: trunk
 Discussion Lock: Any
Operating System: GNU/Linux
 Reproducibility: Every Time
   Fixed Release: None
 Planned Release: None
  Regression: None
   Work Required: None
  Patch Included: No

___

Details:

On Slackware64-current, the repository has a file whose name contains a
percent character.  The original name is
qtwebengine-everywhere-src-5.15.2-%231904652.patch.gz

When I download this file with wget 1.21.1, the saved filename is wrong.  The
percent character is being escaped twice, once correctly and a second time
superfluously.

% wget
https://mirrors.slackware.com/slackware/slackware64-current/source/l/qt5/patches/qtwebengine-everywhere-src-5.15.2-%25231904652.patch.gz
--2021-04-30 16:31:02-- 
https://mirrors.slackware.com/slackware/slackware64-current/source/l/qt5/patches/qtwebengine-everywhere-src-5.15.2-%25231904652.patch.gz
Resolving mirrors.slackware.com (mirrors.slackware.com)... 207.223.116.213
Connecting to mirrors.slackware.com
(mirrors.slackware.com)|207.223.116.213|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1653 (1.6K) [application/x-gzip]
Saving to: ‘qtwebengine-everywhere-src-5.15.2-#1904652.patch.gz’

qtwebengine-everywh 100%[===>]   1.61K  --.-KB/sin 0s 


2021-04-30 16:31:02 (676 MB/s) -
‘qtwebengine-everywhere-src-5.15.2-#1904652.patch.gz’ saved [1653/1653]

% ls -l
total 4
-rw-r--r-- 1 ed users 1653 Feb  8 13:44
qtwebengine-everywhere-src-5.15.2-#1904652.patch.gz



___

File Attachments:


---
Date: Sat 01 May 2021 04:53:06 PM UTC  Name: wget-debug.txt  Size: 2KiB   By:
None



___

Reply to this item at:

  

___
  Message sent via Savannah
  https://savannah.gnu.org/