Re: [Bug-wget] Unexpected result with -H and -D

2018-01-18 Thread Friso van Vollenhoven
Hi,
Thanks for confirming it's a bug. I'm currently not fluent enough in C to
provide a fix myself, but I see a patch was already posted, so I hope
that's satisfactory.

Cheers,
Friso


On Wed, Jan 17, 2018 at 3:01 PM, Darshit Shah  wrote:

> Hi,
>
> This is a bug in Wget, apparently a really old one! Seems like the bug has
> been around since at least 1997.
>
> Looking at the source, the issue is that Wget does a very simple suffix
> matching on the actual domain and accepted domains list. This is obviously
> wrong as you have just found out.
>
> I'm going to try and implement this correctly, but I'm currently a little
> short on time, so if anyone else wants to pick it up, please feel free to.
> It's simple: use libpsl to get the proper domain name and match against that.
>
>
> Of course, this change will require libpsl to no longer be an optional
> dependency.
>
> * Friso van Vollenhoven  [180117 14:40]:
> > Hello all,
> >
> > I am trying to do a recursive download of a webpage and span multiple hosts
> > within the same domain, but not cross to other domains. The issue is that
> > the crawl does extend to other domains. My full command is this:
> >
> > wget \
> > --recursive \
> > --no-clobber \
> > --page-requisites \
> > --adjust-extension \
> > --span-hosts \
> > --domains=scapino.nl \
> > --no-parent \
> > --tries=2 \
> > --wait=1 \
> > --random-wait \
> > --waitretry=2 \
> > --header='User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2)
> > AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
> > https://www.scapino.nl/winkels/scapino-utrecht-510061
> >
> > From this combination of --span-hosts and --domains, I would expect to
> > download assets from cdn.scapino.nl and www.scapino.nl, but not other
> > domains. For some reason that I don't understand, wget also starts to do
> > what looks like a full crawl of the domain werkenbijscapino.nl, which is
> > referenced from the original page.
> >
> > Any thoughts or direction would be much appreciated.
> >
> > I am using wget 1.18 on Debian.
> >
> >
> > Best regards,
> > Friso
>
> --
> Thanking You,
> Darshit Shah
> PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
>


Re: [Bug-wget] Unexpected result with -H and -D

2018-01-17 Thread Tim Rühsen
Hi,

This is not a PSL matching problem, so libpsl is not needed.

Only sufmatch() has to be fixed to do (sub)domain matching.

Attached is a fix.


With Best Regards, Tim



On 01/17/2018 03:01 PM, Darshit Shah wrote:
> Hi,
> 
> This is a bug in Wget, apparently a really old one! Seems like the bug has
> been around since at least 1997.
> 
> Looking at the source, the issue is that Wget does a very simple suffix
> matching on the actual domain and accepted domains list. This is obviously
> wrong as you have just found out.
> 
> I'm going to try and implement this correctly, but I'm currently a little
> short on time, so if anyone else wants to pick it up, please feel free to.
> It's simple: use libpsl to get the proper domain name and match against that.
>
>
> Of course, this change will require libpsl to no longer be an optional
> dependency.
> 
> * Friso van Vollenhoven  [180117 14:40]:
>> Hello all,
>>
>> I am trying to do a recursive download of a webpage and span multiple hosts
>> within the same domain, but not cross to other domains. The issue is that
>> the crawl does extend to other domains. My full command is this:
>>
>> wget \
>> --recursive \
>> --no-clobber \
>> --page-requisites \
>> --adjust-extension \
>> --span-hosts \
>> --domains=scapino.nl \
>> --no-parent \
>> --tries=2 \
>> --wait=1 \
>> --random-wait \
>> --waitretry=2 \
>> --header='User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2)
>> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
>> https://www.scapino.nl/winkels/scapino-utrecht-510061
>>
>> From this combination of --span-hosts and --domains, I would expect to
>> download assets from cdn.scapino.nl and www.scapino.nl, but not other
>> domains. For some reason that I don't understand, wget also starts to do
>> what looks like a full crawl of the domain werkenbijscapino.nl, which is
>> referenced from the original page.
>>
>> Any thoughts or direction would be much appreciated.
>>
>> I am using wget 1.18 on Debian.
>>
>>
>> Best regards,
>> Friso
> 
From 1ad636baa63cfe029c84235986626c17e4ff33cb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tim=20R=C3=BChsen?= 
Date: Wed, 17 Jan 2018 15:50:48 +0100
Subject: [PATCH] * src/host.c (sufmatch): Fix to domain matching

---
 src/host.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/src/host.c b/src/host.c
index 2ddae328..d337cc7c 100644
--- a/src/host.c
+++ b/src/host.c
@@ -1017,18 +1017,25 @@ sufmatch (const char **list, const char *what)
   int i, j, k, lw;
 
   lw = strlen (what);
+
   for (i = 0; list[i]; i++)
     {
-      if (list[i][0] == '\0')
-        continue;
+      j = strlen (list[i]);
+      if (lw < j)
+        continue; /* what is no (sub)domain of list[i] */
 
-      for (j = strlen (list[i]), k = lw; j >= 0 && k >= 0; j--, k--)
+      for (k = lw; j >= 0 && k >= 0; j--, k--)
         if (c_tolower (list[i][j]) != c_tolower (what[k]))
           break;
-      /* The domain must be first to reach to beginning.  */
-      if (j == -1)
+
+      /* Domain or subdomain match
+       * k == -1: exact match
+       * k >= 0 && what[k] == '.': subdomain match
+       */
+      if (j == -1 && (k == -1 || what[k] == '.'))
         return true;
     }
+
   return false;
 }
 
-- 
2.15.1





Re: [Bug-wget] Unexpected result with -H and -D

2018-01-17 Thread Darshit Shah
Hi,

This is a bug in Wget, apparently a really old one! Seems like the bug has been
around since at least 1997.

Looking at the source, the issue is that Wget does a very simple suffix
matching on the actual domain and accepted domains list. This is obviously
wrong as you have just found out.

I'm going to try and implement this correctly, but I'm currently a little short
on time, so if anyone else wants to pick it up, please feel free to. It's
simple: use libpsl to get the proper domain name and match against that.


Of course, this change will require libpsl to no longer be an optional
dependency.

* Friso van Vollenhoven  [180117 14:40]:
> Hello all,
> 
> I am trying to do a recursive download of a webpage and span multiple hosts
> within the same domain, but not cross to other domains. The issue is that
> the crawl does extend to other domains. My full command is this:
> 
> wget \
> --recursive \
> --no-clobber \
> --page-requisites \
> --adjust-extension \
> --span-hosts \
> --domains=scapino.nl \
> --no-parent \
> --tries=2 \
> --wait=1 \
> --random-wait \
> --waitretry=2 \
> --header='User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2)
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
> https://www.scapino.nl/winkels/scapino-utrecht-510061
> 
> From this combination of --span-hosts and --domains, I would expect to
> download assets from cdn.scapino.nl and www.scapino.nl, but not other
> domains. For some reason that I don't understand, wget also starts to do
> what looks like a full crawl of the domain werkenbijscapino.nl, which is
> referenced from the original page.
> 
> Any thoughts or direction would be much appreciated.
> 
> I am using wget 1.18 on Debian.
> 
> 
> Best regards,
> Friso

-- 
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6




[Bug-wget] Unexpected result with -H and -D

2018-01-17 Thread Friso van Vollenhoven
Hello all,

I am trying to do a recursive download of a webpage and span multiple hosts
within the same domain, but not cross to other domains. The issue is that
the crawl does extend to other domains. My full command is this:

wget \
--recursive \
--no-clobber \
--page-requisites \
--adjust-extension \
--span-hosts \
--domains=scapino.nl \
--no-parent \
--tries=2 \
--wait=1 \
--random-wait \
--waitretry=2 \
--header='User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
https://www.scapino.nl/winkels/scapino-utrecht-510061

From this combination of --span-hosts and --domains, I would expect to
download assets from cdn.scapino.nl and www.scapino.nl, but not other
domains. For some reason that I don't understand, wget also starts to do
what looks like a full crawl of the domain werkenbijscapino.nl, which is
referenced from the original page.

Any thoughts or direction would be much appreciated.

I am using wget 1.18 on Debian.


Best regards,
Friso