RE: --reject issues

Dominique De Munck Thu, 14 Aug 2003 13:08:04 -0700

Hi there,

I ran into the same problems as Volker, so I took a look at the source
code.


Line 527 of src/recur.c in CVS decides whether a file needs to be downloaded:

"if (!acceptable (u->file))"

This only checks the first part of the file name. For example, for the
following URL:

 domin.studentenweb.org/user.php?op=register&module=NS-NewUser

it will ONLY try to match "user.php", while on a whole lot of sites it is
the part after .php that determines the content.

This is inconsistent: after the file has been downloaded (e.g. if you used
--reject "*register*"), wget checks the WHOLE file name
("user.php?op=register&module=NS-NewUser"), so the file gets deleted after
it was downloaded anyway (as Volker says).

That second check is done on line 366:
"if (opt.delete_after || (file && !acceptable (file)))"
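
To make the inconsistency concrete, here is a minimal sketch in C. It is not wget code: passes_reject() and file_without_query() are hypothetical helpers that only mimic the two checks, assuming the reject rule is a shell-style pattern matched with POSIX fnmatch().

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the accept/reject test: returns true when
   `name` does NOT match the shell-style reject pattern. */
static bool passes_reject(const char *reject_pattern, const char *name)
{
    return fnmatch(reject_pattern, name, 0) != 0;
}

/* Hypothetical helper: copy the last path component of `url` into `buf`,
   stopping at the first '?'. This mimics the "first part only" name that
   the pre-download check sees. */
static const char *file_without_query(const char *url, char *buf, size_t buflen)
{
    size_t path_len = strcspn(url, "?");   /* length of the part before '?' */
    const char *start = url;
    size_t i, len;

    if (buflen == 0)
        return buf;

    /* Find the character after the last '/' that precedes the query. */
    for (i = 0; i < path_len; i++)
        if (url[i] == '/')
            start = url + i + 1;

    len = (size_t)((url + path_len) - start);
    if (len >= buflen)
        len = buflen - 1;                  /* truncate rather than overflow */
    memcpy(buf, start, len);
    buf[len] = '\0';
    return buf;
}
```

With the URL above and the pattern "*register*", file_without_query() yields "user.php", which passes the reject test, so the download goes ahead. The full name "user.php?op=register&module=NS-NewUser" does not pass, which is why the file is removed afterwards.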

I suggest two things:

1) Change line 527 to

 "if (!acceptable (u->url))"

(or something better, but this worked for me and will match on
"user.php?op=register&module=NS-NewUser").

2) Since we're on the topic: offer an option to also reject HTML files
that match the reject clause, because that could be very convenient too
(I would even make it the default).

greets,

Dominique De Munck

A small diff:

527c527
<       if (!acceptable (u->url))
---
>       if (!acceptable (u->file))

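A bigger one (note that both diffs run from my adapted tree to the
original wget tree, so they have to be applied in reverse, e.g. with
patch -R):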


diff --unified --recursive --new-file wgetAdapted/src/patch wget/src/patch
--- wgetAdapted/src/patch       2003-08-11 08:56:41.000000000 +0200
+++ wget/src/patch      1970-01-01 01:00:00.000000000 +0100
@@ -1,4 +0,0 @@
-527c527
-<       if (!acceptable (u->url))
----
->       if (!acceptable (u->file))
diff --unified --recursive --new-file wgetAdapted/src/recur.c wget/src/recur.c
--- wgetAdapted/src/recur.c     2003-08-11 09:18:01.000000000 +0200
+++ wget/src/recur.c    2002-07-24 23:16:30.000000000 +0200
@@ -524,7 +524,7 @@
           && depth != INFINITE_RECURSION
           && depth < opt.reclevel - 1))
     {
-      if (!acceptable (u->url))
+      if (!acceptable (u->file))
        {
          DEBUGP (("%s (%s) does not match acc/rej rules.\n",
                   url, u->file));


ORIGINAL MESSAGE :

-- I am trying to exclude certain file name patterns from a recursive http
download, but wget can't do it.

* The manpage says

           Specify comma-separated lists of file name suffixes or
           patterns to accept or reject.

I don't understand that "pattern", but it only does anything if the
filename ends in the specified string(s).

* --reject doesn't work if --html-extension is specified.

* --reject works by deleting the matching files after download. This
isn't much help to cut down on hundreds of MB of traffic, and deleting
files matching a pattern is much easier done with find xargs etc.

Volker

__

[EMAIL PROTECTED]
tel. 0486/238133
fax. 070 709185  