Hi there,
I ran into the same problem as Volker, so I took a look at the source
code.
Line 527 of src/recur.c in CVS checks whether the file needs to be downloaded:
"if (!acceptable (u->file))"
This only checks the first part of the filename, i.e. the part before the
query string. For example, given the following URL:
domin.studentenweb.org/user.php?op=register&module=NS-NewUser
it will ONLY try to match "user.php",
even though on a whole lot of sites whatever comes after .php determines the
content.
This is inconsistent, because once the file has been downloaded (e.g. if you
used --reject "*register*"), wget checks the WHOLE filename
("user.php?op=register&module=NS-NewUser") and deletes the file after all
(as Volker says).
This is done on line 366:
"if (opt.delete_after || (file && !acceptable (file)))"
I suggest two things:
1) Change line 527 to
"if (!acceptable (u->url))"
(or something better, but this worked for me and will match on
"user.php?op=register&module=NS-NewUser").
2) And (since we're on the topic): offer an option to also reject HTML
files that match the reject clause, because that could be very
convenient too (I would even make it the default).
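To illustrate the mismatch between the two checks described above, here is a
minimal sketch. It is not wget's actual code: it assumes POSIX fnmatch() as a
stand-in for whatever acceptable() does internally, and rejected() and
strip_query() are hypothetical helpers introduced only for this example.

```c
#include <fnmatch.h>
#include <string.h>

/* Hypothetical helper (not a wget function): returns 1 if `name`
   matches the shell-style reject pattern, 0 otherwise. */
static int rejected(const char *pattern, const char *name)
{
    return fnmatch(pattern, name, 0) == 0;
}

/* Hypothetical helper: copy the part of `name` before any '?' into
   `out`, mimicking a check that only sees the bare file name. */
static void strip_query(const char *name, char *out, size_t outlen)
{
    size_t n = strcspn(name, "?");   /* length up to the first '?' */
    if (n >= outlen)
        n = outlen - 1;              /* truncate to fit the buffer */
    memcpy(out, name, n);
    out[n] = '\0';
}
```

With the pattern "*register*", rejected() is false for the stripped name
"user.php" (so the download goes ahead), but true for the full name
"user.php?op=register&module=NS-NewUser" (so the file is deleted afterwards).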
greets,
Dominique De Munck
A small diff (taken from my adapted copy to the original, so the "<" line is my change):
527c527
< if (!acceptable (u->url))
---
> if (!acceptable (u->file))
A bigger one (diffed from wgetAdapted to wget, so the "-" lines are my change):
diff --unified --recursive --new-file wgetAdapted/src/patch wget/src/patch
--- wgetAdapted/src/patch 2003-08-11 08:56:41.000000000 +0200
+++ wget/src/patch 1970-01-01 01:00:00.000000000 +0100
@@ -1,4 +0,0 @@
-527c527
-< if (!acceptable (u->url))
----
-> if (!acceptable (u->file))
diff --unified --recursive --new-file wgetAdapted/src/recur.c wget/src/recur.c
--- wgetAdapted/src/recur.c 2003-08-11 09:18:01.000000000 +0200
+++ wget/src/recur.c 2002-07-24 23:16:30.000000000 +0200
@@ -524,7 +524,7 @@
 && depth != INFINITE_RECURSION
 && depth < opt.reclevel - 1))
 {
- if (!acceptable (u->url))
+ if (!acceptable (u->file))
 {
 DEBUGP (("%s (%s) does not match acc/rej rules.\n",
 url, u->file));
ORIGINAL MESSAGE:
-- I am trying to exclude certain file name patterns from a recursive http
download, but wget can't do it.
* The manpage says
Specify comma-separated lists of file name suffixes or
patterns to accept or reject.
I don't understand that "pattern", but it only does anything if the
filename ends in the specified string(s).
* --reject doesn't work if --html-extension is specified.
* --reject works by deleting the matching files after download. This
isn't much help to cut down on hundreds of MB of traffic, and deleting
files matching a pattern is much easier done with find xargs etc.
Volker
__
[EMAIL PROTECTED]
tel. 0486/238133
fax. 070 709185