Hello,

I am trying to mirror my website using wget. The site is driven by
wiki software and therefore contains a couple of publicly accessible
interactive pages that I want to exclude from the download, e.g.:

  http://www.daemon.de/podwiki.pl?page=ShorthandSandbox&state=checkout&revision=1.11

This link checks out an old version of an (unprotected) page and
overwrites the current revision of that page.

For the site itself, Apache mod_rewrite rules are in place so the
content can also be accessed via normal URLs, e.g.:

 http://www.daemon.de/ShorthandSandbox

Now I want wget to ignore every interactive page, so I tried:

wget -vm --exclude-directories='*.pl*' http://www.daemon.de

But it still fetches interactive stuff:

--23:37:55--  http://www.daemon.de/podwiki.pl?page=PodWikiIndex&entry=AutoLoadPrint
           => `www.daemon.de/podwiki.pl?page=PodWikiIndex&entry=AutoLoadPrint'
Reusing connection to www.daemon.de:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]


I dug a little through the code and found that somewhere it calls the
function fnmatch() to decide whether a given URL matches an exclusion
glob. I don't fully understand what happens after that, but I tried
the function in isolation and it works as I would expect:
'*.pl*' matches 'podwiki.pl?page=PodWikiIndex'.

IMHO this is a bug; it seems that wget ignores the fnmatch() result
for some reason.



kind regards, Tom

-- 
 Thomas Linden   (http://www.daemon.de/)  tom at co dot daemon dot de
 $_=`perl -v`;s;^.*ll;;s;$^=unpack"u", "'8V]D;')E<```";s;\W;;gs;$/=7*
 ($^=~s;.;;g);%^=map{$_=>1}split//,lc;$_=join$\, (sort keys(%^))[map{
 ord($_)-$/}split//,'[EMAIL PROTECTED]:7C1A7C=1:35<7C'];s"0(.)" \U$1"g;print;
