wget and wiki crawling

2008-08-22 Thread asm c

Greetings,

Saw the address to this mailing list on the IRC topic/motd, so I thought 
asking here might help. Please CC any replies to me.

I've recently been using wget, and got it working for the most part, but 
there's one issue that's really been bugging me. One of the parameters I use is 
'-R *action=*,*oldid=*' (side note on the platform: ZSH on NetBSD on the SDF 
public access unix system, although I've also used it on Windows with the same 
result). The purpose of this parameter is so that, when wget crawls a mid-sized 
wiki I'd like to have a local copy of, it doesn't bother with all the history 
pages, edit pages, and so forth. Not downloading these would save me an 
enormous amount of time. Unfortunately, the parameter is ignored until after 
the php page is downloaded. So, because it waits until it's downloaded to 
delete it, using the param doesn't really help at all.

Does anyone know how I can stop wget from even downloading matching pages?

Thanks.

Re: wget and wiki crawling

2008-08-22 Thread Micah Cowan

asm c wrote:
 I've recently been using wget, and got it working for the most part, but
 there's one issue that's really been bugging me. One of the parameters I
 use is '-R *action=*,*oldid=*' (side note on the platform: ZSH on
 NetBSD on the SDF public access unix system, although I've also used it
 on windows with the same result). The purpose of this parameter is so
 that, when wget crawls a mid-sized wiki I'd like to have a local copy
 of, it doesn't bother with all the history pages, edit pages, and so
 forth. Not downloading these would save me an enormous amount of time.
 Unfortunately, the parameter is ignored until after the php page is
 downloaded. So, because it waits until it's downloaded to delete it,
 using the param doesn't really help at all.
 
 Does anyone know how I can stop wget from even downloading matching pages?

Well, you don't mention it, but I'll assume that those patterns occur in
the query string portion of the URL: that is, they follow a question
mark (?) that appears at some point.

Unfortunately, the -R and -A options only apply to the filename
portion of the URL: that is, whatever falls between the first question
mark and the first preceding slash (/). Confusingly, they are also
applied _after_ files are downloaded, to determine whether they should
be deleted after the fact: so Wget probably downloads those files you
really wish it wouldn't, and then deletes them afterwards anyway.
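
To illustrate the matching described above, here is a minimal Python
sketch. It is not Wget's actual code: the helper names and the example
URL are hypothetical, and Python's fnmatch merely stands in for Wget's
own pattern matching. It shows only why a pattern such as *action=*
never sees the query string.

```python
from fnmatch import fnmatchcase

def filename_portion(url):
    """Return the text between the first '?' and the slash preceding it."""
    before_query = url.split("?", 1)[0]      # drop the query string, if any
    return before_query.rsplit("/", 1)[-1]   # keep what follows the last slash

def rejected(url, patterns):
    """True if any reject pattern matches the filename portion only."""
    name = filename_portion(url)
    return any(fnmatchcase(name, pattern) for pattern in patterns)

# Hypothetical wiki URL; the patterns are the ones from the original post.
patterns = ["*action=*", "*oldid=*"]
url = "http://wiki.example.org/index.php?title=Main_Page&action=edit"

print(filename_portion(url))    # the part the patterns are compared against
print(rejected(url, patterns))  # whether the URL would be rejected
```

Under these assumptions the patterns are compared against "index.php"
alone, so the "action=edit" part of the URL can never cause a match.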

Worse, there's no way around this, currently. This is part of a suite of
problems that are currently slated to be addressed soon. The most
pertinent to your problem, though, is the need for a way to match
against query strings. I'm very much hoping to get around to this before
the next major Wget release, version 1.12. It's being tracked here:

https://savannah.gnu.org/bugs/index.php?22089

If you add yourself to the Cc list, you'll be able to follow along on
its progress.

--
Cheers!
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/