Anirban Banerjee wrote:
> Hello everyone,
> I am going to start a small project to analyze how 8 websites are
> connected to each other, and would be grateful for any comments which
> people might have regarding the methodology I plan to follow.
>
> The problem: 8 websites with hyperlinks, images, JS, etc. embedded in
> them. I need to find how they are referencing each other. In general,
> I have to know where the links in the HTML code of these sites point
> to on the Internet.
Sounds good. Note that, of course, Wget doesn't process JavaScript, so
any JavaScript that could result in page accesses won't count. Also,
unfortunately, Wget does not currently comprehend links within CSS
(though it comprehends links _to_ CSS, of course). This means that,
for instance, CSS documents which are only @imported from other CSS
documents, rather than from within HTML, and images that are specified
as backgrounds or list-item decorations from within CSS, are also not
counted (we are actively working on improving this situation for Wget
1.12).

> What I plan to do: I will use wget as a crawler (I like the command
> line!) and extract the needed information. I will set the
> command-line parameters so that wget will visit sites two levels deep
> and extract them too.
>
> What exact info do I need: (a) the hyperlinks on each site, (b) the
> image file(s), (c) any binary stored on the sites, (d) the actual
> HTML code of the main page.

All sounds good, subject to the caveats given above. You probably know
this already, but as it's not always obvious to everyone, I should
point out that while Wget is certainly capable of grabbing "the actual
HTML code" of all web pages, it is _not_ generally capable of grabbing
the underlying PHP, CGI, ASP, CFM, etc., code: it is limited to no
more than what your normal browser can see (essentially, what's
visible from your browser's "View Source" function).

> A few questions:
> I plan to use
>
> nirvana$ wget -i input-urls -o logfile -x output-dir/ --random-wait 1.5 \
>     -U "put-in-mozilla-user-string" -r -l 2 -p "URL of a site"
>
> I have not mentioned using the noclobber/timeout/retries options but
> will probably use them.
>
> Using -x in the params tells wget to store all data from sitea.com to
> output-dir/sitea.com/, right?

Yes, but that's actually already the default when you specify -r. Note
that -x doesn't take an argument (you probably wanted -P).

Why -l 2?
That's liable to get you a pretty shallow view of the site. If you're
looking for interactions across multiple sites, you probably want -H,
perhaps along with -D <list of allowed domains>.

> Is it a good idea to modify the standard Mozilla user string to
> include my name and email, so that the admin of the site can contact
> me in case he does not like what I am doing?

It can't hurt. That's far from widespread practice, though. But, if
the admin _does_ decide phe doesn't like what you're doing, you'll at
least hopefully have earned a fair degree of respect for such
courtesy. :)

> The major problem which I foresee is: as I am using recursive
> downloads, some of the sites which will be downloaded may have very
> large files in their directories. I went through the wget manpage but
> could not find an option to set the reject list based on size. Type,
> yes, but not size. There was another post on groups.google which
> answered this question, but the solution was to download the data and
> then "not analyze" it if it is greater than, say, 10 MB. I need to
> stop wget from downloading it in the first place.

Yeah, unfortunately Wget doesn't currently offer such an option,
though there is currently an issue filed for that, targeted for 1.13:
https://savannah.gnu.org/bugs/index.php?20483.

Note, too, that as far as documentation goes, the Texinfo manual
("info wget") contains significantly more information than the manpage
does.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
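For what it's worth, here is a sketch of a corrected invocation along the lines discussed above. The input file, directory, and user-agent string are placeholders, not anything from the original message. Note also that --random-wait takes no argument: it randomizes around a base interval, which you set separately with --wait.

```shell
# Hypothetical corrected command line (placeholders throughout):
#   -P sets the output directory prefix; -x takes no argument, it just
#      forces host/directory structure (already implied by -r anyway).
#   --wait=1.5 sets the base pause; --random-wait varies it randomly.
#   -r -l 2 recurses two levels deep; -p also grabs page requisites
#      (images, CSS, etc.) for each page.
wget -i input-urls -o logfile \
     -P output-dir -x \
     --wait=1.5 --random-wait \
     -U "Mozilla/5.0 (compatible; my-crawler; me@example.com)" \
     -r -l 2 -p
```

For the cross-site analysis, -H (span hosts) with -D sitea.com,siteb.com,... would keep the recursion confined to the eight domains under study.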
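Until such an option exists, one workaround for the size problem is to check the Content-Length header before handing a URL to Wget at all, using --spider with -S/--server-response to fetch only the headers. A rough sketch — the function names and the 10 MB threshold here are made up, and it only works when the server actually reports a Content-Length:

```shell
# Skip any URL whose advertised size exceeds this (hypothetical) limit.
max_bytes=$((10 * 1024 * 1024))

content_length() {
    # Pull the Content-Length value out of server headers read on stdin.
    # Header lines look like "  Content-Length: 123456".
    awk 'tolower($1) == "content-length:" { print $2; exit }'
}

check_and_fetch() {
    url=$1
    # --spider makes a headers-only request (no body is downloaded);
    # -S prints the server's response, which wget sends to stderr.
    size=$(wget --spider -S "$url" 2>&1 | content_length)
    if [ -n "$size" ] && [ "$size" -gt "$max_bytes" ]; then
        echo "skipping $url ($size bytes)" >&2
    else
        wget "$url"
    fi
}
```

This obviously only filters individual URLs, not a whole recursive crawl, but it can be combined with a first --spider pass that collects candidate URLs.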
