Anirban Banerjee wrote:
> Hello everyone,
> I am going to start a small project to analyze how 8 websites are
> connected to each other, and would be grateful for any comments which
> people might have regarding the methodology I plan to follow.
>
> The problem: 8 websites with hyperlinks, images, JS, etc. embedded in
> them. I need to find how they are referencing each other. In general,
> I have to know where the links in the HTML code of these sites point
> to on the Internet.
Sounds good. Note that, of course, Wget doesn't process JavaScript, so
any JavaScript that could result in page accesses won't count. Also,
unfortunately, Wget does not currently comprehend links within CSS
(though it comprehends links _to_ CSS, of course). This means that,
for instance, CSS documents which are only @imported from other CSS
documents, rather than from within HTML, and images that are specified
as backgrounds or list-item decorations from within CSS, are also not
counted (we are actively working on improving this situation for Wget
1.12).

> What I plan to do: I will use wget as a crawler (I like the command
> line!) and extract the needed information. I will set the
> command-line parameters so that wget will visit sites two levels deep
> and extract them too.
>
> What exact info do I need: (a) the hyperlinks on each site, (b) the
> image file(s), (c) any binary stored on the sites, (d) the actual
> HTML code of the main page.

All sounds good, subject to the caveats given above. You probably know
this already, but as it's not always obvious to everyone, I should
point out that while Wget is certainly capable of grabbing "the actual
HTML code" of all web pages, it is _not_ generally capable of grabbing
the underlying PHP, CGI, ASP, CFM, etc., code: it is limited to no
more than what your normal browser can see (essentially, what's
visible from your browser's "View Source" function).

> A few questions:
> I plan to use
>
> nirvana$ wget -i input-urls -o logfile -x output-dir/ --random-wait 1.5 \
>     -U "put-in-mozilla-user-string" -r -l 2 -p "URL of a site"
>
> I have not mentioned using the noclobber/timeout/retries options but
> will probably use them.
>
> Using -x in the params tells wget to store all data from sitea.com to
> output-dir/sitea.com/, right?

Yes, but that's actually already the default when you specify -r. Note
that -x doesn't take an argument (you probably wanted -P).

Why -l 2?
That's liable to get you a pretty shallow view of the site. If you're
looking for interactions across multiple sites, you probably want -H,
perhaps along with -D <list of allowed domains>.

> Is it a good idea to modify the standard Mozilla user string to
> include my name and email, so that the admin of the site can contact
> me in case he does not like what I am doing?

It can't hurt. That's far from widespread practice, though. But, if
the admin _does_ decide phe doesn't like what you're doing, you'll at
least hopefully have earned a fair degree of respect for such
courtesy. :)

> The major problem which I foresee is: as I am using recursive
> downloads, some of the sites which will be downloaded may have very
> large files in their directories. I went through the wget manpage but
> could not find an option to set the reject list based on size. Type,
> yes, but not size. There was another post on groups.google which
> answered this question, but the solution was to download the data and
> then "not analyze" it if it is greater than, say, 10 MB. I need to
> stop wget from downloading it in the first place.

Yeah, unfortunately Wget doesn't currently offer such an option,
though there is currently an issue filed for that, targeted for 1.13:
https://savannah.gnu.org/bugs/index.php?20483.

Note, too, that as far as documentation goes, the Texinfo manual
("info wget") contains significantly more information than the manpage
does.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
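For what it's worth, here is a sketch of a corrected invocation along the lines discussed above. The input file, directory, and user-agent string are placeholders, not anything from the original message. Note also that --random-wait takes no argument: it randomizes around a base interval, which you set separately with --wait.

```shell
# Hypothetical corrected command line (placeholders throughout):
#   -P sets the output directory prefix; -x takes no argument, it just
#      forces host/directory structure (already implied by -r anyway).
#   --wait=1.5 sets the base pause; --random-wait varies it randomly.
#   -r -l 2 recurses two levels deep; -p also grabs page requisites
#      (images, CSS, etc.) for each page.
wget -i input-urls -o logfile \
     -P output-dir -x \
     --wait=1.5 --random-wait \
     -U "Mozilla/5.0 (compatible; my-crawler; me@example.com)" \
     -r -l 2 -p
```

For the cross-site analysis, -H (span hosts) with -D sitea.com,siteb.com,... would keep the recursion confined to the eight domains under study.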
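Until such an option exists, one workaround for the size problem is to check the Content-Length header before handing a URL to Wget at all, using --spider with -S/--server-response to fetch only the headers. A rough sketch — the function names and the 10 MB threshold here are made up, and it only works when the server actually reports a Content-Length:

```shell
# Skip any URL whose advertised size exceeds this (hypothetical) limit.
max_bytes=$((10 * 1024 * 1024))

content_length() {
    # Pull the Content-Length value out of server headers read on stdin.
    # Header lines look like "  Content-Length: 123456".
    awk 'tolower($1) == "content-length:" { print $2; exit }'
}

check_and_fetch() {
    url=$1
    # --spider makes a headers-only request (no body is downloaded);
    # -S prints the server's response, which wget sends to stderr.
    size=$(wget --spider -S "$url" 2>&1 | content_length)
    if [ -n "$size" ] && [ "$size" -gt "$max_bytes" ]; then
        echo "skipping $url ($size bytes)" >&2
    else
        wget "$url"
    fi
}
```

This obviously only filters individual URLs, not a whole recursive crawl, but it can be combined with a first --spider pass that collects candidate URLs.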
