Re: [SLUG] Spider a website
You could use wget to do this; it's installed on most distributions by default. Usually you'd run it like this:

    wget --mirror -np http://some.url/

(The -np tells it not to recurse up to the parent, which is useful if you only want to mirror a subdirectory. I add it out of habit.) It's not always perfect, however, as it can sometimes mess the URLs up, but it's worth a try anyway.

On 03/06/2008, at 2:20 PM, Peter Rundle wrote:

> I'm looking for some recommendations for a *simple* Linux based tool to
> spider a web site and pull the content back into plain html files,
> images, js, css etc. [...]
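For reference, the wget manual describes --mirror as shorthand for -r -N -l inf --no-remove-listing. A slightly fuller invocation for taking a static snapshot might look like the sketch below; http://some.url/ is only the placeholder from above, and the extra flags are optional additions worth double-checking against your wget version's manpage:

    # --mirror : recurse with no depth limit, re-fetching changed files
    # -np      : don't ascend to the parent directory
    # -k       : rewrite links in the saved pages so they work locally
    # -p       : also fetch page requisites (images, CSS, JS)
    # -E       : save text/html pages with a .html extension
    wget --mirror -np -k -p -E http://some.url/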
Re: [SLUG] Spider a website
Excerpts from Peter Rundle's message of Tue Jun 03 14:20:08 +1000 2008:

> I'm looking for some recommendations for a *simple* Linux based tool to
> spider a web site and pull the content back into plain html files,
> images, js, css etc. [...]

wget can do that. Use the recurse option.

rgh
--
+61 (0) 410 646 369
[EMAIL PROTECTED]

"You're worried criminals will continue to penetrate into cyberspace, and
I'm worried complexity, poor design and mismanagement will be there to
meet them" - Marcus Ranum
Re: [SLUG] Spider a website
On Tue, Jun 3, 2008 at 2:20 PM, Peter Rundle [EMAIL PROTECTED] wrote:

> I'm looking for some recommendations for a *simple* Linux based tool to
> spider a web site and pull the content back into plain html files,
> images, js, css etc. [...]

I'd use 'wget'. From what you describe, 'wget -r' should be very close to
what you want. Consult the manpage for details about fiddling with links
etc.

jml
Re: [SLUG] Spider a website
On Tue, 2008-06-03 at 14:20 +1000, Peter Rundle wrote:

> I'm looking for some recommendations for a *simple* Linux based tool to
> spider a web site and pull the content back into plain html files,
> images, js, css etc. [...]

wget :)

-Rob
--
GPG key available at: http://www.robertcollins.net/keys.txt
[SLUG] Spider a website
I'm looking for some recommendations for a *simple* Linux based tool to
spider a web site and pull the content back into plain html files, images,
js, css etc.

I have a site written in PHP which needs to be hosted temporarily on a
server which is incapable (read: only does static content). This is not a
problem from a temp presentation point of view as the default values for
each page will suffice. So I'm just looking for a tool which will quickly
pull the real site (on my home php capable server) into a directory that I
can zip and send to the internet addressable server.

I know there's a lot of code out there, I'm asking for recommendations.

TIA's
Pete
Re: [SLUG] Spider a website
On 03/06/2008, at 3:19 PM, Mary Gardiner wrote:

> On Tue, Jun 03, 2008, Ycros wrote:
>> It's not always perfect however, as it can sometimes mess the URLs up,
>> but it's worth a try anyway.
>
> The -k option to convert any absolute paths to relative ones can be
> helpful with this (depending on what you meant by mess the URLs up).

I think it was URLs in stylesheets and in javascript (well, there's not
much you can do with the javascript really).
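If absolute URLs do survive in the stylesheets, a rough post-processing pass over the mirrored tree can catch most of them. This is only a sketch, assuming the site lives at the placeholder http://some.url/ and that rewriting those references to site-relative paths is acceptable; check the result before zipping it up:

    # find mirrored files that still reference the original host and
    # rewrite those references to site-relative paths
    grep -rl 'http://some.url/' . | xargs sed -i 's|http://some\.url/|/|g'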
Re: [SLUG] Spider a website
On Tue, Jun 03, 2008, Ycros wrote:

> It's not always perfect however, as it can sometimes mess the URLs up,
> but it's worth a try anyway.

The -k option to convert any absolute paths to relative ones can be
helpful with this (depending on what you meant by mess the URLs up).

-Mary
Re: [SLUG] Spider a website
Peter Rundle [EMAIL PROTECTED] writes:

> I'm looking for some recommendations for a *simple* Linux based tool to
> spider a web site and pull the content back into plain html files,
> images, js, css etc.

Others have suggested wget, which works very well. You might also consider
'puf':

    Package: puf
    Priority: optional
    Section: universe/web
    Description: Parallel URL fetcher
     puf is a download tool for UNIX-like systems. You may use it to
     download single files or to mirror entire servers. It is similar to
     GNU wget (and has a partly compatible command line), but has the
     ability to do many downloads in parallel. This is very interesting,
     if you have a high-bandwidth internet connection.

This works quite well when, as it notes, presented with sufficient
bandwidth (and server resources) to have multiple links fetched at once.

Regards,
Daniel
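puf's exact options are best taken from its own manpage, but if it isn't packaged for your distro, the parallel-fetch idea can be approximated with stock tools. A minimal sketch using xargs and wget rather than puf itself, assuming a hypothetical urls.txt listing the pages to pull:

    # fetch up to four URLs at a time; urls.txt is a hypothetical list
    # of links, one per line
    xargs -P 4 -n 1 wget -p -k < urls.txt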
Re: [SLUG] Spider a website
wget-smubble-yew-get. Wget works great for getting a single file or a very
simple all-under-this-tree setup, but it can take forever.

Try httrack - http://www.httrack.com/. Ignore the pretty little
screenshots; the Linux command-line version does the same job, it just
requires some command-line-fu. It handles simple javascript links, is
intelligent about fetching requisites (images, css etc.) from off-domain
without trying to cache the whole internet, is multi-threaded - and is
actually designed specifically for the purpose of making a static, offline
copy of a website.

The user's guide at http://www.httrack.com/html/fcguide.html goes through
most common scenarios for you, and $DISTRO should be able to apt-get
install it for you. Urrr.. or whatever broken tool distros unfortunate
enough not to have apt-get use.

On Tue, Jun 3, 2008 at 2:20 PM, Peter Rundle [EMAIL PROTECTED] wrote:

> I'm looking for some recommendations for a *simple* Linux based tool to
> spider a web site and pull the content back into plain html files,
> images, js, css etc. [...]

--
"There is nothing more worthy of contempt than a man who quotes himself"
- Zhasper, 2004
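For concreteness, a minimal httrack run along those lines might look like the sketch below. The URL, output directory, and filter are placeholders, and the exact filter syntax is worth checking against the user's guide linked above:

    # install it (Debian/Ubuntu)
    apt-get install httrack

    # mirror the site into ./mysite, following only links on the
    # original host, with verbose output
    httrack "http://some.url/" -O ./mysite "+*.some.url/*" -v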