Re: [SLUG] Spider a website

2008-06-02 Thread Ycros
You could use wget to do this; it's installed by default on most
distributions.


Usually you'd run it like this: wget --mirror -np http://some.url/
(the -np tells it not to ascend to the parent directory, which is useful if
you only want to mirror a subdirectory. I add it out of habit.)


It's not always perfect, however, as it can sometimes mess the URLs up,
but it's worth a try anyway.
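
For the static-snapshot use case here, combining that with a few more options
should give a self-contained offline copy. A rough sketch (the URL is a
placeholder; check wget(1) for the exact flag names on your version):

# --mirror: recursive retrieval with timestamping; --no-parent (-np): don't ascend above the start directory
# --page-requisites (-p): also fetch the images/CSS/JS each page needs
# --convert-links (-k): rewrite links so they point at the local copies
# --html-extension (-E): save pages with a .html suffix, handy when the source URLs end in .php
wget --mirror --no-parent --page-requisites --convert-links --html-extension http://some.url/subdir/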


On 03/06/2008, at 2:20 PM, Peter Rundle wrote:

I'm looking for some recommendations for a *simple* Linux based tool  
to spider a web site and pull the content back into plain html  
files, images, js, css etc.


I have a site written in PHP which needs to be hosted temporarily on  
a server which is incapable (read only does static content). This is  
not a problem from a temp presentation point of view as the default  
values for each page will suffice. So I'm just looking for a tool  
which will quickly pull the real site (on my home php capable  
server) into a directory that I can zip and send to the internet  
addressable server.


I know there's a lot of code out there, I'm asking for  
recommendations.


TIA's

Pete

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Richard Heycock
Excerpts from Peter Rundle's message of Tue Jun 03 14:20:08 +1000 2008:
 I'm looking for some recommendations for a *simple* Linux based tool to spider
 a web site and pull the content back into 
 plain html files, images, js, css etc.
 
 I have a site written in PHP which needs to be hosted temporarily on a server
 which is incapable (read only does static 
 content). This is not a problem from a temp presentation point of view as the
 default values for each page will suffice. 
 So I'm just looking for a tool which will quickly pull the real site (on my
 home php capable server) into a directory 
 that I can zip and send to the internet addressable server.
 
 I know there's a lot of code out there, I'm asking for recommendations.

wget can do that. Use the recurse option.
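
For what it's worth, the recursive option is -r (--recursive), so the
simplest form would be something like this (placeholder URL):

# follow links from the start page and fetch the whole tree
wget -r http://example.com/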

rgh

 TIA's
 
 Pete
 

-- 
+61 (0) 410 646 369
[EMAIL PROTECTED]

You're worried criminals will continue to penetrate into cyberspace, and
I'm worried complexity, poor design and mismanagement will be there to meet
them - Marcus Ranum
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Jonathan Lange
On Tue, Jun 3, 2008 at 2:20 PM, Peter Rundle [EMAIL PROTECTED] wrote:
 I'm looking for some recommendations for a *simple* Linux based tool to
 spider a web site and pull the content back into plain html files, images,
 js, css etc.

 I have a site written in PHP which needs to be hosted temporarily on a
 server which is incapable (read only does static content). This is not a
 problem from a temp presentation point of view as the default values for
 each page will suffice. So I'm just looking for a tool which will quickly
 pull the real site (on my home php capable server) into a directory that I
 can zip and send to the internet addressable server.

 I know there's a lot of code out there, I'm asking for recommendations.


I'd use 'wget'. From what you describe, 'wget -r' should be very close
to what you want. Consult the manpage for details about fiddling with
links etc.

jml
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Robert Collins
On Tue, 2008-06-03 at 14:20 +1000, Peter Rundle wrote:
 I'm looking for some recommendations for a *simple* Linux based tool to 
 spider a web site and pull the content back into 
 plain html files, images, js, css etc.
 
 I have a site written in PHP which needs to be hosted temporarily on a server 
 which is incapable (read only does static 
 content). This is not a problem from a temp presentation point of view as the 
 default values for each page will suffice. 
 So I'm just looking for a tool which will quickly pull the real site (on my 
 home php capable server) into a directory 
 that I can zip and send to the internet addressable server.
 
 I know there's a lot of code out there, I'm asking for recommendations.

wget :)

-Rob
-- 
GPG key available at: http://www.robertcollins.net/keys.txt.


-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html

[SLUG] Spider a website

2008-06-02 Thread Peter Rundle
I'm looking for some recommendations for a *simple* Linux-based tool to spider a web site and pull the content back
into plain HTML files, images, JS, CSS, etc.


I have a site written in PHP which needs to be hosted temporarily on a server which is incapable of running it
(read: it only does static content). This is not a problem from a temporary presentation point of view, as the
default values for each page will suffice. So I'm just looking for a tool which will quickly pull the real site
(on my home PHP-capable server) into a directory that I can zip and send to the internet-addressable server.


I know there's a lot of code out there, I'm asking for recommendations.

TIA's

Pete

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Ycros

On 03/06/2008, at 3:19 PM, Mary Gardiner wrote:


On Tue, Jun 03, 2008, Ycros wrote:
 It's not always perfect however, as it can sometimes mess the URLs up,
 but it's worth a try anyway.

The -k option to convert any absolute paths to relative ones can be
helpful with this (depending on what you meant by mess the URLs up).


I think it was URLs in stylesheets and in JavaScript (well, there's not
much you can do about the JavaScript, really).

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Mary Gardiner
On Tue, Jun 03, 2008, Ycros wrote:
 It's not always perfect however, as it can sometimes mess the URLs up,  
 but it's worth a try anyway.

The -k option to convert any absolute paths to relative ones can be
helpful with this (depending on what you meant by mess the URLs up).
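
Concretely, that would be something along these lines (placeholder URL):

# -r: recursive; -k/--convert-links: rewrite links in the downloaded pages to point at the local copies
wget -r -k http://example.com/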

-Mary
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread Daniel Pittman
Peter Rundle [EMAIL PROTECTED] writes:

 I'm looking for some recommendations for a *simple* Linux based tool
 to spider a web site and pull the content back into plain html files,
 images, js, css etc.

Others have suggested wget, which works very well.  You might also
consider 'puf':

Package: puf
Priority: optional
Section: universe/web
Description: Parallel URL fetcher
 puf is a download tool for UNIX-like systems. You may use it to download
 single files or to mirror entire servers. It is similar to GNU wget
 (and has a partly compatible command line), but has the ability to do
 many downloads in parallel. This is very interesting, if you have a
 high-bandwidth internet connection.

This works quite well when, as it notes, presented with sufficient
bandwidth (and server resources) to have multiple links fetched at once.
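
I haven't checked puf's options myself, but going by that description the
simplest use is just listing URLs on the command line and letting it fetch
them in parallel; see puf(1) for the mirroring flags. Placeholder URLs:

# fetch both files concurrently
puf http://example.com/a.tar.gz http://example.com/b.tar.gz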

Regards,
Daniel
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html


Re: [SLUG] Spider a website

2008-06-02 Thread James Polley
wget-smubble-yew-get. Wget works great for getting a single file or a very
simple all-under-this-tree setup, but it can take forever.

Try httrack: http://www.httrack.com/. Ignore the pretty little screenshots;
the Linux command-line version does the same job, it just requires more
command-line-fu. It handles simple JavaScript links, is intelligent about
fetching requisites (images, CSS, etc.) from off-domain without trying to
cache the whole internet, is multi-threaded, and is actually designed
specifically for making a static, offline copy of a website.

The user's guide at http://www.httrack.com/html/fcguide.html goes through the
most common scenarios, and $DISTRO should be able to apt-get install it for
you (or whatever tool distros unfortunate enough not to have apt-get use).
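
For reference, a typical invocation follows the pattern in that guide; the
URL, output path and filter below are placeholders:

# mirror the site into /tmp/mysite, staying within the example.com domain, verbose output
httrack "http://www.example.com/" -O "/tmp/mysite" "+*.example.com/*" -v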

On Tue, Jun 3, 2008 at 2:20 PM, Peter Rundle [EMAIL PROTECTED]
wrote:

 I'm looking for some recommendations for a *simple* Linux based tool to
 spider a web site and pull the content back into plain html files, images,
 js, css etc.

 I have a site written in PHP which needs to be hosted temporarily on a
 server which is incapable (read only does static content). This is not a
 problem from a temp presentation point of view as the default values for
 each page will suffice. So I'm just looking for a tool which will quickly
 pull the real site (on my home php capable server) into a directory that I
 can zip and send to the internet addressable server.

 I know there's a lot of code out there, I'm asking for recommendations.

 TIA's

 Pete

 --
 SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
 Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html




-- 
There is nothing more worthy of contempt than a man who quotes himself -
Zhasper, 2004
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html