Re: Scraping an Entire Website

2015-02-24 Thread 'Chris Blouch' via MacVisionaries

I have macports set up on my machine

https://www.macports.org

so that makes it easy to add additional packages. With that I just do

sudo port install wget

enter my computer's password and it installs wget. Once you have wget 
you can crawl/archive an entire site with this command:


wget -mcrpk -o process.log http://www.somesite.com

Parameters (note that wget's flags are case-sensitive, so these are all 
lowercase):

-o : writes the output of the process to a file instead of the display. 
In this case I logged everything to process.log.

-m : mirror; turns on timestamping and recursion so the copy tracks the 
original site.
-c : continues a partly-downloaded transfer. Probably not as big an issue 
on a functioning web site.
-r : recursion, but this might not have been needed since -m already 
implies it. Figured it didn't hurt.

-p : downloads any page requisites like CSS, images, etc.
-k : converts the links in the downloaded pages so they point at the 
local copies instead of back at the original site or path.
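For readability, the same flags can be spelled out as GNU long options. The sketch below is a dry run against a placeholder URL: it only prints the command it would run, and removing the leading echo would actually start the crawl.

```shell
#!/bin/sh
# Placeholder URL -- substitute the real site.
site="http://www.somesite.com"

# Long-option spelling of -mcpk (-r is implied by --mirror, so it is
# omitted here); -o still sends progress messages to process.log.
flags="--mirror --continue --page-requisites --convert-links"

# Dry run: print the command so it can be checked first.
# Remove the leading 'echo' to actually start the crawl.
echo wget $flags -o process.log "$site"
```

This prints the full command line, which is a cheap way to double-check the flags before kicking off a long crawl.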


That should be it. You'll end up with a folder with everything in it. 
You can turn on Apache on your Mac and browse the copy with Safari if 
you like. You can also open the files directly in Safari, but some 
things, like Ajax-driven dynamic content, won't work. You probably won't 
want to re-host it as-is, since some shared assets will be duplicated. 
Another caveat: if the site has secret URLs that aren't linked from 
anywhere, the crawler won't be able to find them.
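If you'd rather not configure Apache, Python's built-in web server is enough to preview the mirror. The folder name below is hypothetical (wget names the mirror after the host), and on the Python 2 that shipped with OS X at the time the equivalent command is "python -m SimpleHTTPServer 8000".

```shell
#!/bin/sh
# Hypothetical folder name -- wget names the mirror after the host.
mirror="www.somesite.com"
mkdir -p "$mirror"        # no-op if the mirror already exists
cd "$mirror"

# Start a throwaway web server on port 8000 in the background;
# then browse http://localhost:8000/ in Safari.
python3 -m http.server 8000 >/dev/null 2>&1 &
server=$!
echo "Serving $mirror at http://localhost:8000/ (kill $server to stop)"
```

Since -k rewrote the links to local copies, the mirrored pages should navigate normally under this server, minus any server-side dynamic content.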


CB

On 2/22/15 6:54 PM, Sabahattin Gucukoglu wrote:

Yep, both httrack and wget are options for OS X, and they’re both command-line 
accessible.  I’d choose httrack first, as that’s generally better at this sort 
of thing, but wget will work also if the site is not too complex and/or you 
just want the static files.



--
¯\_(ツ)_/¯

--
You received this message because you are subscribed to the Google Groups 
MacVisionaries group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to macvisionaries+unsubscr...@googlegroups.com.
To post to this group, send email to macvisionaries@googlegroups.com.
Visit this group at http://groups.google.com/group/macvisionaries.
For more options, visit https://groups.google.com/d/optout.


Re: Scraping an Entire Website

2015-02-23 Thread Esther
Hi Jeff,

In addition to using wget from the command line in Terminal, as Greg and Jon 
suggested, there's a Mac App Store app named SiteSucker that works through a 
GUI.  It's $4.99, but it was free until about a year ago, so I had the 
opportunity to try SiteSucker out with VoiceOver when I answered a question 
about web site downloaders a few years ago on the Mac-access list.  A quick 
run-through of the current app did not show any accessibility issues (under the 
latest version of Mavericks).  In fact, at the time I first tried this app, the 
only change I made to my default VoiceOver settings was to set Navigation > 
Mouse pointer to "Follows VoiceOver cursor" instead of "Ignores VoiceOver 
cursor" so I could click Settings in the app's toolbar.  That's not 
necessary now.

Here's the Mac App Store URL for SiteSucker:
• SiteSucker by Rick Cranisky ($4.99) 
https://itunes.apple.com/app/sitesucker/id442168834?mt=12

The app is localized in English, French, German, Italian, Portuguese, and 
Spanish, but the SiteSucker Manual is only available in English, French, and 
Portuguese.  Here's the URL for the User Manual in English (which will display 
in Safari Reader with Command-Shift-r):
http://ricks-apps.com/osx/sitesucker/archive/2.x/2.6.x/2.6/manuals/en/index.html

Note that for the web site user guide, I route my VoiceOver cursor to each of 
the main links in the list of three items (Overview, Settings, and 
Advanced Topics) with VO-Command-F5, and then use VO-Right to navigate to the 
level 2 list of items under that link.

SiteSucker is also scriptable with AppleScript, with information and sample 
scripts available from this URL:
http://ricks-apps.com/osx/sitesucker/scripts.html

The main caveats are that there are a lot of customizable options, and you 
probably don't want to pull down every file on the site.  An old tip from the 
MacUpdate site suggested paying attention to the Settings > General > 
Path Constraint pop-up menu options if you want to pull down only a subset of 
the files.

HTH

Esther

On Sunday, February 22, 2015 at 1:55:23 PM UTC-10, Sabahattin Gucukoglu wrote:
 Yep, both httrack and wget are options for OS X, and they’re both 
 command-line accessible.  I’d choose httrack first, as that’s generally 
 better at this sort of thing, but wget will work also if the site is not too 
 complex and/or you just want the static files.



Re: Scraping an Entire Website

2015-02-22 Thread BobH.
Not sure how to do that, but I have to advise that you might not get an 
accurate copy if there is dynamic code from .php or .asp and other 
script-built pages.

Doesn't the service provider or host have a lost-password recovery process? 
My hosting uses a cPanel interface, and I know there is one in that.

RobH.

- Original Message - 
From: Jeff Berwick mailingli...@berwick.name
To: macvisionaries@googlegroups.com
Sent: Sunday, February 22, 2015 6:01 PM
Subject: Scraping an Entire Website


Hi there,

I have a customer who has lost their password to their old website, so I am 
building them a new one.  Are there any accessible programs out there to 
pull down an entire website, pictures and all?

Thx,
Jeff



Re: Scraping an Entire Website

2015-02-22 Thread Aman Singer
Hello Jeff,

You might like to try HTTrack:
http://www.httrack.com/
To the best of my knowledge, there is a command-line version of this
software, but I'm not sure whether it will work accessibly with
Terminal. I have no reason to think it won't, you understand; I
simply haven't tried it. The Windows version is a pain in the neck to
use, but the command line isn't too bad on Windows.
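For comparison, a minimal HTTrack invocation generally looks like the sketch below. The URL and output folder are placeholders, and the leading echo makes it a dry run; the quoted "+" pattern is an HTTrack scan filter that keeps the crawl on the original host.

```shell
#!/bin/sh
# Placeholder values -- substitute the real site and a folder of your choice.
site="http://www.somesite.com/"
out="./somesite-mirror"

# -O sets the output folder; the +... filter restricts the crawl to the
# original host; -v prints progress. Remove the leading 'echo' to run it.
echo httrack "$site" -O "$out" "+www.somesite.com/*" -v
```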
Aman



Re: Scraping an Entire Website

2015-02-22 Thread BobH.
Strikes me the provider wasn't trying too hard, for whatever reason.  I had 
one like that: I'd just paid, then they screwed it up, and I took the site and 
domain elsewhere, but lost a year's service I'd paid for. You can recover 
domain names if you still have the Nominet details one gets when buying domain 
names.

RobH.
- Original Message - 
From: Jeff Berwick mailingli...@berwick.name
To: macvisionaries@googlegroups.com
Sent: Sunday, February 22, 2015 6:46 PM
Subject: Re: Scraping an Entire Website


Really, all I want is the images, so I'm okay if some dynamic text is lost.

I know that this client has been working with their provider and they 
haven't been able to recover the password.  As a result, we've had to set up 
a new domain name and whole new site.  He just doesn't want to lose the 
images on the old site, and he wants to incorporate them into the new 
design.

Jeff

 On Feb 22, 2015, at 1:41 PM, BobH. long.c...@virgin.net wrote:

 Not sure how to do that, but have to advise that you might not get an
 accurate copy if there is dynamic code from .php or .asp and other
 script-built pages.

 Doesn't the service provider or host, have a lost password recovery 
 process?
 my hosting uses a cPanel interface and know there is one in that.

 RobH.

 - Original Message - 
 From: Jeff Berwick mailingli...@berwick.name
 To: macvisionaries@googlegroups.com
 Sent: Sunday, February 22, 2015 6:01 PM
 Subject: Scraping an Entire Website


 Hi there,

 I have a customer who has lost their password to their old website, so I 
 am
 building them a new one.  Are there any accessible programs out there to
 pull down an entire website, pictures and all?

 Thx,
 Jeff



Re: Scraping an Entire Website

2015-02-22 Thread Jeff Berwick
Really, all I want is the images, so I'm okay if some dynamic text is lost.

I know that this client has been working with their provider and they haven't 
been able to recover the password.  As a result, we've had to set up a new 
domain name and whole new site.  He just doesn't want to lose the images on the 
old site, and he wants to incorporate them into the new design.

Jeff

 On Feb 22, 2015, at 1:41 PM, BobH. long.c...@virgin.net wrote:
 
 Not sure how to do that, but have to advise that you might not get an 
 accurate copy if there is dynamic code from .php or .asp and other 
 script-built pages.
 
 Doesn't the service provider or host, have a lost password recovery process? 
 my hosting uses a cPanel interface and know there is one in that.
 
 RobH.
 
 - Original Message - 
 From: Jeff Berwick mailingli...@berwick.name
 To: macvisionaries@googlegroups.com
 Sent: Sunday, February 22, 2015 6:01 PM
 Subject: Scraping an Entire Website
 
 
 Hi there,
 
 I have a customer who has lost their password to their old website, so I am 
 building them a new one.  Are there any accessible programs out there to 
 pull down an entire website, pictures and all?
 
 Thx,
 Jeff
 


Re: Scraping an Entire Website

2015-02-22 Thread Jonathan C Cohn
There used to be a version of wget that came with the Macintosh for use in 
Terminal. The closest equivalents now seem to be the Perl scripts lwp-request 
and lwp-download for downloading from sites.

I could not tell from a quick review whether either of these does recursive 
downloads of a site.


 On Feb 22, 2015, at 14:23, BobH. long.c...@virgin.net wrote:
 
 Strikes me the provider wasn't trying too hard, for whatever reason.  I had 
 one like that, just paid, then they screwed it and I took site and domain 
 elsewhere, but lost a years service I'd paid for. You can recover domain 
 names if you still have the Nominet details one gets when buying domain 
 names.
 
 RobH.
 - Original Message - 
 From: Jeff Berwick mailingli...@berwick.name
 To: macvisionaries@googlegroups.com
 Sent: Sunday, February 22, 2015 6:46 PM
 Subject: Re: Scraping an Entire Website
 
 
 Really, all I want is the images, so I'm okay if some dynamic text is lost.
 
 I know that this client has been working with their provider and they 
 haven't been able to recover the password.  As a result, we've had to set up 
 a new domain name and whole new site.  He just doesn't want to lose the 
 images on the old site, and he wants to incorporate them into the new 
 design.
 
 Jeff
 
 On Feb 22, 2015, at 1:41 PM, BobH. long.c...@virgin.net wrote:
 
 Not sure how to do that, but have to advise that you might not get an
 accurate copy if there is dynamic code from .php or .asp and other
 script-built pages.
 
 Doesn't the service provider or host, have a lost password recovery 
 process?
 my hosting uses a cPanel interface and know there is one in that.
 
 RobH.
 
 - Original Message - 
 From: Jeff Berwick mailingli...@berwick.name
 To: macvisionaries@googlegroups.com
 Sent: Sunday, February 22, 2015 6:01 PM
 Subject: Scraping an Entire Website
 
 
 Hi there,
 
 I have a customer who has lost their password to their old website, so I 
 am
 building them a new one.  Are there any accessible programs out there to
 pull down an entire website, pictures and all?
 
 Thx,
 Jeff
 


Re: Scraping an Entire Website

2015-02-22 Thread gkearney
What you want is DeepVacuum: http://www.hexcat.com/deepvacuum/index.html
It's a GUI version of wget that will do what you want.


On Sunday, 22 February 2015 10:01:13 UTC-8, jberwick wrote:

 Hi there, 

 I have a customer who has lost their password to their old website, so I 
 am building them a new one.  Are there any accessible programs out there to 
 pull down an entire website, pictures and all? 

 Thx, 
 Jeff 





Re: Scraping an Entire Website

2015-02-22 Thread Sabahattin Gucukoglu
Yep, both httrack and wget are options for OS X, and they’re both command-line 
accessible.  I’d choose httrack first, as that’s generally better at this sort 
of thing, but wget will work also if the site is not too complex and/or you 
just want the static files.
