Re: Scraping an Entire Website
I have MacPorts set up on my machine (https://www.macports.org), so that makes it easy to add additional packages. With that I just do "sudo port install wget", enter my computer's password, and it installs wget. Once you have wget you can crawl/archive an entire site with this command:

wget -mcrpk -o process.log http://www.somesite.com

Parameters:
-o - write the output of the process to a file instead of the display. In this case I logged everything to process.log.
-m - mirror: turns on recursion and timestamping so you get a full copy of the site.
-c - continue a partly-downloaded transfer. Probably not as big an issue on a functioning web site.
-r - recursion; this might not have been needed with -m, but I figured it didn't hurt.
-p - download any page dependencies like CSS, images, etc.
-k - convert links for local viewing, so the pages don't keep trying to link off to the original site or path.

That should be it. You'll end up with a folder with everything in it. You can turn on Apache on your Mac and browse the site with Safari if you like. You can also open the files in Safari directly, but some things, like Ajaxed dynamic content, won't work. You probably won't want to re-host it as is, since some shared assets will be duplicated. Another caveat is that if you have secret URLs that are not linked from anywhere on your site, the crawler will not be able to find them.

CB

On 2/22/15 6:54 PM, Sabahattin Gucukoglu wrote:
Yep, both httrack and wget are options for OS X, and they’re both command-line accessible. I’d choose httrack first, as that’s generally better at this sort of thing, but wget will work also if the site is not too complex and/or you just want the static files.
-- 
¯\_(ツ)_/¯
-- 
You received this message because you are subscribed to the Google Groups MacVisionaries group. To unsubscribe from this group and stop receiving emails from it, send an email to macvisionaries+unsubscr...@googlegroups.com. To post to this group, send email to macvisionaries@googlegroups.com.
Visit this group at http://groups.google.com/group/macvisionaries. For more options, visit https://groups.google.com/d/optout.
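A slightly politer variant of the mirror command above, written as a dry-run sketch (www.example.com is a placeholder URL, not from the thread; -np keeps the crawl from wandering above the starting directory, and -w adds a pause between requests):

```shell
# Dry-run sketch of a wget mirror with a couple of extra safeguards.
# www.example.com is a placeholder; substitute the real site.
SITE="http://www.example.com"
# -m mirror (implies -r), -c continue, -p page requisites, -k convert links,
# -np never ascend to the parent directory, -w 1 wait one second per fetch
CMD="wget -mcpk -np -w 1 -o process.log $SITE"
echo "$CMD"  # print only; paste the printed line into Terminal to download
```

The echo makes this safe to run as-is; nothing is fetched until you run the printed command yourself.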
Re: Scraping an Entire Website
Hi Jeff,

In addition to using wget from the command line in Terminal, as Greg and Jon suggested, there's a Mac App Store app named SiteSucker that works through a graphical interface. It's $4.99, but it was free up until about a year ago, so I had the opportunity to try SiteSucker out with VoiceOver when I answered a question about web site downloaders a few years ago on the Mac-access list. A quick run-through of the current app did not show any accessibility issues (under the latest version of Mavericks). In fact, at the time I first tried this app, the only change I made to my default VoiceOver settings was to set Navigation > Mouse pointer to "Follows VoiceOver cursor" instead of "Ignores VoiceOver cursor" in order to click Settings in the app's toolbar. That's not necessary now.

Here's the Mac App Store URL for SiteSucker:
• SiteSucker by Rick Cranisky ($4.99)
https://itunes.apple.com/app/sitesucker/id442168834?mt=12

The app is localized in English, French, German, Italian, Portuguese, and Spanish, but the SiteSucker Manual is only available in English, French, and Portuguese. Here's the URL for the User Manual in English (which will display in Safari Reader with Command-Shift-R):
http://ricks-apps.com/osx/sitesucker/archive/2.x/2.6.x/2.6/manuals/en/index.html

Note that for the web site user guide, I route my VoiceOver cursor to each of the main links in the list of three items (Overview, Settings, and Advanced Topics) with VO-Command-F5, and then VO-Right to navigate to the level 2 list of items under that link.

SiteSucker is also AppleScriptable, with information and samples available from this URL:
http://ricks-apps.com/osx/sitesucker/scripts.html

The main comments are that there are a lot of customizable options, and you probably don't want to pull down every file on the site.
An old tip from the MacUpdate site suggested that you might want to pay attention to limiting the options in the Settings > General > Path Constraint pop-up menu if you only want to pull down a subset of the files.

HTH,
Esther

On Sunday, February 22, 2015 at 1:55:23 PM UTC-10, Sabahattin Gucukoglu wrote:
Yep, both httrack and wget are options for OS X, and they’re both command-line accessible. I’d choose httrack first, as that’s generally better at this sort of thing, but wget will work also if the site is not too complex and/or you just want the static files.
Re: Scraping an Entire Website
Not sure how to do that, but I have to advise that you might not get an accurate copy if there is dynamic code from .php or .asp and other script-built pages. Doesn't the service provider or host have a lost-password recovery process? My hosting uses a cPanel interface, and I know there is one in that.

RobH.

- Original Message -
From: Jeff Berwick mailingli...@berwick.name
To: macvisionaries@googlegroups.com
Sent: Sunday, February 22, 2015 6:01 PM
Subject: Scraping an Entire Website

Hi there,

I have a customer who has lost their password to their old website, so I am building them a new one. Are there any accessible programs out there to pull down an entire website, pictures and all?

Thx,
Jeff
Re: Scraping an Entire Website
Hello Jeff,

You might like to try HTTrack (http://www.httrack.com/). To the best of my knowledge, there is a command-line version of this software, but I'm not sure whether it will work accessibly with Terminal. I have no reason to think it won't, you understand; I simply haven't tried it. The Windows version is a pain in the neck to use, but the command line isn't too bad on Windows.

Aman
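As a rough sketch of the command-line usage (a dry run; www.example.com and the ./mirror output folder are placeholders, not from the thread):

```shell
# Dry-run sketch of mirroring a site with httrack's command-line version.
# www.example.com and ./mirror are placeholders; substitute your own.
SITE="http://www.example.com/"
# -O sets the output path; httrack recurses and rewrites links by default
CMD="httrack $SITE -O ./mirror"
echo "$CMD"  # print only; paste the printed line into Terminal to run it
```

httrack's defaults stay within the starting site, so this is usually enough for a simple archive.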
Re: Scraping an Entire Website
Strikes me the provider wasn't trying too hard, for whatever reason. I had one like that: I just paid, then they screwed it up, and I took the site and domain elsewhere, but lost a year's service I'd paid for. You can recover domain names if you still have the Nominet details one gets when buying domain names.

RobH.

- Original Message -
From: Jeff Berwick mailingli...@berwick.name
To: macvisionaries@googlegroups.com
Sent: Sunday, February 22, 2015 6:46 PM
Subject: Re: Scraping an Entire Website

Really, all I want is the images, so I'm okay if some dynamic text is lost. I know that this client has been working with their provider and they haven't been able to recover the password. As a result, we've had to set up a new domain name and a whole new site. He just doesn't want to lose the images on the old site, and he wants to incorporate them into the new design.

Jeff

On Feb 22, 2015, at 1:41 PM, BobH. long.c...@virgin.net wrote:
Not sure how to do that, but I have to advise that you might not get an accurate copy if there is dynamic code from .php or .asp and other script-built pages. Doesn't the service provider or host have a lost-password recovery process? My hosting uses a cPanel interface, and I know there is one in that.

RobH.

- Original Message -
From: Jeff Berwick mailingli...@berwick.name
To: macvisionaries@googlegroups.com
Sent: Sunday, February 22, 2015 6:01 PM
Subject: Scraping an Entire Website

Hi there,

I have a customer who has lost their password to their old website, so I am building them a new one. Are there any accessible programs out there to pull down an entire website, pictures and all?

Thx,
Jeff
Re: Scraping an Entire Website
Really, all I want is the images, so I'm okay if some dynamic text is lost. I know that this client has been working with their provider and they haven't been able to recover the password. As a result, we've had to set up a new domain name and a whole new site. He just doesn't want to lose the images on the old site, and he wants to incorporate them into the new design.

Jeff

On Feb 22, 2015, at 1:41 PM, BobH. long.c...@virgin.net wrote:
Not sure how to do that, but I have to advise that you might not get an accurate copy if there is dynamic code from .php or .asp and other script-built pages. Doesn't the service provider or host have a lost-password recovery process? My hosting uses a cPanel interface, and I know there is one in that.

RobH.

- Original Message -
From: Jeff Berwick mailingli...@berwick.name
To: macvisionaries@googlegroups.com
Sent: Sunday, February 22, 2015 6:01 PM
Subject: Scraping an Entire Website

Hi there,

I have a customer who has lost their password to their old website, so I am building them a new one. Are there any accessible programs out there to pull down an entire website, pictures and all?

Thx,
Jeff
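Since the goal here is just the images, one possible wget variant restricts the crawl to image files. This is a dry-run sketch; the URL and the extension list are assumptions, not from the thread:

```shell
# Dry-run sketch: recursive crawl that keeps only image files.
# www.example.com is a placeholder; adjust the -A extension list as needed.
SITE="http://www.example.com"
# -r recursive, -A accept only these suffixes (HTML pages are still fetched
# to follow links, then deleted), -nd no directory tree, -P save into ./images
CMD="wget -r -A jpg,jpeg,png,gif -nd -P images $SITE"
echo "$CMD"  # print only; paste the printed line to actually download
```

The result is a flat ./images folder, which suits pulling pictures into a new design.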
Re: Scraping an Entire Website
There used to be a version of wget that came with the Macintosh for use in Terminal. The only close examples now seem to be the Perl scripts for downloading, lwp-request and lwp-download. I could not tell from a quick review whether either of these does recursive downloads of a site.

On Feb 22, 2015, at 14:23, BobH. long.c...@virgin.net wrote:
Strikes me the provider wasn't trying too hard, for whatever reason. I had one like that: I just paid, then they screwed it up, and I took the site and domain elsewhere, but lost a year's service I'd paid for. You can recover domain names if you still have the Nominet details one gets when buying domain names.

RobH.

- Original Message -
From: Jeff Berwick mailingli...@berwick.name
To: macvisionaries@googlegroups.com
Sent: Sunday, February 22, 2015 6:46 PM
Subject: Re: Scraping an Entire Website

Really, all I want is the images, so I'm okay if some dynamic text is lost. I know that this client has been working with their provider and they haven't been able to recover the password. As a result, we've had to set up a new domain name and a whole new site. He just doesn't want to lose the images on the old site, and he wants to incorporate them into the new design.

Jeff

On Feb 22, 2015, at 1:41 PM, BobH. long.c...@virgin.net wrote:
Not sure how to do that, but I have to advise that you might not get an accurate copy if there is dynamic code from .php or .asp and other script-built pages. Doesn't the service provider or host have a lost-password recovery process? My hosting uses a cPanel interface, and I know there is one in that.

RobH.

- Original Message -
From: Jeff Berwick mailingli...@berwick.name
To: macvisionaries@googlegroups.com
Sent: Sunday, February 22, 2015 6:01 PM
Subject: Scraping an Entire Website

Hi there,

I have a customer who has lost their password to their old website, so I am building them a new one. Are there any accessible programs out there to pull down an entire website, pictures and all?
Thx,
Jeff
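For what it's worth, the LWP scripts fetch one URL at a time rather than crawling, so they only help when you already know each file's address. A dry-run sketch (the URL is a placeholder, not from the thread):

```shell
# Dry-run sketch: lwp-download saves a single known URL to a local file.
# It does not recurse, so it can't archive a whole site by itself.
# The URL below is a placeholder.
URL="http://www.example.com/images/photo.jpg"
CMD="lwp-download $URL"
echo "$CMD"  # print only; paste the printed line to fetch the file
```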
Re: Scraping an Entire Website
What you want is DeepVacuum (http://www.hexcat.com/deepvacuum/index.html). It's a GUI version of wget that will do what you want.

On Sunday, 22 February 2015 10:01:13 UTC-8, jberwick wrote:
Hi there,

I have a customer who has lost their password to their old website, so I am building them a new one. Are there any accessible programs out there to pull down an entire website, pictures and all?

Thx,
Jeff
Re: Scraping an Entire Website
Yep, both httrack and wget are options for OS X, and they’re both command-line accessible. I’d choose httrack first, as that’s generally better at this sort of thing, but wget will work also if the site is not too complex and/or you just want the static files.