Sorta. I’m using Nutch to crawl and index very specific areas of content on a test website, which has resulted in a highly crafted regex-urlfilter.txt file. The downside is that the process is brittle: a website redesign breaks the setup. It’s also a slow process that I have to repeat for each site, and eventually I want to be crawling and indexing several hundred specific sites. So I need a way to index and “onboard” a new site in an automated way.
So I’m wondering if Nutch is the best spider/tool to run through an entire site and produce a visual graph or text representation of the site’s directory/URL structure when a sitemap file is not available.

Scott Lundgren
Software Engineer
(704) 973-7388
[email protected]
QuietStream Financial, LLC <http://www.quietstreamfinancial.com>
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226

Our Portfolio of Commercial Real Estate Solutions:
• Commercial Defeasance <http://www.defeasewithease.com/> (Defease With Ease®)
• Fairview Real Estate Solutions <http://www.fairviewres.com/>
• Great River Mortgage Capital <http://www.greatrivermortgagecapital.com/>
• Tax Credit Asset Management <http://www.tcamre.com/>
• Radian Generation <http://www.radiangeneration.com/>
• EntityKeeper™ <http://www.entitykeeper.com/>
• Crowd With Ease™ <http://www.crowdwithease.com>
• FullCapitalStack™ <http://www.fullcapitalstack.com>
• CrowdRabbit™ <http://www.crowdrabbit.com>

On Mar 30, 2015, at 10:28 AM, Mattmann, Chris A (3980) <[email protected]> wrote:

Hi Scott,

It’s a pretty good tool for that - it is a Web Crawler, which is used to discover the web graph of a domain or of the entire internet - from pages, to documents, to images, to other web resources. Nutch crawls, identifies URLs, fetches them, parses them, and indexes them for search. It can do so in a scalable fashion, growing with the size of what you are trying to discover.

Does that help?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Scott Lundgren <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, March 30, 2015 at 5:56 AM
To: "[email protected]" <[email protected]>
Subject: website structure discovery?

If I want to crawl & learn the directory & information structure of a website is nutch a good tool for this problem? Would you recommend a different tool?

Scott Lundgren
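
For the "text representation of the site's directory/URL structure" asked about in this thread, here is a minimal Python sketch. It assumes the crawler's discovered URLs have already been exported to a plain list; how that list is obtained is not covered here. The function names `build_url_tree` and `render_tree` and the `www.example.com` URLs are hypothetical, chosen only for illustration, and none of this is part of Nutch.

```python
from urllib.parse import urlsplit


def build_url_tree(urls):
    """Group URLs into a nested dict keyed by host, then by path segment."""
    tree = {}
    for url in urls:
        parts = urlsplit(url)
        node = tree.setdefault(parts.netloc, {})
        # Skip empty segments produced by leading or trailing slashes.
        for segment in filter(None, parts.path.split("/")):
            node = node.setdefault(segment, {})
    return tree


def render_tree(tree, indent=0):
    """Return the nested dict as indented lines, siblings sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render_tree(tree[name], indent + 1))
    return lines


if __name__ == "__main__":
    # Hypothetical URLs standing in for a crawler's output.
    sample = [
        "http://www.example.com/news/2015/a.html",
        "http://www.example.com/news/2015/b.html",
        "http://www.example.com/about/",
    ]
    print("\n".join(render_tree(build_url_tree(sample))))
```

Query strings and fragments are ignored in this sketch, so two URLs that differ only in their parameters collapse into one node. That may or may not be what is wanted when onboarding a new site.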

