Hi Scott,

Nutch is a pretty good tool for that. It is a web crawler, which is used to discover the web graph of a domain or of the entire internet: pages, documents, images, and other web resources.
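To make "directory structure" concrete, here is a rough Python sketch of what you could do with a list of URLs once a crawler has discovered them. This is not Nutch code: the function names `build_directory_tree` and `format_tree` and the example.com URLs are made up for illustration, and it assumes you have already exported the crawled URLs as plain strings.

```python
from urllib.parse import urlparse


def build_directory_tree(urls):
    """Group URLs by host, then nest their path segments into dicts."""
    tree = {}
    for url in urls:
        parsed = urlparse(url)
        if not parsed.netloc:
            continue  # skip relative or malformed entries
        node = tree.setdefault(parsed.netloc, {})
        for segment in parsed.path.split("/"):
            if segment:
                node = node.setdefault(segment, {})
    return tree


def format_tree(tree, indent=0):
    """Return the nested dict as indented lines, sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(format_tree(tree[name], indent + 1))
    return lines


if __name__ == "__main__":
    # Hypothetical URLs standing in for a crawler's output.
    crawled = [
        "http://example.com/",
        "http://example.com/docs/intro.html",
        "http://example.com/docs/api/index.html",
        "http://example.com/images/logo.png",
    ]
    print("\n".join(format_tree(build_directory_tree(crawled))))
```

The sketch only looks at URL paths, so it shows the directory layout as the site exposes it in its links, not the link graph between pages.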
Nutch crawls, identifies URLs, fetches them, parses them, and indexes them for search. It can do so in a scalable fashion, growing with the size of what you are trying to discover.

Does that help?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Scott Lundgren <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, March 30, 2015 at 5:56 AM
To: "[email protected]" <[email protected]>
Subject: website structure discovery?

>If I want to crawl & learn the directory & information structure of a
>website is nutch a good tool for this problem?
>Would you recommend a different tool?
>
>Scott Lundgren
>Software Engineer
>(704) 973-7388
>[email protected]<mailto:[email protected]>
>
>QuietStream Financial, LLC<http://www.quietstreamfinancial.com>
>11121 Carmel Commons Boulevard | Suite 250
>Charlotte, North Carolina 28226
>
>Our Portfolio of Commercial Real Estate Solutions:
>• Commercial Defeasance<http://www.defeasewithease.com/> (Defease With Ease®)
>• Fairview Real Estate Solutions<http://www.fairviewres.com/>
>• Great River Mortgage Capital<http://www.greatrivermortgagecapital.com/>
>• Tax Credit Asset Management<http://www.tcamre.com/>
>• Radian Generation<http://www.radiangeneration.com/>
>• EntityKeeper<http://www.entitykeeper.com/>™
>• Crowd With Ease<http://www.crowdwithease.com>™
>• FullCapitalStack<http://www.fullcapitalstack.com>™
>• CrowdRabbit<http://www.crowdrabbit.com>™

