Hi Scott,

Nutch is a pretty good tool for that. It is a web crawler, used to
discover the web graph of a single domain or of the entire
Internet: pages, documents, images, and other web resources.

Nutch crawls, identifies URLs, fetches them, parses them, and
indexes them for search. It can do so in a scalable fashion,
growing with the size of what you are trying to discover.
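
To make that cycle concrete, here is a rough sketch of the same loop
(discover URLs, fetch, parse, index) in plain Python. It is not Nutch
code and does not use any Nutch API; the PAGES dictionary, the
LinkAndTextParser class and the crawl() function are all made-up names
for illustration, and the "fetch" is just a dictionary lookup standing
in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site" standing in for real HTTP fetches.
PAGES = {
    "http://example.com/": '<a href="/docs/">Docs</a> <a href="/about.html">About</a>',
    "http://example.com/docs/": '<a href="guide.html">Guide</a> <a href="http://other.org/">Ext</a>',
    "http://example.com/docs/guide.html": "<p>crawler guide</p>",
    "http://example.com/about.html": '<a href="/">Home</a> about us',
}


class LinkAndTextParser(HTMLParser):
    """Collects link targets and lowercase words from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(seed, fetch=PAGES.get):
    """Breadth-first crawl that stays on the seed's host.

    Returns the set of discovered URLs and a word -> URLs index.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])
    discovered = {seed}
    index = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)              # fetch step (assumed: None if missing)
        if html is None:
            continue
        parser = LinkAndTextParser()   # parse step
        parser.feed(html)
        parser.close()
        for word in parser.words:      # index step
            index.setdefault(word, set()).add(url)
        for href in parser.links:      # URL discovery step
            target = urljoin(url, href)
            if urlparse(target).netloc == host and target not in discovered:
                discovered.add(target)
                queue.append(target)
    return discovered, index


visited, index = crawl("http://example.com/")
```

Sorting the discovered URLs (sorted(visited)) gives a listing from
which the directory layout of the site can be read off, which is
roughly the kind of structure discovery you asked about.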

Does that help?

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Scott Lundgren <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, March 30, 2015 at 5:56 AM
To: "[email protected]" <[email protected]>
Subject: website structure discovery?

>If I want to crawl & learn the directory & information structure of a
>website is nutch a good tool for this problem?
>Would you recommend a different tool?
>
>Scott Lundgren
>Software Engineer
>(704) 973-7388
>[email protected]<mailto:[email protected]>
>
>QuietStream Financial, LLC<http://www.quietstreamfinancial.com>
>11121 Carmel Commons Boulevard | Suite 250
>Charlotte, North Carolina 28226
>
>Our Portfolio of Commercial Real Estate Solutions:
>•        <http://www.defeasewithease.com> Commercial
>Defeasance<http://www.defeasewithease.com/> (Defease With Ease®)
>•        Fairview Real Estate Solutions<http://www.fairviewres.com/>
>•        Great River Mortgage
>Capital<http://www.greatrivermortgagecapital.com/>
>•        Tax Credit Asset Management<http://www.tcamre.com/>
>•        Radian Generation<http://www.radiangeneration.com/>
>•        EntityKeeper<http://www.entitykeeper.com/>™
>•        Crowd With Ease<http://www.crowdwithease.com>™
>•        FullCapitalStack<http://www.fullcapitalstack.com>™
>•        CrowdRabbit<http://www.crowdrabbit.com>™
>
