Sorta. I’m using Nutch to crawl and index very specific areas of content on a 
test website, which has resulted in a highly crafted regex-urlfilter.txt file. The 
downside is that the process is brittle: a website redesign breaks the setup. It’s 
also a slow process that I have to repeat for each site, and eventually I want to 
be crawling & indexing several hundred specific sites. So I need a way to 
index and “onboard” a new site in an automated way.
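
To illustrate what automated onboarding might look like, here is a minimal Python sketch that generates URL filter rules for one site from a host name and a list of path prefixes. It assumes the regex-urlfilter.txt conventions (one rule per line, "+" to accept, "-" to reject, first matching rule wins). The function name build_filter_rules and the example host are hypothetical, and the patterns are checked here with Python's regex engine only, so anything beyond simple hosts and paths should be verified against Nutch's Java regex handling.

```python
import re

def build_filter_rules(host, path_prefixes):
    """Return regex-urlfilter style rules that accept only the given
    path prefixes on `host` and reject everything else."""
    escaped_host = re.escape(host)
    rules = []
    for prefix in path_prefixes:
        # Normalise so every prefix starts and ends with a single slash.
        path = "/" + prefix.strip("/")
        if path != "/":
            path += "/"
        rules.append("+^https?://" + escaped_host + re.escape(path))
    # Catch-all reject rule: anything not accepted above is skipped.
    rules.append("-.")
    return rules

# Hypothetical site: keep only the news and products sections.
rules = build_filter_rules("www.example.com", ["/news/", "products"])
print("\n".join(rules))
```

Writing one such rule set per site would keep the per-site knowledge in a small list of prefixes rather than in hand-edited regexes, though it does not remove the need to revisit the prefixes after a redesign.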

So I’m wondering whether Nutch is the best spider/tool to run through an entire 
site and produce a visual graph or text representation of the site’s 
directory/URL structure when a sitemap file is not available.
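
As a rough sketch of the kind of text output I have in mind: given a flat list of URLs that a crawler has already discovered, the Python below folds them into a directory-style tree. The function names build_tree and render_tree and the sample URLs are made up for illustration, and how the URL list gets exported from the crawler is left out.

```python
from urllib.parse import urlsplit

def build_tree(urls):
    """Nest URL path segments into dicts keyed by host, then by segment."""
    tree = {}
    for url in urls:
        parts = urlsplit(url)
        node = tree.setdefault(parts.netloc, {})
        for segment in parts.path.split("/"):
            if segment:  # skip empty pieces from leading or doubled slashes
                node = node.setdefault(segment, {})
    return tree

def render_tree(tree, indent=0):
    """Return the tree as indented text lines, children sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render_tree(tree[name], indent + 1))
    return lines

# Hypothetical crawl output for a single site.
sample = [
    "http://www.example.com/",
    "http://www.example.com/news/2015/03/story.html",
    "http://www.example.com/news/archive/",
    "http://www.example.com/about/team.html",
]
print("\n".join(render_tree(build_tree(sample))))
```

Query strings and fragments are ignored here, so pages that differ only by parameters collapse into one node.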

Scott Lundgren
Software Engineer
(704) 973-7388
[email protected]

QuietStream Financial, LLC<http://www.quietstreamfinancial.com>
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226

Our Portfolio of Commercial Real Estate Solutions:
•        Commercial Defeasance<http://www.defeasewithease.com/> (Defease With Ease®)
•        Fairview Real Estate Solutions<http://www.fairviewres.com/>
•        Great River Mortgage Capital<http://www.greatrivermortgagecapital.com/>
•        Tax Credit Asset Management<http://www.tcamre.com/>
•        Radian Generation<http://www.radiangeneration.com/>
•        EntityKeeper<http://www.entitykeeper.com/>™
•        Crowd With Ease<http://www.crowdwithease.com>™
•        FullCapitalStack<http://www.fullcapitalstack.com>™
•        CrowdRabbit<http://www.crowdrabbit.com>™

On Mar 30, 2015, at 10:28 AM, Mattmann, Chris A (3980) 
<[email protected]> wrote:

Hi Scott,

It’s a pretty good tool for that - it is a Web Crawler, which
is used to discover the web graph of a domain or of the entire
internet - from pages, to documents, to images, to other web
resources.

Nutch crawls, identifies URLs, fetches them, parses them, and
indexes them for search. It can do so in a scalable fashion to
grow with the size of what you are trying to discover.
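
To make that cycle concrete, here is a toy breadth-first crawl in Python. It is not how Nutch is implemented; it only illustrates the discover, fetch, and parse loop on a single host. The names crawl and LinkParser, the injected fetch callable, and the in-memory pages dict (standing in for real HTTP requests) are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl restricted to the seed's host.

    `fetch(url)` returns the page's HTML as a string, or None if the page
    cannot be fetched. Returns {url: sorted list of same-host outlinks}.
    """
    host = urlsplit(seed).netloc
    graph = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(graph) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        outlinks = set()
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target = urljoin(url, href).split("#")[0]
            if urlsplit(target).netloc == host:
                outlinks.add(target)
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        graph[url] = sorted(outlinks)
    return graph

# In-memory stand-in for a small website.
pages = {
    "http://www.example.com/":
        '<a href="/news/">News</a> <a href="http://other.example.org/">Out</a>',
    "http://www.example.com/news/":
        '<a href="story.html">Story</a> <a href="/">Home</a>',
    "http://www.example.com/news/story.html": "<p>No links here.</p>",
}
graph = crawl("http://www.example.com/", pages.get)
for page, links in graph.items():
    print(page, "->", links)
```

The resulting dict is the web graph of the site: each key is a fetched page and each value is the list of same-host pages it links to.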

Does that help?

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Scott Lundgren <[email protected]>
Reply-To: [email protected]
Date: Monday, March 30, 2015 at 5:56 AM
To: [email protected]
Subject: website structure discovery?

If I want to crawl & learn the directory & information structure of a
website is nutch a good tool for this problem?
Would you recommend a different tool?

Scott Lundgren
Software Engineer
(704) 973-7388
[email protected]



