What are you using regex-urlfilter.txt for? What is your goal? To crawl only the websites you are interested in, i.e. without "leaving" your seed URLs? If the website design changes but the URLs stay the same, this shouldn't be such a big deal.
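If the goal is only to keep the crawl on your seed sites, the filter probably doesn't need to be hand-crafted per site. Here is a minimal Python sketch that derives accept rules from a list of seed URLs. It assumes the usual regex-urlfilter.txt syntax (one rule per line, a leading + or - followed by a Java regex, first matching rule wins). The function name filter_rules and the example.com seeds are made up for illustration, and seeds with explicit ports or without a scheme are not handled.

```python
import re
from urllib.parse import urlsplit


def filter_rules(seed_urls):
    """Build URL-filter lines that accept only the hosts of the seed URLs.

    Hypothetical helper, not part of Nutch: it only produces text in the
    '+regex' / '-regex' rule format.
    """
    hosts = set()
    for url in seed_urls:
        host = urlsplit(url).hostname  # lower-cased host, None if missing
        if host:
            hosts.add(host)

    rules = ["# accept URLs on the seed hosts"]
    for host in sorted(hosts):
        # re.escape keeps the dots in the host name from matching any character
        rules.append("+^https?://" + re.escape(host) + "/")
    rules.append("# reject everything else")
    rules.append("-.")
    return rules


if __name__ == "__main__":
    # Made-up seeds, for illustration only
    seeds = ["http://www.example.com/products/", "https://docs.example.org/"]
    print("\n".join(filter_rules(seeds)))
```

The printed lines are meant to take the place of the final catch-all rule in conf/regex-urlfilter.txt. Treat this as a starting point rather than a tested setup.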
By default Nutch doesn't index the link structure (inlinks and outlinks) of each page. You can use [1], which lets you store this information in Solr/ES, although it only works with Nutch 1.x. After that you can write a small application that generates what you want. For instance, I've used [1] and d3.js to create some simple graphs of the link structure of the crawled sites. This is not exactly what you want, but it can be a starting point. I think a sitemap generator shouldn't be too hard to create from the indexed inlinks and outlinks, or by pulling the data directly out of the information Nutch stores.

[1] https://github.com/jorgelbg/links-extractor

----- Original Message -----
From: "Scott Lundgren" <[email protected]>
To: [email protected]
Sent: Monday, March 30, 2015 12:48:40 PM
Subject: [MASSMAIL]Re: website structure discovery?

Sorta. I’m using Nutch to crawl and index very specific areas of content on a test website, resulting in a highly crafted url-regexfilter.txt file. The downside is that the process is brittle: a website redesign breaks the setup. It’s also a slow process that I have to repeat for each site, and eventually I want to be crawling and indexing several hundred specific sites. So I need a way to index and “onboard” a new site in an automated way.

So I’m wondering if Nutch is the best spider/tool to run through an entire site and produce a visual graph or text representation of the site’s directory/URL structure when a sitemap file is not available.

Scott Lundgren
Software Engineer
(704) 973-7388
[email protected]

QuietStream Financial, LLC <http://www.quietstreamfinancial.com>
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226

Our Portfolio of Commercial Real Estate Solutions:
• Commercial Defeasance <http://www.defeasewithease.com/> (Defease With Ease®)
• Fairview Real Estate Solutions <http://www.fairviewres.com/>
• Great River Mortgage Capital <http://www.greatrivermortgagecapital.com/>
• Tax Credit Asset Management <http://www.tcamre.com/>
• Radian Generation <http://www.radiangeneration.com/>
• EntityKeeper™ <http://www.entitykeeper.com/>
• Crowd With Ease™ <http://www.crowdwithease.com>
• FullCapitalStack™ <http://www.fullcapitalstack.com>
• CrowdRabbit™ <http://www.crowdrabbit.com>

On Mar 30, 2015, at 10:28 AM, Mattmann, Chris A (3980) <[email protected]> wrote:

Hi Scott,

It’s a pretty good tool for that - it is a Web Crawler, which is used to discover the web graph of a domain or of the entire internet - from pages, to documents, to images, to other web resources. Nutch crawls, identifies URLs, fetches them, parses them, and indexes them for search. It can do this in a scalable fashion, growing with the size of what you are trying to discover.

Does that help?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Scott Lundgren <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, March 30, 2015 at 5:56 AM
To: "[email protected]" <[email protected]>
Subject: website structure discovery?

If I want to crawl and learn the directory and information structure of a website, is Nutch a good tool for this problem? Would you recommend a different tool?

Scott Lundgren

