Yep, and to add to this: Sujen, Asitang, and several of my folks at USC and JPL are building a Nutch 1.x REST API with the specific intention of showing real-time, D3-based visualizations of the crawls.
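For illustration only, here is a minimal sketch of the kind of node-link data a D3 graph view could consume. It is not the REST API mentioned above and makes no claim about its design. The function name `to_node_link` and the sample link pairs are hypothetical, and the sketch assumes the link pairs have already been exported from somewhere (for example, an index holding outlinks).

```python
import json

def to_node_link(edges):
    """Convert (source_url, target_url) pairs into a {nodes, links} dict."""
    seen = set()   # URLs already added as nodes
    nodes = []
    links = []
    for source, target in edges:
        for url in (source, target):
            if url not in seen:
                seen.add(url)
                nodes.append({"id": url})
        # Links refer to nodes by their "id" value (the URL string).
        links.append({"source": source, "target": target})
    return {"nodes": nodes, "links": links}

# Hypothetical outlink pairs; real data would come from the crawl.
edges = [
    ("http://example.com/", "http://example.com/about"),
    ("http://example.com/", "http://example.com/blog/"),
    ("http://example.com/blog/", "http://example.com/blog/post-1"),
]
graph = to_node_link(edges)
print(json.dumps(graph, indent=2))
```

The resulting JSON has one node per distinct URL and one link per pair, which a force-directed layout can resolve by node id.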
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Jorge Luis Betancourt González <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, March 30, 2015 at 12:32 PM
To: "[email protected]" <[email protected]>
Subject: Re: [MASSMAIL]Re: website structure discovery?

>What are you using regex-urlfilter.txt for? What is your goal? To crawl
>only the websites of interest, meaning not "leaving" your seed URLs? If
>the website design changes, as long as the URLs stay the same this
>shouldn't be such a big deal.
>
>By default Nutch doesn't index the link structure (inlinks & outlinks)
>of each page. You can use [1], which will allow you to store this
>information in Solr/ES, although this only works for Nutch 1.x. After
>this you can write a small application that generates what you want.
>For instance, I've used [1] and d3.js to create some simple graphs of
>the link structure of the crawled sites; this is not exactly what you
>want but can be a starting point. I think that a sitemap generator
>shouldn't be too hard to create from the indexed inlinks & outlinks, or
>by pulling the data directly out of Nutch's stored info.
>
>[1] https://github.com/jorgelbg/links-extractor
>
>----- Original Message -----
>From: "Scott Lundgren" <[email protected]>
>To: [email protected]
>Sent: Monday, March 30, 2015 12:48:40 PM
>Subject: [MASSMAIL]Re: website structure discovery?
>
>Sorta.
>I’m using Nutch to crawl and index very specific areas of content on a
>test website, resulting in a highly crafted regex-urlfilter.txt file.
>The downside is that the process is brittle: a website redesign breaks
>the setup. It’s also a slow process that I have to repeat for each
>site, and eventually I want to be crawling & indexing several hundred
>specific sites. So I need a way to index and “onboard” a new site in an
>automated way.
>
>So I’m wondering if Nutch is the best spider/tool to run through an
>entire site and produce a visual graph or text representation of the
>site’s directory/URL structure when a sitemap file is not available.
>
>Scott Lundgren
>Software Engineer
>(704) 973-7388
>[email protected]
>
>QuietStream Financial, LLC <http://www.quietstreamfinancial.com>
>11121 Carmel Commons Boulevard | Suite 250
>Charlotte, North Carolina 28226
>
>Our Portfolio of Commercial Real Estate Solutions:
>• Commercial Defeasance <http://www.defeasewithease.com/> (Defease With Ease®)
>• Fairview Real Estate Solutions <http://www.fairviewres.com/>
>• Great River Mortgage Capital <http://www.greatrivermortgagecapital.com/>
>• Tax Credit Asset Management <http://www.tcamre.com/>
>• Radian Generation <http://www.radiangeneration.com/>
>• EntityKeeper™ <http://www.entitykeeper.com/>
>• Crowd With Ease™ <http://www.crowdwithease.com>
>• FullCapitalStack™ <http://www.fullcapitalstack.com>
>• CrowdRabbit™ <http://www.crowdrabbit.com>
>
>On Mar 30, 2015, at 10:28 AM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>Hi Scott,
>
>It’s a pretty good tool for that - it is a Web Crawler, which
>is used to discover the web graph of a domain or of the entire
>internet - from pages, to documents, to images, to other web
>resources.
>
>Nutch crawls, identifies URLs, fetches them, parses them, and
>indexes them for search.
>It can do so in a scalable fashion to
>grow with the size of what you are trying to discover.
>
>Does that help?
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: Scott Lundgren <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Monday, March 30, 2015 at 5:56 AM
>To: "[email protected]" <[email protected]>
>Subject: website structure discovery?
>
>If I want to crawl & learn the directory & information structure of a
>website, is Nutch a good tool for this problem?
>Would you recommend a different tool?
>
>Scott Lundgren
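
As a rough illustration of the "text representation of the site's directory/URL structure" asked about in this thread, the sketch below nests URL paths into a tree and prints an indented outline. It assumes the list of crawled URLs has already been exported from the crawl by some other means; the function names `build_tree` and `render` and the sample URLs are hypothetical and are not part of Nutch.

```python
from urllib.parse import urlparse

def build_tree(urls):
    """Nest URL path segments into a dict of dicts, keyed by host first."""
    tree = {}
    for url in urls:
        parsed = urlparse(url)
        node = tree.setdefault(parsed.netloc, {})
        for segment in parsed.path.split("/"):
            if segment:  # skip empty pieces from leading/trailing slashes
                node = node.setdefault(segment, {})
    return tree

def render(tree, depth=0):
    """Return the tree as indented text lines, children sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * depth + name)
        lines.extend(render(tree[name], depth + 1))
    return lines

# Hypothetical crawled URLs; real input would come from the crawl data.
urls = [
    "http://example.com/",
    "http://example.com/about",
    "http://example.com/blog/post-1",
    "http://example.com/blog/post-2",
]
outline = render(build_tree(urls))
print("\n".join(outline))
```

This only reflects the URL paths that were actually fetched, so it approximates a site's directory layout rather than reproducing a real sitemap.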

