Re: [MASSMAIL]Re: website structure discovery?

Mattmann, Chris A (3980) Mon, 30 Mar 2015 12:43:08 -0700

Yep, and to add to this: Sujen, Asitang, and several of my folks
at USC and JPL are building a REST API for Nutch 1.x, with the specific
intention of showing real-time, D3-based visualizations of the crawls.


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Jorge Luis Betancourt González <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, March 30, 2015 at 12:32 PM
To: "[email protected]" <[email protected]>
Subject: Re: [MASSMAIL]Re: website structure discovery?

>What are you using regex-urlfilter.txt for? What is your goal: to crawl
>only the websites you are interested in, meaning not "leaving" your seed
>URLs? If the website design changes, as long as the URLs stay the same
>this shouldn't be such a big deal.
>
>By default Nutch doesn't index the link structure (inlinks & outlinks)
>of each page. You can use [1], which lets you store this information in
>Solr/ES, although it only works with Nutch 1.x. After that you can write
>a small application that generates what you want. For instance, I've
>used [1] and d3.js to create some simple graphs of the link structure of
>the crawled sites; this is not exactly what you want, but it can be a
>starting point. I think a sitemap generator shouldn't be too hard to
>build from the indexed inlinks & outlinks, or by pulling the data
>directly out of the info Nutch stores.
>
>[1] https://github.com/jorgelbg/links-extractor
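
A minimal sketch of the "small application" step described above, assuming
the inlink/outlink data has already been read back from the index into a
plain mapping of page URL to outlink URLs. The mapping, the function name
build_link_graph, and the sample URLs are hypothetical; the field names
actually written by [1] are not shown in this thread. The output uses the
nodes/links layout commonly fed to d3.js force-directed graphs.

```python
import json


def build_link_graph(outlinks_by_page):
    """Turn {page_url: [outlink_url, ...]} into a nodes/links dict."""
    nodes = {}    # url -> position, kept in first-seen order
    links = []
    seen = set()  # (source, target) pairs already emitted
    for page, outlinks in outlinks_by_page.items():
        nodes.setdefault(page, len(nodes))
        for target in outlinks:
            nodes.setdefault(target, len(nodes))
            edge = (page, target)
            # Skip self-links and duplicate edges.
            if page != target and edge not in seen:
                seen.add(edge)
                links.append({"source": page, "target": target})
    return {"nodes": [{"id": url} for url in nodes], "links": links}


# Hypothetical sample data standing in for indexed outlinks.
sample = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
}
graph = build_link_graph(sample)
print(json.dumps(graph, indent=2))
```

The resulting JSON can be loaded by a d3.js page; how the mapping is
fetched from Solr/ES is left out because it depends on the index schema.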
>
>----- Original Message -----
>From: "Scott Lundgren" <[email protected]>
>To: [email protected]
>Sent: Monday, March 30, 2015 12:48:40 PM
>Subject: [MASSMAIL]Re: website structure discovery?
>
>Sorta. I’m using Nutch to crawl and index very specific areas of content
>on a test website, resulting in a highly crafted regex-urlfilter.txt file.
>The downside is that the process is brittle: a website redesign breaks
>the setup. It’s also a slow process that I have to repeat for each site,
>and eventually I want to be crawling & indexing several hundred specific
>sites. So I need a way to index and “onboard” a new site in an automated
>way.
>
>So I’m wondering if Nutch is the best spider/tool to run through an
>entire site and produce a visual graph or text representation of the
>site’s directory/URL structure when a sitemap file
>is not available.
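
For the text representation asked about above, one possible sketch,
assuming the crawled URLs have already been exported to a plain list. The
function names build_url_tree and render_tree and the sample URLs are
hypothetical and are not part of Nutch. It groups URLs by host and path
segment and prints an indented outline of the directory structure.

```python
from urllib.parse import urlparse


def build_url_tree(urls):
    """Nest URLs by host, then by each non-empty path segment."""
    tree = {}
    for url in urls:
        parsed = urlparse(url)
        node = tree.setdefault(parsed.netloc, {})
        for segment in parsed.path.split("/"):
            if segment:
                node = node.setdefault(segment, {})
    return tree


def render_tree(tree, indent=0):
    """Return the tree as indented lines, siblings sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render_tree(tree[name], indent + 1))
    return lines


# Hypothetical sample of crawled URLs.
crawled = [
    "http://example.com/products/widgets/blue.html",
    "http://example.com/products/gadgets/",
    "http://example.com/about/team.html",
]
outline = render_tree(build_url_tree(crawled))
print("\n".join(outline))
```

Query strings and fragments are ignored here; whether they should count
as part of the structure depends on the site.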
>
>Scott Lundgren
>Software Engineer
>(704) 973-7388
>[email protected]<mailto:[email protected]>
>
>QuietStream Financial, LLC<http://www.quietstreamfinancial.com>
>11121 Carmel Commons Boulevard | Suite 250
>Charlotte, North Carolina 28226
>
>Our Portfolio of Commercial Real Estate Solutions:
>•        Commercial Defeasance<http://www.defeasewithease.com/> (Defease With Ease®)
>•        Fairview Real Estate Solutions<http://www.fairviewres.com/>
>•        Great River Mortgage Capital<http://www.greatrivermortgagecapital.com/>
>•        Tax Credit Asset Management<http://www.tcamre.com/>
>•        Radian Generation<http://www.radiangeneration.com/>
>•        EntityKeeper<http://www.entitykeeper.com/>™
>•        Crowd With Ease<http://www.crowdwithease.com>™
>•        FullCapitalStack<http://www.fullcapitalstack.com>™
>•        CrowdRabbit<http://www.crowdrabbit.com>™
>
>On Mar 30, 2015, at 10:28 AM, Mattmann, Chris A (3980)
><[email protected]<mailto:[email protected]>>
>wrote:
>
>Hi Scott,
>
>It’s a pretty good tool for that - it is a Web Crawler, which
>is used to discover the web graph of a domain or of the entire
>internet - from pages, to documents, to images, to other web
>resources.
>
>Nutch crawls, identifies URLs, fetches them, parses them, and
>indexes them for search. It can do so in a scalable fashion to
>grow with the size of what you are trying to discover.
>
>Does that help?
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: Scott Lundgren <[email protected]<mailto:[email protected]>>
>Reply-To: "[email protected]<mailto:[email protected]>"
><[email protected]<mailto:[email protected]>>
>Date: Monday, March 30, 2015 at 5:56 AM
>To: "[email protected]<mailto:[email protected]>"
><[email protected]<mailto:[email protected]>>
>Subject: website structure discovery?
>
>If I want to crawl & learn the directory & information structure of a
>website, is Nutch a good tool for this problem?
>Would you recommend a different tool?
>
>Scott Lundgren
>Software Engineer
>(704) 973-7388
>[email protected]<mailto:[email protected]>
