Re: [MASSMAIL]Re: website structure discovery?

Jorge Luis Betancourt González Mon, 30 Mar 2015 12:36:07 -0700

What are you using regex-urlfilter.txt for? What is your goal: to crawl only 
the websites you are interested in, that is, without "leaving" your seed URLs? 
If a website's design changes but its URLs stay the same, this shouldn't be 
such a big deal.


By default Nutch doesn't index the link structure (inlinks & outlinks) of 
each page. You can use [1], which lets you store this information in 
Solr/ES, although it only works with Nutch 1.x. After that you can write a 
small application that generates what you want. For instance, I've used [1] 
and d3.js to create some simple graphs of the link structure of the crawled 
sites; this is not exactly what you want, but it can be a starting point. I 
think a sitemap generator shouldn't be too hard to build from the indexed 
inlinks & outlinks, or by pulling the data directly out of what Nutch stores.

[1] https://github.com/jorgelbg/links-extractor
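
As a minimal sketch of the "small application" idea above: the Python below 
turns per-page outlink records into a nodes/links structure of the kind 
commonly fed to d3.js force layouts. The record keys "url" and "outlinks" and 
the function name build_link_graph are assumptions made for illustration; 
they are not taken from [1], so adjust them to whatever your index returns.

```python
import json


def build_link_graph(records):
    """Turn per-page outlink records into a nodes/links structure.

    `records` is an iterable of dicts with the hypothetical keys "url" and
    "outlinks"; rename them to match your own index schema.
    """
    index = {}  # URL -> position in the nodes list
    nodes = []
    links = []

    def node_id(url):
        # Register a URL the first time it is seen, as a source or a target.
        if url not in index:
            index[url] = len(nodes)
            nodes.append({"id": url, "inlinks": 0})
        return index[url]

    for record in records:
        source_url = record["url"]
        node_id(source_url)
        # Drop repeated links on the same page while preserving order.
        for target_url in dict.fromkeys(record.get("outlinks", [])):
            target = node_id(target_url)
            nodes[target]["inlinks"] += 1
            links.append({"source": source_url, "target": target_url})
    return {"nodes": nodes, "links": links}


sample = [
    {"url": "http://example.com/",
     "outlinks": ["http://example.com/a", "http://example.com/b"]},
    {"url": "http://example.com/a",
     "outlinks": ["http://example.com/b", "http://example.com/b"]},
]
graph = build_link_graph(sample)
print(json.dumps(graph, indent=2))
```

The inlink count per node is derived from the outlinks alone, so it only 
reflects pages that were actually crawled.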

----- Original Message -----
From: "Scott Lundgren" <[email protected]>
To: [email protected]
Sent: Monday, March 30, 2015 12:48:40 PM
Subject: [MASSMAIL]Re: website structure discovery?

Sorta. I’m using Nutch to crawl and index very specific areas of content on a 
test website, which has resulted in a highly crafted regex-urlfilter.txt file. 
The downside is that the process is brittle: a website redesign breaks the 
setup. It’s also a slow process that I have to repeat for each site, and 
eventually I want to be crawling & indexing several hundred specific sites. So 
I need a way to index and “onboard” a new site in an automated way.
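
One possible way to automate part of that onboarding is to generate the 
filter rules from the seed URLs instead of writing them by hand. The sketch 
below is only an illustration: it assumes the Nutch regex URL filter format 
(one rule per line, "+" to accept, "-" to reject, first match wins) and 
assumes that "specific area" means the directory of the seed URL. The 
function name filter_rules is hypothetical.

```python
import re
from urllib.parse import urlsplit


def filter_rules(seed_urls):
    """Return regex URL filter lines that keep a crawl under the seed paths."""
    rules = []
    seen = set()
    for seed in seed_urls:
        parts = urlsplit(seed)
        host = parts.hostname
        if not host:
            continue  # skip seeds that have no hostname
        # Use the seed's directory as the allowed path prefix ("/" if none).
        prefix = parts.path.rsplit("/", 1)[0] + "/"
        rule = "+^https?://" + re.escape(host) + re.escape(prefix)
        if rule not in seen:
            seen.add(rule)
            rules.append(rule)
    rules.append("-.")  # reject every URL that matched no accept rule
    return rules


seeds = ["http://www.example.com/docs/index.html", "https://blog.example.org/"]
for line in filter_rules(seeds):
    print(line)
```

Rules generated this way depend only on host and path prefix, so they survive 
a redesign as long as the URLs themselves stay put.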

So I’m wondering whether Nutch is the best spider/tool to run through an 
entire site and produce a visual graph or text representation of the site’s 
directory/URL structure when a sitemap file is not available.
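
To make the "text representation" concrete, here is a minimal sketch that 
folds a flat list of crawled URLs into an indented directory tree. It assumes 
the URL list has already been exported from the crawl by some other means; 
the names build_tree, render, and crawled are illustrative only.

```python
from urllib.parse import urlsplit


def build_tree(urls):
    """Nest URL path segments into dicts: {host: {segment: {...}}}."""
    tree = {}
    for url in urls:
        parts = urlsplit(url)
        node = tree.setdefault(parts.netloc, {})
        # Empty segments (leading or trailing slashes) are skipped.
        for segment in filter(None, parts.path.split("/")):
            node = node.setdefault(segment, {})
    return tree


def render(tree, depth=0):
    """Return the tree as indented text lines, children sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * depth + name)
        lines.extend(render(tree[name], depth + 1))
    return lines


crawled = [
    "http://example.com/",
    "http://example.com/products/widgets/blue.html",
    "http://example.com/products/widgets/red.html",
    "http://example.com/about/team.html",
]
print("\n".join(render(build_tree(crawled))))
```

Query strings and fragments are ignored here, so pages that differ only by 
parameters collapse into a single entry.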

Scott Lundgren
Software Engineer
(704) 973-7388
[email protected]

QuietStream Financial, LLC<http://www.quietstreamfinancial.com>
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226

Our Portfolio of Commercial Real Estate Solutions:
•        Commercial Defeasance<http://www.defeasewithease.com/> (Defease With Ease®)
•        Fairview Real Estate Solutions<http://www.fairviewres.com/>
•        Great River Mortgage Capital<http://www.greatrivermortgagecapital.com/>
•        Tax Credit Asset Management<http://www.tcamre.com/>
•        Radian Generation<http://www.radiangeneration.com/>
•        EntityKeeper<http://www.entitykeeper.com/>™
•        Crowd With Ease<http://www.crowdwithease.com>™
•        FullCapitalStack<http://www.fullcapitalstack.com>™
•        CrowdRabbit<http://www.crowdrabbit.com>™

On Mar 30, 2015, at 10:28 AM, Mattmann, Chris A (3980) 
<[email protected]> wrote:

Hi Scott,

It’s a pretty good tool for that - it is a Web Crawler, which
is used to discover the web graph of a domain or of the entire
internet - from pages, to documents, to images, to other web
resources.

Nutch crawls, identifies URLs, fetches them, parses them, and
indexes them for search. It can do so in a scalable fashion,
growing with the size of what you are trying to discover.

Does that help?

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Scott Lundgren <[email protected]>
Reply-To: [email protected]
Date: Monday, March 30, 2015 at 5:56 AM
To: [email protected]
Subject: website structure discovery?

If I want to crawl & learn the directory & information structure of a
website, is Nutch a good tool for this problem?
Would you recommend a different tool?

Scott Lundgren
