Inline response:

----- Original Message -----
From: "Scott Lundgren" <[email protected]>
To: [email protected]
Sent: Monday, March 30, 2015 5:14:31 PM
Subject: Re: website structure discovery?

I’m using regex-urlfilter.txt not only to keep Nutch from leaving a site 
that’s in seed.txt but also to keep Nutch very focused on the URLs within the 
seed I want it to crawl. For example, seed.txt contains 
http://bizjournals.com/triangle; I want to crawl 
http://www.bizjournals.com/triangle/news but not 
http://www.bizjournals.com/triangle/jobs/, 
http://www.bizjournals.com/triangle/calendar/, or 
http://www.bizjournals.com/triangle/people/
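
The rules for that example look roughly like this (simplified; in 
regex-urlfilter.txt the first matching pattern wins, so order matters):

  # skip the sections we don't want
  -^https?://(www\.)?bizjournals\.com/triangle/(jobs|calendar|people)/
  # keep the news section and the seed page itself
  +^https?://(www\.)?bizjournals\.com/triangle/news
  +^https?://(www\.)?bizjournals\.com/triangle/?$
  # reject everything else
  -.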

I understand your use case, but the question here is whether you have some 
heuristic or deterministic way of defining what you want or don't want to 
crawl. If you do, perhaps you can implement that logic in a custom plugin 
that decides whether a URL is valid for crawling or not. For the examples 
you've provided I don't see any way other than manual inspection, which is 
indeed time consuming.
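
For instance, a minimal Nutch 1.x URLFilter plugin could look something like 
the sketch below. The section lists are hypothetical placeholders; a real 
plugin would read them from configuration, and it still needs the usual 
plugin.xml descriptor plus an entry in plugin.includes to be picked up:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  // Sketch of a custom URLFilter: keep URLs under wanted sections,
  // drop known noise sections, reject everything else.
  public class SectionURLFilter implements URLFilter {

    private Configuration conf;

    // Hypothetical placeholders; a real plugin would load these per site.
    private static final String[] UNWANTED =
        { "/triangle/jobs/", "/triangle/calendar/", "/triangle/people/" };
    private static final String[] WANTED = { "/triangle/news" };

    @Override
    public String filter(String urlString) {
      for (String bad : UNWANTED) {
        if (urlString.contains(bad)) {
          return null;              // returning null drops the URL
        }
      }
      for (String good : WANTED) {
        if (urlString.contains(good)) {
          return urlString;         // returning the URL keeps it
        }
      }
      return null;                  // default: reject unknown sections
    }

    @Override
    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    @Override
    public Configuration getConf() {
      return conf;
    }
  }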

Figuring out these regexes involves me mousing over links of a site in the 
Chrome browser and in a text-only browser. It’s a little time consuming, and 
I have 200+ sites to set up. I’ll try standing up a separate instance of 
Nutch plus the link-extractor and D3.js solution.

This plugin should give you a simpler view of your site's structure (with the 
help of d3.js), but then you'll need two passes over your seed URLs: one to 
populate Solr/ES with the link structure of each page and another to crawl 
the actual URLs of interest.
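
With a Nutch 1.x bin/crawl script the two passes would look roughly like 
this (seed dir, crawl dirs, Solr URLs, and round counts are placeholders):

  # Pass 1: shallow crawl just to map the link structure into Solr,
  # with the links-extractor plugin enabled in plugin.includes
  bin/crawl urls/ crawl-discovery/ http://localhost:8983/solr/discovery 2

  # Pass 2: the real crawl, restricted by the regex-urlfilter.txt rules
  # you derive from the structure that pass 1 revealed
  bin/crawl urls/ crawl-content/ http://localhost:8983/solr/content 5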

Scott Lundgren
Software Engineer
(704) 973-7388
[email protected]

QuietStream Financial, LLC<http://www.quietstreamfinancial.com>
11121 Carmel Commons Boulevard | Suite 250
Charlotte, North Carolina 28226


On Mar 30, 2015, at 3:32 PM, Jorge Luis Betancourt González 
<[email protected]> wrote:

What are you using regex-urlfilter.txt for? What is your goal? To crawl only 
the websites of your interest, meaning not "leaving" your seed URLs? If the 
website design changes, as long as the URLs stay the same this shouldn't be 
such a big deal.

By default Nutch doesn't index the link structure (inlinks & outlinks) of 
each page. You can use [1], which will allow you to store this information in 
Solr/ES, although it only works for Nutch 1.x. After that you can write a 
small application that generates what you want; for instance, I've used [1] 
and d3.js to create some simple graphs of the link structure of the crawled 
sites. This is not exactly what you want, but it can be a starting point. I 
think a sitemap generator shouldn't be too hard to create from the indexed 
inlinks & outlinks, or by pulling the data directly out of Nutch's stored 
info.

[1] https://github.com/jorgelbg/links-extractor
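
If you'd rather pull the data directly out of Nutch's stored info, the 
linkdb can be dumped to plain text after inverting the links (paths here are 
placeholders):

  # build the inverted link database from the crawl segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments

  # dump it as text: each target URL listed with its inlinks
  bin/nutch readlinkdb crawl/linkdb -dump linkdump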

----- Original Message -----
From: "Scott Lundgren" <[email protected]<mailto:[email protected]>>
To: [email protected]<mailto:[email protected]>
Sent: Monday, March 30, 2015 12:48:40 PM
Subject: Re: website structure discovery?

Sorta. I’m using Nutch to crawl and index very specific areas of content on a 
test website, resulting in a highly crafted regex-urlfilter.txt file. The 
downside is that the process is brittle: a website redesign breaks the setup. 
It’s also a slow process that I have to repeat for each site, and eventually 
I want to be crawling & indexing several hundred specific sites. So I need a 
way to index and “onboard” a new site in an automated way.

So I’m wondering whether Nutch is the best spider/tool to run through an 
entire site and output a visual graph or text representation of the site’s 
directory/URL structure when a sitemap file is not available.

Scott Lundgren

On Mar 30, 2015, at 10:28 AM, Mattmann, Chris A (3980) 
<[email protected]> wrote:

Hi Scott,

It’s a pretty good tool for that - it is a Web Crawler, which
is used to discover the web graph of a domain or of the entire
internet - from pages, to documents, to images, to other web
resources.

Nutch crawls, identifies URLs, fetches them, parses them, and
indexes them for search. It can do this in a scalable fashion to
grow with the size of what you are trying to discover.

Does that help?

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Scott Lundgren <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, March 30, 2015 at 5:56 AM
To: "[email protected]" <[email protected]>
Subject: website structure discovery?

If I want to crawl & learn the directory & information structure of a
website, is Nutch a good tool for this problem?
Would you recommend a different tool?

Scott Lundgren


