RE: Connections between pages,Solr schema, url filtering

Markus Jelsma Fri, 12 Feb 2016 09:08:50 -0800

Hello Tomasz, see inline.

Regards,
Markus
 
-----Original message-----
> From:Tomasz <[email protected]>
> Sent: Friday 12th February 2016 17:47
> To: [email protected]
> Subject: Connections between pages,Solr schema, url filtering
> 
> Hello,
> 
> I'm developing a project which focuses on connections (links) between given
> websites more than on a content they provide. I chose Nutch to crawl those
> websites and read a lot about the software, but still there are some
> questions/issues which I hope can be solved with your great help.
> 
> First of all I don't need to store/index a content and I only need to
> preserve links with anchors. What is fetcher.store.content settings for? Is
> it possible not to store a content of pages but extract and store only
> links with anchors and follow those links during crawling?


fetcher.store.content controls whether the raw fetched content is stored. You
can safely disable it for your use case. You can also delete the parseText
segment files; they contain the extracted text, which you don't need. There is
not yet a way to turn off writing them via config. Metadata and hyperlinks are
stored by default.
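
A minimal sketch of the override in conf/nutch-site.xml, assuming a Nutch 1.x
setup. The fetcher.parse property is an assumption added here, not something
stated above: as far as I know the separate parse step reads the stored
content, so with storage switched off the parsing (and link extraction) would
probably have to happen inside the fetcher.

```xml
<!-- conf/nutch-site.xml: overrides nutch-default.xml -->
<property>
  <name>fetcher.store.content</name>
  <!-- do not keep the raw fetched content in the segments -->
  <value>false</value>
</property>
<property>
  <name>fetcher.parse</name>
  <!-- assumption: parse while fetching, since no content is kept for a later parse step -->
  <value>true</value>
</property>
```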

> 
> In the end I would like to query Solr asking "where (on what websites) are
> the links pointing to abc.com" and get a result with a list of pages
> pointing to abc.com with a given anchor text. Is that possible?  If yes,
> how to prepare the schema?

Well, I'd just suggest running the link inverter; it does just that. Check out
the invertlinks and readlinkdb commands: invertlinks builds the linkdb, and
readlinkdb returns the list of inlinks for any given URL. No need for Solr
here. Nutch also has the index-anchor plugin, which requires the linkdb. It
does not index the hyperlinks themselves, only the anchor text of inlinks, but
it could be patched to do so.
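
As a rough illustration, something like `bin/nutch invertlinks crawl/linkdb
-dir crawl/segments` followed by `bin/nutch readlinkdb crawl/linkdb -dump
linkdb_dump` should produce a plain-text dump (the crawl/ and linkdb_dump paths
are hypothetical). The Python sketch below answers the "who links to abc.com,
and with what anchor" question from such a dump. The line layout it parses is
an assumption about the dump format, so check it against your own output first.

```python
import re
from collections import defaultdict

# Assumed dump layout (verify against your own readlinkdb output):
#   <target url>\tInlinks:
#    fromUrl: <source url> anchor: <anchor text>
INLINK_RE = re.compile(r"^\s*fromUrl:\s+(\S+)\s+anchor:\s?(.*)$")


def parse_linkdb_dump(lines):
    """Map each target URL to a list of (source_url, anchor_text) pairs."""
    inlinks = defaultdict(list)
    target = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = INLINK_RE.match(line)
        if match:
            # An inlink line belongs to the most recent target header.
            if target is not None:
                inlinks[target].append((match.group(1), match.group(2).strip()))
        elif line.endswith("Inlinks:"):
            # Header line: everything before "Inlinks:" is the target URL.
            target = line[: -len("Inlinks:")].strip()
    return dict(inlinks)


# Hypothetical sample data in the assumed format.
SAMPLE = """http://abc.com/\tInlinks:
 fromUrl: http://example.org/page1 anchor: ABC home
 fromUrl: http://example.net/links anchor: partner site

http://other.com/\tInlinks:
 fromUrl: http://abc.com/ anchor: other
"""

links = parse_linkdb_dump(SAMPLE.splitlines())
for source, anchor in links["http://abc.com/"]:
    print(source, "->", repr(anchor))
```

For a real crawl you would pass the lines of the dumped text files instead of
SAMPLE. If you only need one URL, `readlinkdb` with `-url` avoids the dump
altogether.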

> 
> Is the url filter applied every time I generate segment and on every
> segment already generated or just only on the newest one? Suppose I run
> generate/fetch/parse/update for a few times and after that I'm going to
> change url filtering (using regex-urlfilter.txt) - will it be applied to
> all the links in db resulting in some cases that links which didn't pass
> the filter earlier can be included at this time and be fetched in next go?

URL filters run at various stages, but always at fetch/par
> 
> Many thanks in advance,
> Tomasz
>
