Hello,

I'm developing a project that focuses on the connections (links) between
given websites rather than on the content they provide. I chose Nutch to
crawl those websites and have read a lot about the software, but there are
still some questions/issues which I hope can be solved with your great help.

First of all, I don't need to store/index page content; I only need to
preserve links with their anchors. What is the fetcher.store.content setting
for? Is it possible not to store the content of pages, but to extract and
store only the links with anchors, and to follow those links during
crawling?
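For reference, this is the change I'm experimenting with in conf/nutch-site.xml (assuming that setting fetcher.store.content to false is the right way to skip storing raw content):

```xml
<!-- conf/nutch-site.xml (sketch) -->
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>My guess: if false, fetched page content is not
  stored in the segment, only crawl/parse data.</description>
</property>
```

Please correct me if this property means something else, or if parsing (and thus outlink extraction) still needs the content to be stored.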

In the end I would like to query Solr asking "where (on which websites) are
the links pointing to abc.com?" and get back a list of pages pointing to
abc.com, together with the given anchor text. Is that possible? If yes, how
should I prepare the schema?
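To make the question concrete, this is the kind of schema I imagine (the field names here are my guesses, not names I know Nutch's indexer to emit):

```xml
<!-- Hypothetical Solr schema.xml fields (sketch) -->
<field name="url"     type="string"       indexed="true" stored="true"/>
<field name="anchor"  type="text_general" indexed="true" stored="true"
       multiValued="true"/>
<field name="outlink" type="string"       indexed="true" stored="true"
       multiValued="true"/>
```

With something like this I would hope to query, say, q=outlink:"http://abc.com/" with fl=url,anchor to get the pages linking to abc.com and their anchor texts. Is that roughly how it should be set up?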

Is the URL filter applied every time I generate a segment, and to every
segment already generated, or only to the newest one? Suppose I run
generate/fetch/parse/update a few times and then change the URL filtering
(in regex-urlfilter.txt) - will it be applied to all the links in the db,
so that in some cases links which didn't pass the filter earlier can now be
included and fetched in the next round?
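For concreteness, the kind of change I mean in conf/regex-urlfilter.txt (a sketch; the patterns are just an example):

```
# conf/regex-urlfilter.txt (sketch)
# Newly allow a host that an earlier version of this file rejected:
+^https?://(www\.)?abc\.com/
# Reject everything else:
-.
```

My question is whether links to abc.com that were discovered (but filtered out) during earlier rounds will now be picked up from the crawldb by the next generate, or whether the old filter decision is permanent.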

Many thanks in advance,
Tomasz
