Hello, I'm developing a project that focuses on the connections (links) between given websites rather than on the content they provide. I chose Nutch to crawl those websites and have read a lot about the software, but there are still some questions/issues which I hope can be solved with your great help.
First of all, I don't need to store/index the content; I only need to preserve the links with their anchor texts. What is the fetcher.store.content setting for? Is it possible not to store the content of pages, but to extract and store only the links with anchors and still follow those links during crawling?

In the end I would like to query Solr asking "where (on what websites) are the links pointing to abc.com?" and get back a list of pages pointing to abc.com with a given anchor text. Is that possible? If yes, how should I prepare the schema?

Is the URL filter applied every time I generate a segment, and to every segment already generated, or only to the newest one? Suppose I run generate/fetch/parse/update a few times and then change the URL filtering (using regex-urlfilter.txt): will the new filter be applied to all the links in the db, so that links which didn't pass the filter earlier could be included this time and fetched in the next round?

Many thanks in advance,
Tomasz
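To make the first question concrete, this is how I currently assume fetcher.store.content would be disabled in conf/nutch-site.xml, based on the property name I found in nutch-default.xml. I am not sure this alone is enough to keep only links and anchors, so please correct me if this is the wrong approach:

```xml
<!-- conf/nutch-site.xml: my attempt to disable raw content storage.
     Assumption: with this set to false, the fetcher still parses pages
     for outlinks/anchors but does not keep the page content itself. -->
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
</property>
```

If there is a better-supported way to crawl link structure only (for example a different combination of fetcher/parser settings), I would be glad to hear it.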

