Markus, thanks for the prompt response. You helped me a lot. I set fetcher.store.content = false and fetcher.parse = true. I used readlinkdb instead of Solr and it works great.
Regards,
Tomasz

2016-02-12 18:07 GMT+01:00 Markus Jelsma <[email protected]>:

> Hello Tomasz, see inline.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: Tomasz <[email protected]>
> > Sent: Friday 12th February 2016 17:47
> > To: [email protected]
> > Subject: Connections between pages, Solr schema, URL filtering
> >
> > Hello,
> >
> > I'm developing a project which focuses on the connections (links)
> > between given websites more than on the content they provide. I chose
> > Nutch to crawl those websites and have read a lot about the software,
> > but there are still some questions/issues which I hope can be solved
> > with your great help.
> >
> > First of all, I don't need to store/index content; I only need to
> > preserve links with anchors. What is the fetcher.store.content setting
> > for? Is it possible not to store the content of pages but to extract
> > and store only links with anchors, and follow those links during
> > crawling?
>
> fetcher.store.content controls whether raw files are stored. You can
> disable it safely for your use case. You can also delete the parseText
> segment files; they contain extracted text, which you don't need. There
> is not yet a way to control that via config. By default, metadata and
> hyperlinks are stored.
>
> > In the end I would like to query Solr asking "where (on what websites)
> > are the links pointing to abc.com?" and get a result with a list of
> > pages pointing to abc.com with a given anchor text. Is that possible?
> > If yes, how should I prepare the schema?
>
> Well, I'd suggest just running the link inverter; it does exactly that.
> Check out the invertlinks and readlinkdb commands. It returns a list of
> inlinks for any given URL. No need for Solr here. Nutch has the
> index-anchors plugin; it requires the linkdb. It does not index the
> hyperlinks themselves but the anchors of inlinks. But it is patchable.
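For reference, a minimal sketch of the configuration and commands discussed above, assuming a Nutch 1.x install with its crawl data under crawl/ (the paths and output directory names are illustrative):

```shell
# In conf/nutch-site.xml, override the two properties mentioned above:
#   fetcher.store.content = false   (do not store raw page content)
#   fetcher.parse         = true    (parse pages during the fetch step)

# Invert outlinks from all fetched segments into a linkdb:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Dump the whole linkdb (inlink URLs plus anchor text) to text files:
bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump

# Or look up the inlinks of a single URL:
bin/nutch readlinkdb crawl/linkdb -url http://abc.com/
```

The dump lists, for each URL, its inlinks together with their anchor text, which answers the "who links to abc.com, and with what anchor?" question without involving Solr.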
> > Is the URL filter applied every time I generate a segment, and to
> > every segment already generated, or only to the newest one? Suppose I
> > run generate/fetch/parse/update a few times and after that change the
> > URL filtering (using regex-urlfilter.txt) - will it be applied to all
> > the links in the db, so that in some cases links which didn't pass the
> > filter earlier can be included this time and be fetched in the next go?
>
> URL filters run at various stages, but always at fetch/parse time.
>
> > Many thanks in advance,
> > Tomasz
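As an aside, a hypothetical regex-urlfilter.txt fragment for this use case, with a quick grep-based sanity check of the accept pattern (the rules and patterns below are illustrative, not from this thread; Nutch applies the first matching rule, top to bottom):

```shell
# Hypothetical conf/regex-urlfilter.txt fragment: skip static assets,
# accept only abc.com and its subdomains, reject everything else.
cat > regex-urlfilter.txt <<'EOF'
-\.(gif|jpg|png|css|js)$
+^https?://([a-z0-9-]+\.)*abc\.com/
-.
EOF

# The accept pattern can be sanity-checked outside Nutch with grep -E:
echo "http://www.abc.com/page.html" \
  | grep -qE '^https?://([a-z0-9-]+\.)*abc\.com/' && echo "accepted"
```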

