Markus, thanks for the prompt response. You helped me a lot. I set fetcher.store.content = false and fetcher.parse = true. I used readlinkdb instead of Solr and it works great.
Regards,
Tomasz

2016-02-12 18:07 GMT+01:00 Markus Jelsma <[email protected]>:

> Hello Tomasz, see inline.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: Tomasz <[email protected]>
> > Sent: Friday 12th February 2016 17:47
> > To: [email protected]
> > Subject: Connections between pages, Solr schema, URL filtering
> >
> > Hello,
> >
> > I'm developing a project which focuses on the connections (links)
> > between given websites more than on the content they provide. I chose
> > Nutch to crawl those websites and have read a lot about the software,
> > but there are still some questions/issues which I hope can be solved
> > with your great help.
> >
> > First of all, I don't need to store/index content; I only need to
> > preserve links with anchors. What is the fetcher.store.content setting
> > for? Is it possible not to store the content of pages but to extract
> > and store only links with anchors, and follow those links during
> > crawling?
>
> fetcher.store.content controls whether raw files are stored. You can
> disable it safely for your use case. You can also delete the parseText
> segment files; they contain extracted text, which you don't need. There
> is not yet a way to control that via config. By default, metadata and
> hyperlinks are stored.
>
> > In the end I would like to query Solr asking "where (on what websites)
> > are the links pointing to abc.com?" and get a result with a list of
> > pages pointing to abc.com with a given anchor text. Is that possible?
> > If yes, how should I prepare the schema?
>
> Well, I'd suggest just running the link inverter; it does exactly that.
> Check out the invertlinks and readlinkdb commands. It returns a list of
> inlinks for any given URL. No need for Solr here. Nutch has the
> index-anchors plugin; it requires the linkdb. It does not index the
> hyperlinks themselves but the anchors of inlinks. But it is patchable.
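For reference, a minimal sketch of the configuration and commands discussed above, assuming a Nutch 1.x install with its crawl data under crawl/ (the paths and output directory names are illustrative):

```shell
# In conf/nutch-site.xml, override the two properties mentioned above:
#   fetcher.store.content = false   (do not store raw page content)
#   fetcher.parse         = true    (parse pages during the fetch step)

# Invert outlinks from all fetched segments into a linkdb:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Dump the whole linkdb (inlink URLs plus anchor text) to text files:
bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump

# Or look up the inlinks of a single URL:
bin/nutch readlinkdb crawl/linkdb -url http://abc.com/
```

The dump lists, for each URL, its inlinks together with their anchor text, which answers the "who links to abc.com, and with what anchor?" question without involving Solr.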
> > Is the URL filter applied every time I generate a segment, and to
> > every segment already generated, or only to the newest one? Suppose I
> > run generate/fetch/parse/update a few times and after that change the
> > URL filtering (using regex-urlfilter.txt) - will it be applied to all
> > the links in the db, so that in some cases links which didn't pass the
> > filter earlier can be included this time and be fetched in the next go?
>
> URL filters run at various stages, but always at fetch/parse time.
>
> > Many thanks in advance,
> > Tomasz
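As an aside, a hypothetical regex-urlfilter.txt fragment for this use case, with a quick grep-based sanity check of the accept pattern (the rules and patterns below are illustrative, not from this thread; Nutch applies the first matching rule, top to bottom):

```shell
# Hypothetical conf/regex-urlfilter.txt fragment: skip static assets,
# accept only abc.com and its subdomains, reject everything else.
cat > regex-urlfilter.txt <<'EOF'
-\.(gif|jpg|png|css|js)$
+^https?://([a-z0-9-]+\.)*abc\.com/
-.
EOF

# The accept pattern can be sanity-checked outside Nutch with grep -E:
echo "http://www.abc.com/page.html" \
  | grep -qE '^https?://([a-z0-9-]+\.)*abc\.com/' && echo "accepted"
```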

