RE: Is there any way to block the hubpages while crawling

2018-03-20 Thread Markus Jelsma
Hello Shiva, Yes, that is possible, but our approach is not a foolproof solution. We got our first hub classifier years ago in the form of a simple ParseFilter backed by an SVM. The model was built solely on the HTML of positive and negative examples, with very few features, so it was extremely

Re: Is there any way to block the hubpages while crawling

2018-03-20 Thread Michael Coffey
I think you will find that you need different rules for each website and that some amount of maintenance will be needed as the websites change their practices.

Re: Is there any way to block the hubpages while crawling

2018-03-20 Thread Sebastian Nagel
Hi,
> more control over what is being indexed?
It's possible to enable URL filters for the indexer:
    bin/nutch index ... -filter
With little extra effort you can use different URL filter rules during the index step, e.g. in local mode by pointing NUTCH_CONF_DIR to a different folder.
>> I
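A minimal sketch of the approach Sebastian describes, for local mode only. The folder name `conf-index`, the `crawl/...` paths, and the hub URL patterns are all hypothetical and would need to be adapted per site; the sketch also assumes the regex URL filter plugin is enabled and reads `regex-urlfilter.txt`, as in a default setup.

```shell
#!/bin/sh
# Sketch only: directory names, crawl paths, and URL patterns are assumptions.
set -eu

NUTCH_HOME="${NUTCH_HOME:-.}"     # assumed Nutch runtime directory
INDEX_CONF="$PWD/conf-index"      # hypothetical second conf folder

# Start from a copy of the crawl-time configuration when one exists.
mkdir -p "$INDEX_CONF"
if [ -d "$NUTCH_HOME/conf" ]; then
  cp -R "$NUTCH_HOME/conf/." "$INDEX_CONF/"
fi

# Index-time rules: the first matching line decides,
# '-' rejects a URL and '+' accepts it.
cat > "$INDEX_CONF/regex-urlfilter.txt" <<'EOF'
# reject assumed hub/listing pages at index time only (made-up patterns)
-^https?://[^/]+/?$
-/(category|tag|archive)/
# accept everything else
+.
EOF

# Run the indexer with the alternate conf dir and URL filtering enabled.
# Skipped when no Nutch launcher is found at the assumed location.
if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  NUTCH_CONF_DIR="$INDEX_CONF" "$NUTCH_HOME/bin/nutch" index \
    crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -filter
fi
```

With this split, the crawl-time filter can keep accepting hub pages so their outlinks are still followed, while the index step drops them. As noted elsewhere in the thread, such patterns tend to be site-specific and need maintenance.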

Re: Is there any way to block the hubpages while crawling

2018-03-18 Thread BlackIce
Basically what you're saying is that you need more control over what is being indexed? That's an excellent question!
Greetz!
On Mar 17, 2018 11:46 AM, "ShivaKarthik S" wrote:
> Hi,
>
> Is there any way to block the hub pages & index only the articles from the
>