Hello Shiva,
Yes, that is possible, but our approach is not a foolproof solution.
We got our first hub classifier years ago in the form of a simple ParseFilter
backed by an SVM. The model was built solely on the HTML of positive and
negative examples, with very few features, so it was extremely
I think you will find that you need different rules for each website and that
some amount of maintenance will be needed as the websites change their
practices.
Hi,
> more control over what is being indexed?
It's possible to enable URL filters for the indexer:
bin/nutch index ... -filter
With a little extra effort you can use different URL filter rules
during the index step, e.g. in local mode by pointing NUTCH_CONF_DIR
to a different folder.
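
To illustrate what index-time rules could look like, here is a minimal Python sketch that mimics how regex URL filter rules are evaluated: a leading '+' accepts, a leading '-' rejects, the first rule whose regex is found in the URL decides, and a URL that matches no rule is rejected. This is not Nutch code. The host news.example.com, the patterns, and the helper names parse_rules and accepts are assumptions made up for the example; real hub pages will need site-specific patterns.

```python
import re

# Hypothetical rules in the style of a regex URL filter file.
# These patterns are assumptions for illustration only.
INDEX_RULES = r"""
# reject hub-like listing pages
-^https?://news\.example\.com/(category|tag|archive)/
-[?&]page=\d+
# accept everything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the decision of the first rule found in the URL; reject if none match."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False


RULES = parse_rules(INDEX_RULES)

if __name__ == "__main__":
    for url in (
        "https://news.example.com/2018/03/17/some-article.html",
        "https://news.example.com/category/politics/",
        "https://news.example.com/latest?page=2",
    ):
        print("index" if accepts(RULES, url) else "skip", url)
```

Rules like these would live in the URL filter file of the separate configuration folder used for the index step, so the crawl itself can still follow hub pages to discover articles.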
>> I
Basically what you're saying is that you need more control over what is
being indexed?
That's an excellent question!
Greetz!
On Mar 17, 2018 11:46 AM, "ShivaKarthik S"
wrote:
> Hi,
>
> Is there any way to block the hub pages & index only the articles from the
>