RE: Is there any way to block the hubpages while crawling

2018-03-20 Thread Markus Jelsma
Hello Shiva,

Yes, that is possible, but our solution is not foolproof.

We got our first hub classifier years ago in the form of a simple ParseFilter 
backed by an SVM. The model was built solely on the HTML of positive and 
negative examples, with very few features, so it was extremely unreliable for 
sites that weren't part of the training set.

Today we operate a hierarchical set of SVMs that draw a large number of 
features from pre-analyzed HTML structures. This helped a great deal: first 
we determine what kind of website it is, and only then whether a page is a 
hub. It is much easier to recognize a hub page once you know whether the 
site is a forum, a regular news/blog site, a wiki, or a webshop.

I know this is not the answer you are looking for, but if you analyze the 
HTML, extract data structures from it, and use those as features for SVMs, 
you are on your way. At least it worked for us.
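
To give an idea, here is a minimal sketch of the kind of feature extraction 
I mean (the class, features, and threshold-free ratio are illustrative, not 
our production code; it assumes you already have a parsed W3C DOM, such as 
the DocumentFragment that Nutch hands to an HtmlParseFilter):

    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    /** Illustrative structural features for hub-vs-article classification. */
    public class HubFeatures {
      public int anchorCount;       // number of <a> elements
      public int textLength;        // characters of text overall
      public int anchorTextLength;  // characters of text inside <a> elements

      /** Hubs tend toward 1.0 (mostly link text), articles toward 0. */
      public double linkTextRatio() {
        return textLength == 0 ? 0.0 : (double) anchorTextLength / textLength;
      }

      /** Recursively walk the DOM, accumulating the counts above. */
      public void collect(Node node, boolean insideAnchor) {
        boolean isAnchor = node.getNodeType() == Node.ELEMENT_NODE
            && "a".equalsIgnoreCase(node.getNodeName());
        if (isAnchor) {
          anchorCount++;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
          int len = node.getNodeValue().trim().length();
          textLength += len;
          if (insideAnchor) {
            anchorTextLength += len;
          }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          collect(children.item(i), insideAnchor || isAnchor);
        }
      }
    }

Ratios and counts like these, computed per site type, are the kind of 
features worth feeding to the SVMs.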

Regards,
Markus


Re: Is there any way to block the hubpages while crawling

2018-03-20 Thread Michael Coffey
I think you will find that you need different rules for each website and that 
some amount of maintenance will be needed as the websites change their 
practices.


Re: Is there any way to block the hubpages while crawling

2018-03-20 Thread Sebastian Nagel
Hi,

> more control over what is being indexed?

It's possible to enable URL filters for the indexer:
   bin/nutch index ... -filter
With a little extra effort you can use different URL filter rules
during the index step, e.g. in local mode by pointing NUTCH_CONF_DIR
to a different folder.
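
For example (the conf directory and segment path are made up):

   NUTCH_CONF_DIR=/path/to/index-conf bin/nutch index crawl/crawldb \
     -linkdb crawl/linkdb crawl/segments/20180320123456 -filter

where /path/to/index-conf contains a regex-urlfilter.txt with the stricter
index-time rules.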

>> I can't generalize any rule

What about classifying hubs by the number of outlinks?
Then you could skip those pages with an indexing filter: just return
null for any document that should be skipped.
For a larger crawl you'll definitely get lost with URL filters.
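
A minimal sketch of such an indexing filter (plugin packaging via plugin.xml
and plugin.includes is omitted; the property name and default threshold are
invented):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    /** Sketch: drop documents that look like hubs (too many outlinks). */
    public class HubIndexingFilter implements IndexingFilter {
      private Configuration conf;
      private int maxOutlinks;

      @Override
      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) {
        // Returning null removes the document from the index.
        if (parse.getData().getOutlinks().length > maxOutlinks) {
          return null;
        }
        return doc;
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
        // "index.hub.max.outlinks" is an invented property name.
        this.maxOutlinks = conf.getInt("index.hub.max.outlinks", 100);
      }

      @Override
      public Configuration getConf() {
        return conf;
      }
    }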

Maybe you can also treat this as a ranking problem: if hub pages only
need to be penalized rather than excluded, you could apply simple but
noisy heuristics.
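
A sketch of that variant, assuming the AbstractScoringFilter convenience
base class from recent 1.x releases; the threshold and penalty factor are
arbitrary:

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.scoring.AbstractScoringFilter;

    /** Sketch: demote, rather than drop, likely hub pages at index time. */
    public class HubPenaltyScoringFilter extends AbstractScoringFilter {
      @Override
      public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
          CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) {
        // Noisy heuristic: many outlinks => probably a hub => lower boost.
        if (parse.getData().getOutlinks().length > 100) {
          return initScore * 0.1f;
        }
        return initScore;
      }
    }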

Best,
Sebastian



Re: Is there any way to block the hubpages while crawling

2018-03-18 Thread BlackIce
Basically what you're saying is that you need more control over what is
being indexed?

That's an excellent question!

Greetz!

On Mar 17, 2018 11:46 AM, "ShivaKarthik S" wrote:

> Hi,
>
> Is there any way to block the hub pages and index only the articles from
> the websites? I want to index only the articles and not the hub pages.
> Hub pages will be crawled and their outlinks extracted, but at indexing
> time only the articles should be indexed.
> E.g.
> www.abc.com/xyz & www.abc.com/abc are hub pages, and www.abc.com/xyz/1.html
> & www.abc.com/ABC/1.html are articles.
>
> In this case I can block all the URLs not ending with .html, .aspx, .jsp,
> or another extension. But not all websites follow the same format: some
> use .html for hub pages as well as articles, and some use no extension
> for both. Given these cases, I can't generalize a rule saying that a URL
> without an extension is a hub page and one with an extension is an
> article. Is there any way this can be handled in Nutch 1.x?
>
> Thanks & regards
> Shiva
>
>
> --
> Thanks and Regards
> Shiva
>