[ 
https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600327#comment-14600327
 ] 

Sujen Shah edited comment on NUTCH-2047 at 6/24/15 11:09 PM:
-------------------------------------------------------------

This file is a dump of the top 1000 URLs. 
The model file contained information related to robotics from a Wikipedia 
article, and the seed list was CMU's Robotics Institute homepage. 

The top few URLs are marked with the same score because most of them are 
unfetched and have simply inherited the score distributed by their parent URL. 



> Improvements to the relevance scoring plugin
> --------------------------------------------
>
>                 Key: NUTCH-2047
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2047
>             Project: Nutch
>          Issue Type: Improvement
>          Components: scoring
>            Reporter: Sujen Shah
>              Labels: memex
>             Fix For: 1.11
>
>         Attachments: part-00000
>
>
> This issue is for discussing the results of, and improvements to, the 
> scoring-similarity plugin, which uses the cosine similarity model. 
> Currently, outlinks are assigned the same score as their parent URL, 
> which means an irrelevant URL (with a relevant parent) is fetched for 
> one more round before it gets a lower score and is filtered out. So we 
> require one additional fetch/parse to score these irrelevant URLs (from 
> relevant parents) lower. 
> Any suggestions on this are appreciated. 
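
To make the behaviour described above concrete, here is a minimal sketch in 
Python. It is not the plugin's actual code: the function names, the 
whitespace tokenizer, the sample texts, and the example.org URLs are all 
assumptions made for illustration. It only shows the general idea that an 
unfetched outlink carries its parent's cosine similarity score until it is 
fetched and parsed itself.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector.

    Assumption: a plain lowercase/whitespace tokenizer, chosen only to keep
    the sketch short.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def distribute_score_to_outlinks(parent_score, outlinks):
    """Hypothetical helper: every outlink inherits the parent's score."""
    return {url: parent_score for url in outlinks}


# Stand-ins for the model file and the parsed text of a relevant parent page.
model = term_vector("robot robotics autonomous manipulation sensor")
parent = term_vector("robotics institute robot research autonomous systems")
parent_score = cosine_similarity(model, parent)

# Round 1: both outlinks are unfetched, so both carry the parent's score.
outlinks = ["http://example.org/robot-lab", "http://example.org/parking"]
round1 = distribute_score_to_outlinks(parent_score, outlinks)

# Round 2: only after fetch/parse can each page be scored on its own text.
fetched_text = {
    "http://example.org/robot-lab": "robot manipulation lab autonomous sensor",
    "http://example.org/parking": "campus parking permits and visitor information",
}
round2 = {
    url: cosine_similarity(model, term_vector(text))
    for url, text in fetched_text.items()
}
```

In round 1 the irrelevant page is indistinguishable from the relevant one, 
which matches the identical scores at the top of the dump. Only in round 2, 
after the extra fetch/parse, does its score drop.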



