Hi all,

 

   This might be a pretty trivial question, but I'm hung up on it.  I've
got a crawl, it's displaying through the Java servlet, and the RSS feed
works great, but I'm getting two results per hostname.  Not more than
that, just two.  I thought this could be reined in with
searcher.hostgrouping.rawhits.factor, but that doesn't seem to be the
case.  I'm trying to bring this down to one result per hostname.

 

   A little further digging makes me believe that I'm also a victim of
the MD5 hash bug <https://issues.apache.org/jira/browse/NUTCH-835>.
Even so, there are definitely cases where the results aren't duplicates
but are too similar to display one right after another (e.g.
http://www.dunkmall.com/ and http://www.dunkmall.com/order.php).

 

   Any ideas?  Is there a config setting I'm missing (hopefully)?
Alternatively, do I have to dig into how the searcher works?

 

Thanks!

Rob
