Is this a bug in Nutch 1.2 when searching the index?

leibnitz Thu, 23 Jun 2011 06:49:13 -0700

I used Nutch 1.2 to crawl, then deployed the WAR to Tomcat. When I search
by site with the query
site:mail.163.com
the following results are listed:
163网易免费邮--中文邮箱第一品牌
163网易免费邮--中文邮箱第一品牌 中文邮箱第 ...
http://mail.163.com/ (cached) (explain) (anchors) (more from mail.163.com)


163网易免费邮--中文邮箱第一品牌
163网易免费邮--中文邮箱第一品牌 中文邮箱第 ...
http://mail.163.com/ (cached) (explain) (anchors) (more from mail.163.com) 

Why does the same URL appear twice in a single search?
When I search again with the keyword "网易", the same duplication occurs.
*But* with some other keywords the results are de-duplicated. So why do the
two cases described above produce duplicates?

When I used Nutch 0.9, the results were all correct, so I looked at
NutchBean.java in Nutch 1.2. The key code in 1.2 is:
public Hits search(Query query) throws IOException {
    if (query.getParams().getMaxHitsPerDup() <= 0)   // disable dup checking
      return searchBean.search(query);

    final float rawHitsFactor =
        this.conf.getFloat("searcher.hostgrouping.rawhits.factor", 2.0f);
    int numHitsRaw = (int)(query.getParams().getNumHits() * rawHitsFactor);
    if (LOG.isInfoEnabled()) {
      LOG.info("searching for "+numHitsRaw+" raw hits");
    }
    Hits hits = searchBean.search(query);
    final long total = hits.getTotal();
    final Map<String, DupHits> dupToHits = new HashMap<String, DupHits>();
    final List<Hit> resultList = new ArrayList<Hit>();
    final Set<Hit> seen = new HashSet<Hit>();
    final List<String> excludedValues = new ArrayList<String>();
    boolean totalIsExact = true;
    int optimizeNum = 0;
    
    for (int rawHitNum = 0; rawHitNum < hits.getLength(); rawHitNum++) {
      // get the next raw hit
      if (rawHitNum == (hits.getLength() - 1) && (optimizeNum < MAX_OPTIMIZE_LOOPS)) {
        
        // increment the loop
        optimizeNum++;
        ...


The corresponding code in Nutch 0.9 is below:
public Hits search(Query query, int numHits,
                     int maxHitsPerDup, String dedupField,
                     String sortField, boolean reverse) throws IOException {
    if (maxHitsPerDup <= 0)   // disable dup checking (no de-duplication)
      return search(query, numHits, dedupField, sortField, reverse);

    float rawHitsFactor =
        this.conf.getFloat("searcher.hostgrouping.rawhits.factor", 2.0f);
    int numHitsRaw = (int)(numHits * rawHitsFactor);
    if (LOG.isInfoEnabled()) 
      LOG.info("searching for "+numHitsRaw+" raw hits");
    
    Hits hits = searcher.search(query, numHitsRaw, dedupField, sortField, reverse);
    long total = hits.getTotal();
    Map dupToHits = new HashMap();
    List resultList = new ArrayList();
    Set seen = new HashSet();
    List excludedValues = new ArrayList();
    boolean totalIsExact = true;
    for (int rawHitNum = 0; rawHitNum < hits.getTotal(); rawHitNum++) {
      // get the next raw hit
      if (rawHitNum >= hits.getLength()) {
     ....
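
To make clear what behaviour I expect from the dup checking, here is a minimal standalone sketch of capping hits per dedup value. It is not the Nutch code: the class DedupSketch, the type SimpleHit and the method dedup are hypothetical names made up for illustration, and the only thing taken from the excerpts above is the idea that maxHitsPerDup <= 0 disables the check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

  /** Hypothetical stand-in for a search hit: a URL plus its dedup field value (e.g. the site). */
  static final class SimpleHit {
    final String url;
    final String dedupValue;

    SimpleHit(String url, String dedupValue) {
      this.url = url;
      this.dedupValue = dedupValue;
    }
  }

  /** Keeps at most maxHitsPerDup hits per dedup value, preserving the original rank order. */
  static List<SimpleHit> dedup(List<SimpleHit> rawHits, int maxHitsPerDup) {
    if (maxHitsPerDup <= 0) {
      return new ArrayList<>(rawHits); // dup checking disabled
    }
    Map<String, Integer> countPerValue = new HashMap<>();
    List<SimpleHit> result = new ArrayList<>();
    for (SimpleHit hit : rawHits) {
      int alreadyKept = countPerValue.getOrDefault(hit.dedupValue, 0);
      if (alreadyKept < maxHitsPerDup) {
        result.add(hit);
        countPerValue.put(hit.dedupValue, alreadyKept + 1);
      }
      // otherwise the hit is dropped as a duplicate for this dedup value
    }
    return result;
  }

  public static void main(String[] args) {
    List<SimpleHit> raw = new ArrayList<>();
    raw.add(new SimpleHit("http://mail.163.com/", "mail.163.com"));
    raw.add(new SimpleHit("http://mail.163.com/", "mail.163.com"));
    for (SimpleHit hit : dedup(raw, 1)) {
      System.out.println(hit.url);
    }
  }
}
```

With a limit of one hit per site, I would expect the two mail.163.com entries above to collapse into a single result, which is what Nutch 0.9 gives me.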


Thanks in advance!


