I used this version (Nutch 1.2) to crawl, then deployed the WAR to Tomcat. When I search by site with site:mail.163.com, the following results are listed:

163网易免费邮--中文邮箱第一品牌 163网易免费邮--中文邮箱第一品牌 中文邮箱第 ...
http://mail.163.com/ (cached) (explain) (anchors) (more from mail.163.com)
163网易免费邮--中文邮箱第一品牌 163网易免费邮--中文邮箱第一品牌 中文邮箱第 ...
http://mail.163.com/ (cached) (explain) (anchors) (more from mail.163.com)

Why does the same URL appear twice in a single search? When I searched again with the keyword "网易" (NetEase), the same thing happened. *But* with some other keywords the results are de-duplicated. Why do these two different behaviours occur? With Nutch 0.9 the results are all correct.

So I looked at NutchBean.java in Nutch 1.2. Some of the key code from 1.2:

    public Hits search(Query query) throws IOException {
      if (query.getParams().getMaxHitsPerDup() <= 0)  // disable dup checking
        return searchBean.search(query);

      final float rawHitsFactor =
        this.conf.getFloat("searcher.hostgrouping.rawhits.factor", 2.0f);
      int numHitsRaw = (int)(query.getParams().getNumHits() * rawHitsFactor);
      if (LOG.isInfoEnabled()) {
        LOG.info("searching for "+numHitsRaw+" raw hits");
      }
      Hits hits = searchBean.search(query);
      final long total = hits.getTotal();
      final Map<String, DupHits> dupToHits = new HashMap<String, DupHits>();
      final List<Hit> resultList = new ArrayList<Hit>();
      final Set<Hit> seen = new HashSet<Hit>();
      final List<String> excludedValues = new ArrayList<String>();
      boolean totalIsExact = true;
      int optimizeNum = 0;
      for (int rawHitNum = 0; rawHitNum < hits.getLength(); rawHitNum++) {
        // get the next raw hit
        if (rawHitNum == (hits.getLength() - 1) && (optimizeNum < MAX_OPTIMIZE_LOOPS)) {
          // increment the loop
          optimizeNum++;
          ...
The corresponding code in Nutch 0.9 is below:

    public Hits search(Query query, int numHits, int maxHitsPerDup, String dedupField,
                       String sortField, boolean reverse) throws IOException {
      if (maxHitsPerDup <= 0)  // disable dup checking (no de-duplication)
        return search(query, numHits, dedupField, sortField, reverse);

      float rawHitsFactor =
        this.conf.getFloat("searcher.hostgrouping.rawhits.factor", 2.0f);
      int numHitsRaw = (int)(numHits * rawHitsFactor);
      if (LOG.isInfoEnabled())
        LOG.info("searching for "+numHitsRaw+" raw hits");
      Hits hits = searcher.search(query, numHitsRaw, dedupField, sortField, reverse);
      long total = hits.getTotal();
      Map dupToHits = new HashMap();
      List resultList = new ArrayList();
      Set seen = new HashSet();
      List excludedValues = new ArrayList();
      boolean totalIsExact = true;
      for (int rawHitNum = 0; rawHitNum < hits.getTotal(); rawHitNum++) {
        // get the next raw hit
        if (rawHitNum >= hits.getLength()) {
          ....

Thanks in advance!

--
View this message in context: http://lucene.472066.n3.nabble.com/is-it-a-bug-within-nutch-1-2-when-searching-the-index-tp3098675p3098675.html
Sent from the Nutch - User mailing list archive at Nabble.com.
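
P.S. For reference, the behaviour expected from the dup check can be sketched as follows. This is only a minimal sketch of the idea (keep at most maxHitsPerDup hits per value of the dedup field), not the Nutch implementation: the class DedupSketch, the Hit stand-in, the method limitPerSite and the example.org URL are all hypothetical, and the real code appears to do more (for example the excludedValues list and the raw-hits factor seen in the snippets above).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Nutch.
class DedupSketch {

    /** Minimal stand-in for a search hit: the dedup field value (site) plus the URL. */
    static final class Hit {
        final String site;
        final String url;

        Hit(String site, String url) {
            this.site = site;
            this.url = url;
        }
    }

    /**
     * Keeps at most maxHitsPerDup hits per site, preserving rank order.
     * A value <= 0 disables the check, mirroring the early return in the quoted code.
     */
    static List<Hit> limitPerSite(List<Hit> rawHits, int maxHitsPerDup) {
        if (maxHitsPerDup <= 0) {
            return new ArrayList<>(rawHits);
        }
        Map<String, Integer> keptPerSite = new HashMap<>();
        List<Hit> result = new ArrayList<>();
        for (Hit hit : rawHits) {
            int kept = keptPerSite.getOrDefault(hit.site, 0);
            if (kept < maxHitsPerDup) {
                result.add(hit);
                keptPerSite.put(hit.site, kept + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Hit> raw = new ArrayList<>();
        raw.add(new Hit("mail.163.com", "http://mail.163.com/"));
        raw.add(new Hit("mail.163.com", "http://mail.163.com/"));   // same site again
        raw.add(new Hit("example.org", "http://example.org/"));     // made-up second site
        for (Hit hit : limitPerSite(raw, 1)) {
            System.out.println(hit.url);
        }
    }
}
```

With maxHitsPerDup set to 1, the second mail.163.com hit would be dropped from the list, which is what Nutch 0.9 shows for the site: query and what 1.2 does not.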

