Hi Junqiang, looks like you uncovered a bug which is there since years. :( Gongrats and thanks!
See [1] for insert(element) and lessThan(...): Adds element to the PriorityQueue in log(size) time if either the PriorityQueue is not full, or not lessThan(element, top()). In CrawlDbReducer lessThan(Object arg0, Object arg1) returns true if arg0.getScore() > arg1.getScore() resp. if element.getScore() > top().getScore() Then the expression "not lessThan(element, top())" is false and arg0/element isn't added although its score is larger than the least in the InlinkPriorityQueue. The lowest ranking inlinks are kept which is in contradiction to <property> <name>db.update.max.inlinks</name> <value>10000</value> <description>Maximum number of inlinks to take into account when updating a URL score in the crawlDB. Only the best scoring inlinks are kept. </description> </property> > In line 143, which element will be poped by linked.pop()? The CrawlDatum > with the highest score or the one with the lowest score? According to [2] pop() Removes and returns the least element of the PriorityQueue in log(size) time. In this case the behavior is correct though I think that's unintentionally - the lessThan(...) method "defines" the element with the highest score as the "least" one. So, the highest scoring of the 10000 lowest scoring elements is taken first. Of course, it should be "the highest scoring of the 10000 highest scoring elements" At least, afaics. It's not a easy to understand... Could you open an issue on https://issues.apache.org/jira/browse/NUTCH to track the problem? Thanks, Sebastian [1] https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/util/PriorityQueue.html#insert(T) [2] https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/util/PriorityQueue.html#pop() On 05/05/2017 05:50 AM, Junqiang Zhang wrote: > Hello, > > I have a question regarding some code in > org.apache.nutch.crawl.CrawlDbReducer (Nutch version 1.13). > > How does Lines 333-338 define the lessThan() function? Is arg0 with the > higher score of the two arguments defined as the less element of the two? > Which of arg0 and arg1 is poped out first? > > In line 143, which element will be poped by linked.pop()? The CrawlDatum > with the highest score or the one with the lowest score? > > Thanks. > Junqiang >

