[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ]
Stefan Groschupf updated NUTCH-348:
-----------------------------------

    Attachment: sortPatchV1.patch

What do people think about this kind of solution?

> Generator is building fetch list using *lowest* scoring URLs
> ------------------------------------------------------------
>
>                 Key: NUTCH-348
>                 URL: http://issues.apache.org/jira/browse/NUTCH-348
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Chris Schneider
>         Attachments: sortPatchV1.patch
>
>
> Ever since revision 391271, when the CrawlDatum key was replaced by a
> FloatWritable key, the Generator.Selector.reduce method has been outputting
> the *lowest* scoring URLs! The CrawlDatum class has a Comparator that
> essentially treats higher scoring CrawlDatum objects as "less than" lower
> scoring CrawlDatum objects, so the higher scoring ones would appear first in
> a sequence file sorted using this as the key.
> When a FloatWritable based on the score itself (as returned from
> scfilters.generatorSortValue) became the sort key, it should have been
> negated in Generator.Selector.map to have the same result. Curiously, there
> is a comment to this effect immediately before the FloatWritable is set:
>       // sort by decreasing score
>       sortValue.set(sort);
> It seems like the simplest way to fix this is to just negate the score, and
> this seems to work for me:
>       // sort by decreasing score
>       // 2006-08-15 CSc REALLY sort by decreasing score
>       sortValue.set(-sort);
> Unfortunately, this means that any crawls that have been done using
> Generator.java after revision 391271 should be discarded, as they were
> focused on fetching the lowest scoring unfetched URLs in the crawldb,
> essentially pointing the crawler 180 degrees from its intended direction.
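
For anyone who wants to see the ordering problem from the issue in isolation, here is a minimal standalone sketch in plain Java. It does not use any Hadoop or Nutch classes: a TreeMap stands in for the ascending sort on the float key, and the class name, the select helper, and the sample scores are all made up for illustration. It only assumes that keys arrive in ascending order and that the first N entries are taken.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: not Nutch code, all names here are hypothetical.
class SortKeyDemo {

    // Sort URLs ascending by a float key and return the first `limit` of them.
    // With negate == false the key is the raw score; with negate == true
    // the key is the negated score, as proposed in the issue.
    static List<String> select(Map<String, Float> scores, boolean negate, int limit) {
        // TreeMap iterates its keys in ascending order, which stands in
        // for the sorted key stream seen by the reducer.
        TreeMap<Float, List<String>> sorted = new TreeMap<>();
        for (Map.Entry<String, Float> e : scores.entrySet()) {
            float key = negate ? -e.getValue() : e.getValue();
            sorted.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }
        List<String> selected = new ArrayList<>();
        for (List<String> urls : sorted.values()) {
            for (String url : urls) {
                if (selected.size() == limit) {
                    return selected;
                }
                selected.add(url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("a", 0.1f);
        scores.put("b", 0.5f);
        scores.put("c", 2.0f);
        scores.put("d", 1.0f);

        // Raw score as key: the two lowest scoring URLs come first.
        System.out.println(select(scores, false, 2)); // prints [a, b]
        // Negated score as key: the two highest scoring URLs come first.
        System.out.println(select(scores, true, 2));  // prints [c, d]
    }
}
```

With the raw score as the key, the first entries taken are the lowest scoring ones, which matches the behaviour reported above; negating the key reverses that without touching the sort itself.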