[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved NUTCH-570.
------------------------------------

    Resolution: Won't Fix

> Improvement of URL Ordering in Generator.java
> ---------------------------------------------
>
>                 Key: NUTCH-570
>                 URL: https://issues.apache.org/jira/browse/NUTCH-570
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>            Reporter: Ned Rockson
>            Assignee: Otis Gospodnetic
>            Priority: Minor
>         Attachments: GeneratorDiff.out, GeneratorDiff_v1.out
>
>
> [Copied directly from my email to nutch-dev list]
> Recently I switched from Fetcher to Fetcher2 for larger whole-web fetches
> (50-100M URLs at a time). I found that the generated URL ordering is not
> optimal, because the URLs are simply randomized by a hash comparator. In one
> crawl on 24 machines it took about 3 days to fetch 30M URLs. Compared with
> old benchmarks I had set with the regular Fetcher.java, this took at least
> three times as long.
> I realized that randomization only approximates the best ordering; for an
> optimal ordering, URLs from the same host should be as far apart in the list
> as possible. So I wrote a series of 2 map/reduce jobs to optimize the
> ordering, and for a list of 25M documents they take about 10 minutes on our
> cluster. Right now I have this in its own class, but I figure it could go in
> Generator.java, with a flag in nutch-default.xml that determines whether the
> user wants to use it.
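
For readers who want to see what "URLs from the same host as far apart as possible" can mean in practice, here is a minimal single-machine sketch in Java. It is not the code from the attached GeneratorDiff patches and it does not use map/reduce; it only illustrates one possible way to get such an ordering. The class name HostSpreadSketch, the method spreadByHost, and the (k + 0.5) / n position key are assumptions made for this illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of Nutch and not the reporter's patch.
class HostSpreadSketch {

    /** A URL paired with its target position in the range (0, 1). */
    private static final class Keyed {
        final String url;
        final double key;

        Keyed(String url, double key) {
            this.url = url;
            this.key = key;
        }
    }

    /**
     * Reorders urls so that URLs sharing a host are spread evenly over the list.
     * Throws IllegalArgumentException if a URL cannot be parsed by java.net.URI.
     */
    static List<String> spreadByHost(List<String> urls) {
        // Step 1: group URLs by host, preserving first-seen host order.
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            String groupKey = (host == null) ? "" : host; // URLs without a host share one group
            byHost.computeIfAbsent(groupKey, h -> new ArrayList<>()).add(url);
        }

        // Step 2: the k-th of n URLs for a host gets key (k + 0.5) / n, so each
        // host's URLs are evenly spaced over the interval (0, 1).
        List<Keyed> keyed = new ArrayList<>();
        for (List<String> group : byHost.values()) {
            int n = group.size();
            for (int k = 0; k < n; k++) {
                keyed.add(new Keyed(group.get(k), (k + 0.5) / n));
            }
        }

        // Step 3: sort all URLs by key. List.sort is stable, so ties keep host order.
        keyed.sort(Comparator.comparingDouble((Keyed x) -> x.key));

        List<String> result = new ArrayList<>(keyed.size());
        for (Keyed item : keyed) {
            result.add(item.url);
        }
        return result;
    }
}
```

With two URLs each from a.com and b.com, the sketch alternates the two hosts instead of fetching one host back to back. Grouping by host and then sorting by a computed key are both steps that could be expressed as map/reduce jobs, but whether the reporter's two jobs work this way is not stated in the issue.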