[ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved NUTCH-570.
------------------------------------

    Resolution: Won't Fix

> Improvement of URL Ordering in Generator.java
> ---------------------------------------------
>
>                 Key: NUTCH-570
>                 URL: https://issues.apache.org/jira/browse/NUTCH-570
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>            Reporter: Ned Rockson
>            Assignee: Otis Gospodnetic
>            Priority: Minor
>         Attachments: GeneratorDiff.out, GeneratorDiff_v1.out
>
>
> [Copied directly from my email to nutch-dev list]
> Recently I switched from Fetcher to Fetcher2 for larger whole-web fetches
> (50-100M URLs at a time).  I found that the generated URL ordering is not
> optimal because the URLs are simply randomized by a hash comparator.  In one
> crawl on 24 machines it took about 3 days to crawl 30M URLs.  Compared with
> old benchmarks I had set with the regular Fetcher.java, this was at least
> three times longer.
> Anyway, I realized that randomization can approach the best ordering, but
> for an optimal ordering, URLs from the same host should be as far apart in
> the list as possible.  So I wrote a series of two map/reduce jobs to optimize
> the ordering; for a list of 25M documents it takes about 10 minutes on our
> cluster.  Right now I have it in its own class, but I figure it can go into
> Generator.java, with a flag in nutch-default.xml that determines whether the
> user wants to use it.
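
The ordering goal described above (URLs from the same host kept far apart in the list) can be illustrated with a minimal single-machine sketch. This is not the code from the attached patches, and it does not use map/reduce: the class name HostSpreadSketch and the method spreadByHost are hypothetical, and a simple round-robin over hosts is assumed as the spreading strategy. Round-robin only approximates "as far apart as possible"; if one host holds most of the URLs, its leftovers still end up adjacent at the tail of the list.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch or of the NUTCH-570 patches.
public class HostSpreadSketch {

    // Returns the input URLs reordered by taking one URL per host in turn,
    // so that URLs sharing a host are separated by URLs from other hosts.
    public static List<String> spreadByHost(List<String> urls) {
        // Group URLs by host, preserving first-seen host order.
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                host = ""; // URLs without a parsable host share one bucket
            }
            byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }

        // Round-robin: each pass emits one URL from every host that still has some.
        List<String> ordered = new ArrayList<>(urls.size());
        while (!byHost.isEmpty()) {
            Iterator<Deque<String>> it = byHost.values().iterator();
            while (it.hasNext()) {
                Deque<String> queue = it.next();
                ordered.add(queue.poll());
                if (queue.isEmpty()) {
                    it.remove(); // host exhausted, drop it from later passes
                }
            }
        }
        return ordered;
    }
}
```

For example, given two URLs on a.com followed by one each on b.com and c.com, the sketch emits the first a.com URL, then the b.com and c.com URLs, and only then the second a.com URL.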

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
