[jira] Closed: (NUTCH-188) Add searchable mailing list links to http://lucene.apache.org/nutch/mailing_lists.html
[ http://issues.apache.org/jira/browse/NUTCH-188?page=all ]

Jerome Charron closed NUTCH-188:

    Fix Version: 0.8-dev
    Resolution: Fixed

Duplicated with NUTCH-214

Add searchable mailing list links to http://lucene.apache.org/nutch/mailing_lists.html
--
    Key: NUTCH-188
    URL: http://issues.apache.org/jira/browse/NUTCH-188
    Project: Nutch
    Type: Improvement
    Reporter: Andy Liu
    Priority: Trivial
    Fix For: 0.8-dev
    Attachments: mailing_list.patch

Post links to searchable mail archives on nutch.org

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
Re: URL Partitioning (Lexical vs. IP Address)
Chris Schneider wrote:
> My experience recently seeing attempted fetches of many ingrida.be
> URLs made me question the Nutch 0.8 algorithm for partitioning URLs
> among TaskTrackers (and their child processes). As I understand it,
> Nutch doesn't worry about two lexically distinct domains (e.g.,
> inherit-the-wind.ingrida.be and clancy-brown.ingrida.be) being
> fetched simultaneously, even though they might actually resolve to
> the same IP address (66.154.11.25 in this case).

That is correct. Nutch 0.8 currently treats each lexically distinct domain as a separate domain.

IP-based partitioning is possible: one would merely need to change PartitionUrlByHost.java to hash the IP address of the host instead of its name. If this turns out to be too slow, we could cache the IP address in the CrawlDatum, which is available when we perform this partitioning. But one should probably run a caching DNS server when fetching anyway, so hopefully that would not be required.

I've attached a patch. Tell me if it works and if it noticeably slows fetching for you.

Doug

Index: src/java/org/apache/nutch/crawl/PartitionUrlByHost.java
===
--- src/java/org/apache/nutch/crawl/PartitionUrlByHost.java (revision 379848)
+++ src/java/org/apache/nutch/crawl/PartitionUrlByHost.java (working copy)
@@ -17,6 +17,8 @@
 package org.apache.nutch.crawl;
 
 import java.net.URL;
+import java.net.InetAddress;
+import java.net.UnknownHostException;
 import java.net.MalformedURLException;
 
 import org.apache.hadoop.io.*;
@@ -41,8 +43,22 @@
       url = new URL(urlString);
     } catch (MalformedURLException e) {
     }
-    int hashCode = (url==null ? urlString : url.getHost()).hashCode();
+    int hashCode;
+
+    if (url == null) {
+      hashCode = urlString.hashCode();
+    } else {
+      String host = url.getHost();
+      try {
+        InetAddress addr = InetAddress.getByName(host);
+        hashCode = addr.hashCode();
+      } catch (UnknownHostException e) {
+        Generator.LOG.info("Couldn't find IP for host: " + host);
+        hashCode = host.hashCode();
+      }
+    }
+
     // make hosts wind up in different partitions on different runs
     hashCode ^= seed;
 
@@ -50,5 +66,3 @@
   }
 
 }
-
-
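
[Editorial sketch, not part of the original message.] The caching idea mentioned above can be illustrated in isolation. The sketch below is not Nutch code: the class name IpPartitionSketch, the in-memory map (standing in for an IP cached in the CrawlDatum), and the final modulo step are all assumptions made for illustration. It also parses with java.net.URI rather than the java.net.URL used in the patch. Like the patch, it hashes the resolved address and falls back to the host name or the raw string when resolution or parsing fails.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch of IP-based URL partitioning with a cache.
class IpPartitionSketch {
  // Assumed cache: host name -> hash key (stand-in for an IP stored in CrawlDatum).
  private final Map<String, Integer> keyByHost = new HashMap<>();
  private final int seed;

  IpPartitionSketch(int seed) {
    this.seed = seed;
  }

  // Returns the hash of the host's IP address, or of the host name if DNS fails.
  private synchronized int hostKey(String host) {
    Integer cached = keyByHost.get(host);
    if (cached != null) {
      return cached;
    }
    int key;
    try {
      key = InetAddress.getByName(host).hashCode();
    } catch (UnknownHostException e) {
      key = host.hashCode(); // fall back to lexical partitioning
    }
    keyByHost.put(host, key);
    return key;
  }

  // Maps a URL string to a partition index in [0, numPartitions).
  int getPartition(String urlString, int numPartitions) {
    String host = null;
    try {
      host = new URI(urlString).getHost();
    } catch (URISyntaxException e) {
      // leave host null; handled below
    }
    int hashCode = (host == null) ? urlString.hashCode() : hostKey(host);
    hashCode ^= seed; // vary the assignment between runs, as in the patch
    return (hashCode & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    IpPartitionSketch p = new IpPartitionSketch(42);
    // IP-literal hosts are used here so the demo needs no DNS lookup.
    System.out.println(p.getPartition("http://66.154.11.25/a", 8));
    System.out.println(p.getPartition("http://66.154.11.25/b", 8));
  }
}
```

With this scheme, two host names that resolve to the same address produce the same key and so land in the same partition, which is the behavior Chris asked about. The cost is one DNS lookup per distinct host, which is why a caching DNS server (or a cache like the map above) matters.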
Re: Summarizer threads in nutch
Jack Tang wrote:
> In the FetchedSegments class, the code below shows how the hit
> summaries are fetched.
>
> public String[] getSummary(HitDetails[] details, Query query)
>     throws IOException {
>   SummaryThread[] threads = new SummaryThread[details.length];
>   for (int i = 0; i < threads.length; i++) {
>     threads[i] = new SummaryThread(details[i], query);
>     threads[i].start();
>   }
>   ..
> }
>
> This means that if there are 1,000,000 hits, then 1,000,000 threads
> will be spawned.

A user interface typically only asks for 10-to-20 summaries at a time.

I do not believe that a thread pool would be substantially faster. Thread spawning is pretty cheap in most JVMs.

Doug
Re: Summarizer threads in nutch
On 2/23/06, Doug Cutting wrote:
> Jack Tang wrote:
>> In the FetchedSegments class, the code below shows how the hit
>> summaries are fetched.
>>
>> public String[] getSummary(HitDetails[] details, Query query)
>>     throws IOException {
>>   SummaryThread[] threads = new SummaryThread[details.length];
>>   for (int i = 0; i < threads.length; i++) {
>>     threads[i] = new SummaryThread(details[i], query);
>>     threads[i].start();
>>   }
>>   ..
>> }
>>
>> This means that if there are 1,000,000 hits, then 1,000,000 threads
>> will be spawned.
>
> A user interface typically only asks for 10-to-20 summaries at a time.

Hi Doug,

Did I miss something?

SummaryThread[] threads = new SummaryThread[details.length];

Here, is details.length the number of hits on one page? I thought it was the total number of hits, right?

/Jack

> I do not believe that a thread pool would be substantially faster.
> Thread spawning is pretty cheap in most JVMs.
>
> Doug

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
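
[Editorial sketch, not part of the original thread.] For comparison with the thread-per-hit loop quoted above, here is roughly what a bounded-pool variant could look like. It is not Nutch code: PooledSummarySketch, its summarize() placeholder, and the maxThreads parameter are hypothetical, and plain strings stand in for HitDetails and Query. It uses java.util.concurrent, which may not match what Nutch targeted at the time. It only shows the shape of the alternative and says nothing about whether it would be faster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: summaries computed on a bounded thread pool.
class PooledSummarySketch {

  // Placeholder for the real per-hit summarization work (an assumption).
  static String summarize(String detail, String query) {
    return query + ": " + detail;
  }

  // Computes one summary per detail using at most maxThreads threads.
  // Assumes maxThreads >= 1. Results keep the order of the input array.
  static String[] getSummary(String[] details, String query, int maxThreads) {
    if (details.length == 0) {
      return new String[0];
    }
    int poolSize = Math.min(maxThreads, details.length);
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String detail : details) {
        Callable<String> task = () -> summarize(detail, query);
        futures.add(pool.submit(task));
      }
      String[] summaries = new String[details.length];
      for (int i = 0; i < summaries.length; i++) {
        summaries[i] = futures.get(i).get(); // blocks until task i finishes
      }
      return summaries;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while summarizing", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("summary task failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    String[] out = getSummary(new String[] {"hit one", "hit two"}, "nutch", 4);
    for (String s : out) {
      System.out.println(s);
    }
  }
}
```

If, as Doug says, callers only request 10-to-20 summaries at a time, the pool never holds more than a handful of threads, so the practical difference from spawning one thread per requested hit is likely small.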