date:20060222

[jira] Closed: (NUTCH-188) Add searchable mailing list links to http://lucene.apache.org/nutch/mailing_lists.html

2006-02-22 Thread Jerome Charron (JIRA)

 [ http://issues.apache.org/jira/browse/NUTCH-188?page=all ]
 
Jerome Charron closed NUTCH-188:


Fix Version: 0.8-dev
 Resolution: Fixed

Duplicate of NUTCH-214

 Add searchable mailing list links to 
 http://lucene.apache.org/nutch/mailing_lists.html
 --

  Key: NUTCH-188
  URL: http://issues.apache.org/jira/browse/NUTCH-188
  Project: Nutch
 Type: Improvement
 Reporter: Andy Liu
 Priority: Trivial
  Fix For: 0.8-dev
  Attachments: mailing_list.patch

 Post links to searchable mail archives on nutch.org 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Re: URL Partitioning (Lexical vs. IP Address)

2006-02-22 Thread Doug Cutting


Chris Schneider wrote:
My experience recently seeing attempted fetches of many ingrida.be URLs 
made me question the Nutch 0.8 algorithm for partitioning URLs among 
TaskTrackers (and their children processes). As I understand it, Nutch 
doesn't worry about two lexically distinct domains (e.g., 
inherit-the-wind.ingrida.be and clancy-brown.ingrida.be) being fetched 
simultaneously, even though they might actually resolve to the same IP 
address (66.154.11.25 in this case).


That is correct: Nutch 0.8 currently treats each lexically distinct 
domain as a separate domain.  IP-based partitioning is possible: one 
would merely need to change PartitionUrlByHost.java to hash the IP of 
the host.  If the performance of this is too slow, we could cache the IP 
address in the CrawlDatum, which is available when we are performing 
this partitioning.  But probably one should run a caching DNS server 
when fetching anyway, so hopefully that would not be required.


I've attached a patch.  Tell me if it works and if it noticeably slows 
fetching for you.


Doug
Index: src/java/org/apache/nutch/crawl/PartitionUrlByHost.java
===
--- src/java/org/apache/nutch/crawl/PartitionUrlByHost.java	(revision 379848)
+++ src/java/org/apache/nutch/crawl/PartitionUrlByHost.java	(working copy)
@@ -17,6 +17,8 @@
 package org.apache.nutch.crawl;
 
 import java.net.URL;
+import java.net.InetAddress;
+import java.net.UnknownHostException;
 import java.net.MalformedURLException;
 
 import org.apache.hadoop.io.*;
@@ -41,8 +43,22 @@
   url = new URL(urlString);
 } catch (MalformedURLException e) {
 }
-int hashCode = (url==null ? urlString : url.getHost()).hashCode();
 
+int hashCode;
+
+if (url == null) {
+  hashCode = urlString.hashCode();
+} else {
+  String host = url.getHost();
+  try {
+    InetAddress addr = InetAddress.getByName(host);
+    hashCode = addr.hashCode();
+  } catch (UnknownHostException e) {
+    Generator.LOG.info("Couldn't find IP for host: " + host);
+    hashCode = host.hashCode();
+  }
+}
+
 // make hosts wind up in different partitions on different runs
 hashCode ^= seed;
 
@@ -50,5 +66,3 @@
   }
   
 }
-
-
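
The caching idea mentioned above can be sketched roughly as follows. This is a minimal illustration, not Nutch code: the class name IpPartitionSketch, the in-memory map (standing in for an IP stored in the CrawlDatum), and the injectable resolver are all assumptions made for the example. Unlike the patch, it hashes the textual IP address rather than the InetAddress object.

```java
import java.net.InetAddress;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: partition URLs by resolved IP, caching each lookup.
class IpPartitionSketch {
  // Assumed in-memory cache (host name -> key to hash); not part of Nutch.
  private final Map<String, String> cache = new ConcurrentHashMap<>();
  private final Function<String, String> resolver;
  private final int seed;

  IpPartitionSketch(Function<String, String> resolver, int seed) {
    this.resolver = resolver;
    this.seed = seed;
  }

  // Real DNS lookup; falls back to the host name when resolution fails.
  static String dnsLookup(String host) {
    try {
      return InetAddress.getByName(host).getHostAddress();
    } catch (UnknownHostException e) {
      return host;
    }
  }

  int getPartition(String urlString, int numPartitions) {
    String key;
    try {
      String host = new URL(urlString).getHost();
      // Resolve each host at most once; later URLs reuse the cached value.
      key = cache.computeIfAbsent(host, resolver);
    } catch (MalformedURLException e) {
      key = urlString; // same fallback as the patch: hash the raw string
    }
    int hashCode = key.hashCode() ^ seed;
    // Mask the sign bit so the partition index is never negative.
    return (hashCode & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    // Stub resolver so the demo runs without network access.
    IpPartitionSketch p = new IpPartitionSketch(h -> "66.154.11.25", 42);
    System.out.println(p.getPartition("http://inherit-the-wind.ingrida.be/", 16));
    System.out.println(p.getPartition("http://clancy-brown.ingrida.be/", 16));
  }
}
```

With the stub resolver both host names map to the same address, so both URLs land in the same partition. Passing IpPartitionSketch::dnsLookup as the resolver would use real DNS instead.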

Re: Summarier threads in nutch

2006-02-22 Thread Doug Cutting


Jack Tang wrote:

In the FetchedSegments class, the code below shows how the hit summaries are obtained.

  public String[] getSummary(HitDetails[] details, Query query)
    throws IOException {
    SummaryThread[] threads = new SummaryThread[details.length];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new SummaryThread(details[i], query);
      threads[i].start();
    }
    ..
  }

This means that if there are 1,000,000 hits, 1,000,000 threads
will be spawned.


A user interface typically only asks for 10-to-20 summaries at a time. 
I do not believe that a thread pool would be substantially faster. 
Thread spawning is pretty cheap in most JVMs.


Doug
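
For comparison, the thread-pool alternative raised in this thread might look like the sketch below. It is an illustration only, not Nutch code: PooledSummarySketch, summarize, and maxThreads are hypothetical names, and plain strings stand in for HitDetails and Query. It caps the number of threads regardless of details.length. It makes no claim about being faster than one thread per summary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: compute summaries with a bounded pool of threads.
class PooledSummarySketch {
  // Stand-in for the real per-hit summarizer (an assumption for this example).
  static String summarize(String detail, String query) {
    return detail + " [" + query + "]";
  }

  static String[] getSummary(String[] details, String query, int maxThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
    try {
      // Submit one task per hit; at most maxThreads run at the same time.
      List<Future<String>> futures = new ArrayList<>();
      for (String detail : details) {
        Callable<String> task = () -> summarize(detail, query);
        futures.add(pool.submit(task));
      }
      // Collect results in submission order so summaries line up with hits.
      String[] summaries = new String[details.length];
      for (int i = 0; i < summaries.length; i++) {
        try {
          summaries[i] = futures.get(i).get();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        } catch (ExecutionException e) {
          throw new IllegalStateException(e.getCause());
        }
      }
      return summaries;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    String[] page = {"hit-1", "hit-2", "hit-3"};
    for (String s : getSummary(page, "nutch", 2)) {
      System.out.println(s);
    }
  }
}
```

If callers only ever pass one page of 10 to 20 hits, as described above, the cap rarely matters; it only changes behavior when details.length is large.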

Re: Summarier threads in nutch

2006-02-22 Thread Jack Tang

On 2/23/06, Doug Cutting [EMAIL PROTECTED] wrote:
> Jack Tang wrote:
> > In the FetchedSegments class, the code below shows how the hit summaries are obtained.
> >
> >   public String[] getSummary(HitDetails[] details, Query query)
> >     throws IOException {
> >     SummaryThread[] threads = new SummaryThread[details.length];
> >     for (int i = 0; i < threads.length; i++) {
> >       threads[i] = new SummaryThread(details[i], query);
> >       threads[i].start();
> >     }
> >     ..
> >   }
> >
> > This means that if there are 1,000,000 hits, 1,000,000 threads
> > will be spawned.
>
> A user interface typically only asks for 10-to-20 summaries at a time.

Hi Doug,
Did I miss something?

SummaryThread[] threads = new SummaryThread[details.length];
Is details.length here the number of hits on a single page?
I thought it would be the total number of hits, right?

/Jack

> I do not believe that a thread pool would be substantially faster.
> Thread spawning is pretty cheap in most JVMs.
>
> Doug



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
