[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-31 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413996 ]

Andrzej Bialecki commented on NUTCH-289:

Re: lookup in ParseOutputFormat: I respectfully disagree. Consider the scenario 
where you run the Fetcher in non-parsing mode. This means that you have to make 
two DNS lookups - once when fetching, and a second time when parsing. These 
lookups are executed from different processes, so there is no benefit from 
caching inside the Java resolver, i.e. the DNS server has to be queried twice. 
The solution I proposed (record IPs in the Fetcher, but somewhere other than 
ParseOutputFormat, e.g. in the crawl_fetch CrawlDatum) avoids this problem.
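
As an illustration of that proposal, here is a minimal sketch of recording the 
address at fetch time. It is plain Java, with the metadata represented as a 
simple string map and a made-up key name ("_ip_"), since the exact CrawlDatum 
metadata API is precisely what is being discussed further down; it is not the 
actual Fetcher code.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;

public class IpRecorder {
  /** Hypothetical metadata key; not an existing Nutch constant. */
  public static final String IP_KEY = "_ip_";

  /** Resolve the host once at fetch time and record its address in the
   *  datum's metadata, so a later parse pass in a separate process never
   *  has to hit DNS again for this URL. */
  public static void recordIp(String host, Map<String, String> metaData) {
    try {
      InetAddress addr = InetAddress.getByName(host);  // the single DNS call
      metaData.put(IP_KEY, addr.getHostAddress());
    } catch (UnknownHostException e) {
      // Unresolved host: record nothing; downstream falls back to hostname.
    }
  }
}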

Another issue is virtual hosting, i.e. many sites resolving to a single IP (web 
hotels). It's true that in many cases these are spam sites, but as often as not 
they are real, legitimate sites. If we generate/fetch by IP address, we run the 
risk of dropping legitimate sites.

Regarding the timing: it's true that during the first run we won't have IPs 
during generate (and subsequently for any newly injected URLs). In fact, since 
a significant part of the crawlDB is usually unfetched, we won't have this 
information for many URLs - unless we run this step in the Generator to resolve 
ALL hosts, and then run an equivalent of updatedb to actually record them in 
the crawldb.

And the last issue that needs to be discussed: should we use metadata, or add a 
dedicated field to CrawlDatum? If the core is to rely on IP addresses, we 
should add a dedicated field. If it is to be purely optional (e.g. for use by 
optional plugins), then metadata seems the better place.
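
For illustration, roughly what the two options would look like to calling code; 
neither accessor below is existing Nutch API, both are invented sketches to 
make the trade-off concrete.

import java.util.Map;

public class IpAccessSketch {

  // Option 1: optional metadata entry. Plugins that care can read it, and the
  // core and existing segments are unaffected when it is absent.
  public static String ipFromMetadata(Map<String, String> metaData) {
    return metaData.get("_ip_");            // hypothetical key, may be null
  }

  // Option 2: dedicated CrawlDatum field. The core could rely on it being
  // present, but the on-disk format (and version number) would have to change.
  public interface CrawlDatumWithIp {
    String getIpAddress();                  // hypothetical accessor
  }
}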

 CrawlDatum should store IP address
 --

  Key: NUTCH-289
  URL: http://issues.apache.org/jira/browse/NUTCH-289
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Doug Cutting


 If the CrawlDatum stored the IP address of the host of its URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.
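
For illustration, both uses above reduce to keying fetchlist decisions on the 
resolved address. A rough sketch follows, assuming each fetchlist entry already 
carries its IP; the class and method names are invented and are not Nutch's 
real generator/partitioner code.

import java.util.Map;

public class IpPartitionSketch {

  /** Partition by IP instead of hostname, so one fetcher task owns all URLs
   *  that resolve to the same physical server. */
  public static int getPartition(String ip, int numPartitions) {
    return (ip.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  /** Per-IP truncation: admit at most maxPerIp URLs for any single address. */
  public static boolean admit(String ip, Map<String, Integer> counts, int maxPerIp) {
    int seen = counts.getOrDefault(ip, 0);
    if (seen >= maxPerIp) {
      return false;               // drop: this server already has its quota
    }
    counts.put(ip, seen + 1);
    return true;
  }
}

Partitioning this way keeps all URLs of one physical server in a single fetch 
task, so politeness and truncation limits apply to the machine rather than to 
each hostname separately.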




[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-31 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414114 ]

Doug Cutting commented on NUTCH-289:


It should be possible to partition by IP and to limit fetchlists by IP. 
Resolving only in the fetcher is too late to implement these features. Ideally 
we should arrange things for good DNS cache utilization, so that URLs with the 
same host are resolved in a single map or reduce task. Currently this is the 
case during fetchlist generation, where lists are partitioned by host. Might 
that be a good place to insert DNS resolution? The fetchlists would need to be 
processed one more time, to re-partition and re-limit by IP, but fetchlists are 
relatively small, so this should not slow things down too much. The map task 
itself could directly cache IP addresses, and perhaps even avoid many DNS 
lookups by reusing the IP from another CrawlDatum with the same host. A 
multi-threaded mapper might also be used to hide network latencies.
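
A rough sketch of that per-task caching idea, assuming all URLs of a host reach 
the same task (as they do when lists are partitioned by host); the class is 
invented for illustration and is not existing Nutch code.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

public class HostIpCache {
  private final Map<String, String> cache = new HashMap<String, String>();

  /** Return the IP for host, resolving it at most once per task lifetime. */
  public String resolve(String host) {
    String ip = cache.get(host);
    if (ip == null && !cache.containsKey(host)) {
      try {
        ip = InetAddress.getByName(host).getHostAddress();
      } catch (UnknownHostException e) {
        ip = null;                 // remember failures too, to avoid retries
      }
      cache.put(host, ip);         // cache both successes and failures
    }
    return ip;
  }
}

In a multi-threaded mapper the cache would need synchronization (or a 
concurrent map), but the principle is the same: at most one DNS round trip per 
host per task.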

This should, at least initially, be an optional feature, and thus the IP should 
probably be stored in the metadata. I think it could be added as a re-generate 
step without changing any other code.

