[jira] Created: (NUTCH-384) When using the file protocol one cannot map a parse plugin to a content type. The only way to get the plugin called is through the default plugin. The issue is that the content type never gets mapped.

2006-10-11 Thread Paul Ramirez (JIRA)
When using the file protocol one cannot map a parse plugin to a content type. 
The only way to get the plugin called is through the default plugin. The issue 
is that the content type never gets mapped.
-

 Key: NUTCH-384
 URL: http://issues.apache.org/jira/browse/NUTCH-384
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.1, 0.8
 Environment: All
Reporter: Paul Ramirez


Currently the content type does not get set by the file protocol.
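The failure mode described here can be illustrated with a small sketch of a content-type-to-parser lookup. The class and method names below are invented for illustration and are not the actual Nutch parse-plugins API; the point is only that a lookup keyed on content type necessarily falls through to the default parser when the protocol layer never sets a content type.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: parse-plugin selection keyed by content type.
class ParserRegistry {
    private final Map<String, String> parsersByContentType = new HashMap<>();
    private final String defaultParser = "parse-default";

    void register(String contentType, String parserId) {
        parsersByContentType.put(contentType, parserId);
    }

    String select(String contentType) {
        if (contentType == null) {
            // A protocol that never sets the content type (the behavior this
            // issue reports for protocol-file) always lands on the default
            // parser, no matter what mappings were configured.
            return defaultParser;
        }
        return parsersByContentType.getOrDefault(contentType, defaultParser);
    }
}
```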

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-384) Protocol-file plugin does not allow the parse plugins framework to operate properly

2006-10-11 Thread Chris A. Mattmann (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-384?page=all ]

Chris A. Mattmann updated NUTCH-384:


Summary: Protocol-file plugin does not allow the parse plugins 
framework to operate properly  (was: When using the file protocol one can not 
map a parse plugin to a content type. The only way to get the plugin called is 
through the default plugin. The issue is that the content type never gets 
mapped.)
Description: When using the file protocol one cannot map a parse plugin to 
a content type. The only way to get the plugin called is through the default 
plugin. The issue is that the content type never gets mapped. Currently the 
content type does not get set by the file protocol.  (was: Currently the 
content type does not get set by the file protocol.)

Moved title into description and made shorter title.






[jira] Created: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-11 Thread Chris Schneider (JIRA)
Server delay feature conflicts with maxThreadsPerHost
-

 Key: NUTCH-385
 URL: http://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Chris Schneider


For some time I've been puzzled by the interaction between two parameters that 
control how often the fetcher can access a particular host:

1) The server delay, which comes back from the remote server during our 
processing of the robots.txt file, and which can be limited by 
fetcher.max.crawl.delay.

2) The fetcher.threads.per.host value, particularly when this is greater than 
the default of 1.

According to my (limited) understanding of the code in HttpBase.java:

Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
continuously. In other words, it never tries to point a third thread at the host, and it 
always points a second thread at the host before the first thread finishes 
accessing it. Since HttpBase.unblockAddr never gets called with 
(((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. 
Thus, the server delay will never be used at all. The fetcher will be 
continuously retrieving pages from the host, often with 2 fetchers accessing 
the host simultaneously.

Suppose instead that the fetcher finally does allow the last thread to complete 
before it gets around to pointing another thread at the target host. When the 
last fetcher thread calls HttpBase.unblockAddr, it will now put 
System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. 
This, in turn, will prevent any threads from accessing this host until the 
delay is complete, even though zero threads are currently accessing the host.

I see this behavior as inconsistent. More importantly, the current 
implementation certainly doesn't seem to answer my original question about 
appropriate definitions for what appear to be conflicting parameters. 

In a nutshell, how could we possibly honor the server delay if we allow more 
than one fetcher thread to simultaneously access the host?

It would be one thing if whenever (fetcher.threads.per.host > 1), this trumped 
the server delay, causing the latter to be ignored completely. That is 
certainly not the case in the current implementation, as it will wait for the 
server delay whenever the number of threads accessing a given host drops to 
zero.
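The two scenarios above can be condensed into a simplified sketch of the bookkeeping being described. The names mirror THREADS_PER_HOST_COUNT, BLOCKED_ADDR_TO_TIME, and unblockAddr from the report, but this is an illustration under those assumptions, not the actual HttpBase source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-host bookkeeping described in this issue.
class HostBlocker {
    private final Map<String, Integer> threadsPerHostCount = new HashMap<>();
    private final Map<String, Long> blockedAddrToTime = new HashMap<>();
    private final long crawlDelay;

    HostBlocker(long crawlDelayMillis) { this.crawlDelay = crawlDelayMillis; }

    synchronized void blockAddr(String host) {
        threadsPerHostCount.merge(host, 1, Integer::sum);
    }

    synchronized void unblockAddr(String host) {
        int count = threadsPerHostCount.get(host);
        if (count == 1) {
            // The crawl delay is recorded only when the LAST thread for the
            // host finishes -- the inconsistency this issue describes: with 2
            // overlapping threads this branch is never reached, so the delay
            // is never applied at all.
            threadsPerHostCount.remove(host);
            blockedAddrToTime.put(host, System.currentTimeMillis() + crawlDelay);
        } else {
            threadsPerHostCount.put(host, count - 1);
        }
    }

    synchronized boolean isBlocked(String host) {
        Long until = blockedAddrToTime.get(host);
        return until != null && System.currentTimeMillis() < until;
    }
}
```

With two threads on one host, the first unblockAddr leaves no delay behind; only the second (count dropping to zero) records one, which then blocks the host even though no thread is using it.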






[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-11 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441528 ] 

Chris Schneider commented on NUTCH-385:
---

This comment was actually made by Andrzej in response to an email containing 
the analysis above that I sent him before creating this JIRA issue:

Let's start by defining the desired semantics of these two parameters 
together. In my opinion it's the following:

* if only 1 thread per host is allowed, at any given moment at most one thread 
should be accessing the host, and the interval between consecutive requests 
should be at least crawlDelay (whichever way we determine this value - from 
config, from robots.txt or external sources such as partner agreements).

* if two or more (for example N) threads per host are allowed, at any given 
moment at most N threads should be accessing the host, and the interval between 
consecutive requests should be at least crawlDelay - that is, the interval 
between when one of the threads finishes, and another starts requesting.

I.e.: for threads.per.host=2 and crawlDelay=3 seconds, if we start 3 threads 
trying to access the same host we should get something like this (time in [s] 
on the x axis, # - start request, + - request in progress, b - blocked in 
per-host limit, c - obeying crawlDelay):

===0 1 2
===01234567890123456789012345678
1: #+++cccbbccc#cccbb#++
2: #cccbcccbcc#+++cb
3: ccc#+ccc#+ccc#+++

As you can see, at any given time we have at most 2 threads accessing the site, 
and the interval between consecutive requests is at least 3 seconds. Especially 
interesting in the above graph is the period between 17-18 seconds - thread 2 
had to be delayed additional 2 seconds to satisfy the crawl delay requirement, 
even though the threads.per.host requirement was satisfied.

[snip]

It's a question of priorities - in the model I drafted above the topmost 
priority is the observance of crawlDelay, sometimes at the cost of the number 
of concurrent threads (see seconds 17-18). In this model, the code should 
always put the delay in BLOCKED_ADDR_TO_TIME, in order to wait at least 
crawlDelay after _any_ thread finishes. We could use an alternative model, 
where crawlDelay is measured from the start of the request, and not from the 
end - see the graph below:

===0 1 2 3
===01234567890123456789012345678901234567
1: #+++cbbb##++cc#+++
2: ccc#cc#+++c#c#
3: cc#+ccc#+ccc#+ccbb

but it seems to me that it's more complicated, gives fewer requests/sec, and the 
interpretation of crawlDelay's meaning is stretched ...

[snip]
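The first model drafted above (always record the delay when any request ends, while still capping concurrency) can be sketched as a small gate object. The class and method names are hypothetical, not Nutch code; the key difference from the current behavior is that the delay is recorded after every request, not only when the active count drops to zero.

```java
// Hypothetical sketch of the "delay after any thread finishes" model:
// at most maxThreads concurrent requests per host, and no new request may
// start until crawlDelay has elapsed since the END of the most recent one.
class HostGate {
    private final int maxThreads;
    private final long crawlDelay;
    private int active = 0;
    private long earliestNextStart = 0;  // updated whenever a request ends

    HostGate(int maxThreads, long crawlDelay) {
        this.maxThreads = maxThreads;
        this.crawlDelay = crawlDelay;
    }

    synchronized boolean tryStart(long now) {
        if (active >= maxThreads || now < earliestNextStart) return false;
        active++;
        return true;
    }

    synchronized void finish(long now) {
        active--;
        // Record the delay after EVERY request, giving crawlDelay priority
        // over concurrency (as in seconds 17-18 of the graph above).
        earliestNextStart = now + crawlDelay;
    }
}
```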


[jira] Commented: (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2006-10-11 Thread Chris Schneider (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12441529 ] 

Chris Schneider commented on NUTCH-385:
---

This comment was actually made by Ken Krugler, who was responding to Andrzej's 
comment above:

[with respect to Andrzej's definitions at the beginning of his comment - Ed.:]
I agree that this is one of two possible interpretations. The other is that 
there are N virtual users, and the crawlDelay applies to each of these 
virtual users in isolation.

Using the same type of request data from above, I see a queue of requests with 
the following durations (in seconds):

4, 9, 6, 5, 6, 4, 7, 4

So with the virtual user model (where N = 2, thus A and B users), I get:

===0 1 2
===01234567890123456789012345678
A: 4+++ccc6+ccc6+ccc7++
B: 9ccc5ccc4+++ccc4+++

The numbers mark the start of each new request and give its total duration.

This would seem to be less efficient than your approach, but somehow feels more 
in the nature of what threads.per.host really means.

Let's see, for N = 3 this would look like:

===0 1 2
===01234567890123456789012345678
A: 4+++ccc5ccc7++ccc
B: 9ccc4+++ccc
C: 6+ccc6+ccc4+++ccc

[snip]

To implement the virtual users model, each unique domain being actively fetched 
from would need N pieces of state, each tracking the completion time of the 
corresponding virtual user's last request.

Anyway, just an alternative interpretation...
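The virtual-user model can be sketched as a small scheduler over the request durations quoted above (4, 9, 6, 5, 6, 4, 7, 4 seconds, N = 2, crawlDelay = 3). The assignment policy here (each request goes to whichever virtual user becomes available earliest) is an assumption, so the resulting start times illustrate the model rather than reproduce the exact graph.

```java
// Hypothetical sketch of the "N virtual users" model: each virtual user
// fetches serially and waits crawlDelay between its own consecutive requests.
class VirtualUserSchedule {
    // Returns the start time of each request, assigning each request to the
    // virtual user that becomes available earliest.
    static long[] schedule(long[] durations, int n, long crawlDelay) {
        long[] availableAt = new long[n];   // per-user next allowed start time
        long[] starts = new long[durations.length];
        for (int i = 0; i < durations.length; i++) {
            int user = 0;
            for (int u = 1; u < n; u++)
                if (availableAt[u] < availableAt[user]) user = u;
            starts[i] = availableAt[user];
            availableAt[user] = starts[i] + durations[i] + crawlDelay;
        }
        return starts;
    }
}
```

For the example queue this yields start times 0, 0, 7, 12, 16, 20, 25, 27: at most N requests in flight, and each virtual user observes crawlDelay between its own requests, though the gap between requests from different users can be shorter.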


