Stefan Groschupf wrote:
Hi,
I have some code using a queue-based mechanism and Java NIO.
In my tests it is 4 times faster than the existing fetcher.

But:
+ I need to fix some more bugs
+ we need to refactor the robots.txt part, since it is not yet usable outside the HTTP protocol.

IMO, the politeness code should also be taken out of the http plugin
and made protocol independent.
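
To make the idea concrete, here is a minimal sketch of politeness that knows nothing about the protocol: it only tracks, per host, the earliest time the next request may go out. The class name PolitenessTracker, the method tryAcquire, and the fixed per-host delay are assumptions for illustration only, not existing Nutch code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: per-host crawl delay, independent of the fetch protocol. */
class PolitenessTracker {
    private final long delayMillis;
    // host name -> earliest time (in ms) at which the host may be contacted again
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessTracker(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Returns true and reserves the host if it may be contacted at nowMillis;
     * returns false if the host is still inside its delay window.
     */
    synchronized boolean tryAcquire(String host, long nowMillis) {
        String key = host.toLowerCase(Locale.ROOT); // host names are case-insensitive
        Long next = nextAllowed.get(key);
        if (next != null && nowMillis < next) {
            return false;
        }
        nextAllowed.put(key, nowMillis + delayMillis);
        return true;
    }
}
```

Any protocol plugin (http, ftp, ...) could then ask the tracker before contacting a host, instead of each one carrying its own delay logic.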

+ the fetcher does not support pluggable protocols, only HTTP.

I see two ways to go.
Refactor the existing robots.txt parser and handling, but this is a big change.

We should do the refactoring, because the current fetcher would also benefit greatly if we could schedule the fetching of robots.txt before we try to get the content itself. E.g. fetch robots.txt for the first 100 sites, and after that start fetching content along with the not-yet-seen robots.txt files for sites still on the queue (just an example).
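
A rough sketch of that ordering, just to illustrate: robots.txt tasks for the first N distinct sites go to the front of the plan, and every later site gets its robots.txt task directly before its first content URL. The class name RobotsFirstScheduler, the "ROBOTS"/"FETCH" task strings, and the prefetch parameter are made up for this example and are not part of the existing fetcher.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical planner: orders robots.txt fetches ahead of content fetches. */
class RobotsFirstScheduler {

    /**
     * Builds a fetch plan from absolute URLs. The first `prefetch` distinct
     * sites get their robots.txt task up front; later sites get theirs
     * immediately before their first content URL.
     */
    static List<String> schedule(List<String> urls, int prefetch) {
        Set<String> seenSites = new HashSet<>();
        List<String> robotsFirst = new ArrayList<>();
        List<String> rest = new ArrayList<>();
        for (String url : urls) {
            URI uri = URI.create(url);
            String site = uri.getScheme() + "://" + uri.getAuthority();
            if (seenSites.add(site)) { // first time we see this site
                String task = "ROBOTS " + site + "/robots.txt";
                if (seenSites.size() <= prefetch) {
                    robotsFirst.add(task);
                } else {
                    rest.add(task);
                }
            }
            rest.add("FETCH " + url);
        }
        List<String> plan = new ArrayList<>(robotsFirst);
        plan.addAll(rest);
        return plan;
    }
}
```

With prefetch = 2 and URLs from a.org, b.org, a.org, c.org (in that order), the plan starts with the robots.txt tasks for a.org and b.org, while the robots.txt task for c.org sits right before c.org's first content URL.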

Or, as I would maybe prefer, reimplement robots.txt parsing and handling; this would require some more time from me.

In general we should move this discussion to nutch-dev, since there are more side effects we should discuss.

Now we have it here.

The new fetcher should be an alternative and we should not just remove the old fetcher.

+1

--
 Sami Siren
