A related issue is that these two plugins replicate a lot of code. At
some point we should try to fix that. See:
http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
I have begun working on this. Nobody else? Can I go on?
Jérôme
--
http://motrech.free.fr/
Jérôme Charron wrote:
A related issue is that these two plugins replicate a lot of code. At
some point we should try to fix that. See:
http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
I have begun working on this. Nobody else? Can I go on?
I started seeing this problem recently: topN=20 per crawl, but
fetched pages = 15 - 17, while error pages = 2000 - 5000. 25000
pages are missing. This is reproducible with Nutch 0.7.1, with both
protocol-http and protocol-httpclient included.
I also see lots of Response content
Andrzej Bialecki wrote:
Hmm... I'm not saying it's flawless, there were surely some mysterious
things going on with it. That large crawl you mention, was it with the
(recently updated in Nutch) release 3.0? What were the issues?
No, it was in early December, with the previous version. I
Stefan Groschupf wrote:
Anyway, today we noticed that when fetching with protocol-httpclient the
sum of errors and fetched pages is much less than the size defined when
generating the segment.
Changing to protocol-http solves the problem.
Has anyone else noticed this behavior?
I haven't, but this
OK I will do that tomorrow!
However, if it is known to be buggy, maybe we should not ship it as the
default HTTP protocol plugin, as it is today.
Newbies checking out Nutch will use the version that does not fetch
all pages, since most people start with the standard configuration.
On 19.12.2005
The same problem on FreeBSD 6.0 + jdk1.4.2
I think it was also reported some time ago by Rod Taylor.
Switch to protocol-http.
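
For readers who want to try the suggested switch, a minimal sketch of the override follows. It assumes the active protocol plugin is selected through the plugin.includes property and that local overrides live in conf/nutch-site.xml. The other plugin ids in the value are illustrative only; copy the real value from your own nutch-default.xml and change just the protocol entry.

```xml
<!-- Goes inside the root element of conf/nutch-site.xml.
     Sketch only, not a tested configuration. -->
<property>
  <name>plugin.includes</name>
  <!-- Assumption: take the value from your own nutch-default.xml and
       replace protocol-httpclient with protocol-http. The remaining
       plugin ids below are placeholders for whatever you already use. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site)</value>
</property>
```

After changing the value, generate and fetch a fresh segment and compare the fetched and error counts against the generated size to see whether the gap closes.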
SG Hi there,
SG is there someone out there that can confirm a problem we discovered?
SG We were wondering why not all pages of a generated segment were
SG fetched.
Stefan Groschupf wrote:
OK I will do that tomorrow!
However, if it is known to be buggy, maybe we should not ship it as the
default HTTP protocol plugin, as it is today.
Newbies checking out Nutch will use the version that does not fetch
all pages, since most people start with the standard