[ http://issues.apache.org/jira/browse/NUTCH-179?page=all ]
Doug Cutting closed NUTCH-179:
--
Resolution: Invalid
Closed at submitter's request.
Proposition: Enable Nutch to use a parser plugin not just based on content type
[ http://issues.apache.org/jira/browse/NUTCH-177?page=all ]
Doug Cutting resolved NUTCH-177:
Fix Version: 0.8-dev
Resolution: Fixed
The problem is that your seed URL does not end in a slash, yet your URL filter
requires a slash. In 0.8-dev
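A minimal illustration of the mismatch, assuming a typical crawl-urlfilter.txt regex entry (the hostname is hypothetical):

```
# crawl-urlfilter.txt (sketch): this pattern requires a trailing slash
+^http://www\.example\.com/

# Seed list:
http://www.example.com       <- rejected: no trailing slash, filter does not match
http://www.example.com/      <- accepted
```

Adding the slash to the seed URL, or relaxing the pattern so the slash is optional, avoids the problem.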
[ http://issues.apache.org/jira/browse/NUTCH-176?page=all ]
Doug Cutting resolved NUTCH-176:
Resolution: Won't Fix
This check is intentionally made to prevent folks from accidentally overwriting
crawls.
Using -dir: creates an error when the
Andrzej Bialecki wrote:
In the 0.7 branch, whenever a segment was generated the WebDB was
modified, so that the entries that ended up in the fetchlist wouldn't be
immediately available to the next segment generation, if that happened
before the WebDB was updated with the data from that first
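The 0.7-branch behaviour Andrzej describes can be sketched roughly as follows. This is an illustrative model, not Nutch's actual classes or API: generating a fetchlist marks the selected entries in the WebDB so a second generate run skips them, and a later updatedb clears the marks.

```java
// Illustrative sketch only (names are hypothetical, not Nutch's real API).
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GenerateSketch {
    // Stands in for the per-entry mark persisted in the WebDB in 0.7.
    static Set<String> generatedMarks = new HashSet<>();

    // Select up to topN due URLs, skipping ones already in an unfetched segment.
    static List<String> generate(List<String> dueUrls, int topN) {
        List<String> fetchlist = new ArrayList<>();
        for (String url : dueUrls) {
            if (fetchlist.size() >= topN) break;
            if (generatedMarks.contains(url)) continue; // already handed out
            fetchlist.add(url);
            generatedMarks.add(url); // mark so the next generate ignores it
        }
        return fetchlist;
    }

    // updatedb with the fetched data clears the marks.
    static void updateDb(List<String> fetchedUrls) {
        generatedMarks.removeAll(fetchedUrls);
    }
}
```

Two back-to-back generate calls therefore produce disjoint fetchlists, even before updatedb runs.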
[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363308 ]
Doug Cutting commented on NUTCH-136:
The mapred-default.xml file is actually the best place to set these.
mapreduce segment generator generates 50% less than expected
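For reference, a sketch of a mapred-default.xml override. The thread does not show which properties are meant, so the task-count settings below are only an assumed example:

```xml
<!-- mapred-default.xml (sketch; property choice is an assumption) -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
</configuration>
```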
[ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12363309 ]
Doug Cutting commented on NUTCH-173:
Couldn't you instead use a prefix-urlfilter generated from your crawl seed?
PerHost Crawling Policy ( crawl.ignore.external.links )
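Doug's suggestion of a prefix-urlfilter derived from the crawl seed could be sketched like this. The class and method names are hypothetical; the idea is simply to turn each seed URL into a scheme://host/ prefix line suitable for a prefix filter file:

```java
// Hypothetical sketch: derive prefix filter entries from a seed list so the
// crawl stays within the seed hosts. Names here are illustrative, not Nutch's.
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Set;

public class SeedPrefixes {
    // Turn each seed URL into a prefix covering its host.
    static Set<String> toPrefixes(String[] seeds) throws Exception {
        Set<String> prefixes = new LinkedHashSet<>();
        for (String seed : seeds) {
            URL u = new URL(seed);
            prefixes.add(u.getProtocol() + "://" + u.getHost() + "/");
        }
        return prefixes;
    }

    public static void main(String[] args) throws Exception {
        String[] seeds = {
            "http://www.example.com/start.html",
            "http://intranet.example.org/"
        };
        // Each printed line could go into a prefix filter configuration file.
        for (String p : toPrefixes(seeds)) {
            System.out.println(p);
        }
    }
}
```

Duplicate hosts collapse to a single prefix, so even a large seed list yields one filter line per host.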
Hi,
I used nutch-0.7.1 to index an intranet. It is a really great tool,
thanks for developing it! I had to hack something quickly for
authentication (somehow I couldn't get the crawler to accept
http.auth.basic.user etc.). I also found an issue where parsing an HTML
page returned an error Content
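For context, the kind of nutch-site.xml settings being attempted might look like the sketch below. Only http.auth.basic.user appears in the message above; the password property name is assumed by analogy and may differ:

```xml
<!-- nutch-site.xml (sketch; the password property name is an assumption) -->
<configuration>
  <property>
    <name>http.auth.basic.user</name>
    <value>crawler</value>
  </property>
  <property>
    <name>http.auth.basic.password</name>
    <value>secret</value>
  </property>
</configuration>
```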
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]
Matt Kangas updated NUTCH-87:
-
Version: 0.7.2-dev
0.8-dev
Efficient site-specific crawling for a large number of sites
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]
Matt Kangas updated NUTCH-87:
-
Attachment: build.xml.patch-0.8
The previous patch file is valid for 0.7. Here is one that works for 0.8-dev
(trunk).
(It's three separate one-line additions, to
Log when db.max configuration limits reached
Key: NUTCH-182
URL: http://issues.apache.org/jira/browse/NUTCH-182
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Matt Kangas
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363352 ]
Chris A. Mattmann commented on NUTCH-139:
-
Hi Jerome,
org.apache.nutch.parse.ParseData
* The constructor becomes ParseData(ParseStatus, String, Outlink[],
[ http://issues.apache.org/jira/browse/NUTCH-182?page=all ]
Matt Kangas updated NUTCH-182:
--
Attachment: ParseData.java.patch
LinkDb.java.patch
Two patches are attached for nutch/trunk (0.8-dev).
LinkDb.java.patch adds two new LOG.info()
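As an illustration of the kind of logging such a change adds (a sketch with hypothetical names; the actual changes are in the attached patch files), a db.max-style limit check might log when it truncates data:

```java
// Hypothetical sketch of logging when a db.max.* limit truncates data.
// Uses java.util.logging for self-containment; Nutch itself uses a
// different logging facade.
import java.util.logging.Logger;

public class OutlinkLimiter {
    static final Logger LOG = Logger.getLogger("OutlinkLimiter");

    // Keep at most maxOutlinksPerPage outlinks; a negative limit means "no limit".
    static String[] limitOutlinks(String[] outlinks, int maxOutlinksPerPage) {
        if (maxOutlinksPerPage >= 0 && outlinks.length > maxOutlinksPerPage) {
            LOG.info("Truncating " + outlinks.length
                     + " outlinks to db.max.outlinks.per.page=" + maxOutlinksPerPage);
            String[] kept = new String[maxOutlinksPerPage];
            System.arraycopy(outlinks, 0, kept, 0, maxOutlinksPerPage);
            return kept;
        }
        return outlinks;
    }
}
```

Logging at the truncation point makes silently dropped links visible in the fetch logs, which is the improvement NUTCH-182 asks for.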