[jira] Closed: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-179?page=all ] Doug Cutting closed NUTCH-179: Resolution: Invalid. Closed at submitter's request. Proposition: Enable Nutch to use a parser plugin not just based on content type

[jira] Resolved: (NUTCH-177) Default installation seems to produce working entity of nutch

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-177?page=all ] Doug Cutting resolved NUTCH-177: Fix Version: 0.8-dev. Resolution: Fixed. The problem is that your seed url does not end in a slash, yet your url filter requires a slash. In 0.8-dev
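The mismatch described in this excerpt (a seed URL without a trailing slash checked against a filter rule that requires one) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the host `example.com`, the accept rule, and the helper `accepts` are hypothetical and are not taken from the reporter's configuration, and Python's `re` module stands in for Nutch's Java-based regex URL filter.

```python
import re

# Hypothetical accept rule: any host under example.com, followed by a slash.
# The trailing "/" in the pattern is what rejects a bare "http://host" seed.
ACCEPT_RULE = re.compile(r"^http://([a-z0-9]*\.)*example\.com/")


def accepts(url: str) -> bool:
    """Return True if the URL matches the accept rule from its start."""
    return ACCEPT_RULE.search(url) is not None


seed_without_slash = "http://www.example.com"
seed_with_slash = "http://www.example.com/"

print(accepts(seed_without_slash))  # False: no slash after the host
print(accepts(seed_with_slash))     # True
```

Under these assumptions, appending a slash to the seed URL is enough for the rule to accept it.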

[jira] Resolved: (NUTCH-176) Using -dir: creates an error, when the directory already exists

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-176?page=all ] Doug Cutting resolved NUTCH-176: Resolution: Won't Fix. This check is intentionally made to prevent folks from accidentally overwriting crawls. Using -dir: creates an error, when the

Re: Generating multiple fetchlists between updates

2006-01-19 Thread Doug Cutting
Andrzej Bialecki wrote: In the 0.7 branch, whenever a segment was generated the WebDB was modified, so that the entries that ended up in the fetchlist wouldn't be immediately available to the next segment generation, if that happened before the WebDB was updated with the data from that first

[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363308 ] Doug Cutting commented on NUTCH-136: The mapred-default.xml file is actually the best place to set these. mapreduce segment generator generates 50 % less than excepted

[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

2006-01-19 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12363309 ] Doug Cutting commented on NUTCH-173: Couldn't you instead use a prefix-urlfilter generated from your crawl seed? PerHost Crawling Policy ( crawl.ignore.external.links )
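The suggestion above, deriving a prefix URL filter from the crawl seeds, could look roughly like the sketch below. It assumes a prefix filter accepts a URL when it starts with any listed prefix, one prefix per line in the filter file. The function names, the seed URLs, and the choice of `scheme://host/` as the prefix are hypothetical illustrations, not Nutch code.

```python
from urllib.parse import urlsplit


def prefixes_from_seeds(seeds):
    """Build one 'scheme://host/' prefix per distinct seed host, sorted."""
    prefixes = set()
    for seed in seeds:
        parts = urlsplit(seed)
        if parts.scheme and parts.netloc:
            prefixes.add(f"{parts.scheme}://{parts.netloc}/")
    return sorted(prefixes)


def accepts(url, prefixes):
    """Accept a URL only if it starts with one of the allowed prefixes."""
    return any(url.startswith(prefix) for prefix in prefixes)


# Hypothetical seed list; a real one would be read from the crawl's seed file.
seeds = ["http://www.example.com/", "http://docs.example.org/start.html"]
prefixes = prefixes_from_seeds(seeds)

# One prefix per line, as it might be written to a prefix filter file.
print("\n".join(prefixes))
```

With such a prefix list in place, links that lead to hosts outside the seed set are rejected by the filter, which approximates a per-host crawling policy without a dedicated configuration property.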

Authentication / Content-type

2006-01-19 Thread Thushara Wijeratna
Hi, I used nutch-0.7.1 to index an intranet. It is a really great tool, thanks for developing it! I had to hack something quick for Authentication (somehow couldn't get the crawler to accept the http.auth.basic.user etc). I also found an issue where parsing an html page returned an error Content

[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-19 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ] Matt Kangas updated NUTCH-87: Version: 0.7.2-dev, 0.8-dev. Efficient site-specific crawling for a large number of sites

[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-19 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ] Matt Kangas updated NUTCH-87: Attachment: build.xml.patch-0.8. The previous patch file is valid for 0.7. Here is one that works for 0.8-dev (trunk). (It's three separate one-line additions, to

[jira] Created: (NUTCH-182) Log when db.max configuration limits reached

2006-01-19 Thread Matt Kangas (JIRA)
Log when db.max configuration limits reached Key: NUTCH-182 URL: http://issues.apache.org/jira/browse/NUTCH-182 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev Reporter: Matt Kangas

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-19 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363352 ] Chris A. Mattmann commented on NUTCH-139: Hi Jerome, org.apache.nutch.parse.ParseData * The constructor becomes ParseData(ParseStatus, String, Outlink[],

[jira] Updated: (NUTCH-182) Log when db.max configuration limits reached

2006-01-19 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-182?page=all ] Matt Kangas updated NUTCH-182: Attachment: ParseData.java.patch, LinkDb.java.patch. Two patches are attached for nutch/trunk (0.8-dev). LinkDb.java.patch adds two new LOG.info()