Spaces in URLs

2006-10-25 Thread Scott Hayes
Are there any plans to address the issue described in the post below? URLs with spaces are still not possible in Nutch. http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01322.html
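For readers unfamiliar with the problem: a URL containing a literal space is not valid until the space is percent-encoded as %20. The sketch below only illustrates that encoding step in Python. It is not Nutch code and says nothing about how Nutch normalizes URLs; the function name encode_spaces and the example.com address are made up for the illustration.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_spaces(url: str) -> str:
    """Percent-encode spaces (and other unsafe characters) in a URL's path and query.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    parts = urlsplit(url)
    # Keep "/" and "%" untouched so path separators survive and
    # sequences that are already encoded are not encoded twice.
    path = quote(parts.path, safe="/%")
    # Keep "=" and "&" so the key=value structure of the query survives.
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    print(encode_spaces("http://example.com/my docs/a b.html"))
```

A URL that is already encoded passes through unchanged, so the helper can be applied more than once without harm.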

RE: nutch 0.8 (+ hadoop 0.5) does not crawl reliably

2006-10-25 Thread Teruhiko Kurosaka
Andrzej, this worked very well! Thanks!
> I meant setting the fs.default.name to local, and mapred.job.tracker to local.
>
> Set mapred.map.tasks to 1, and mapred.reduce.tasks to 1. You can achieve the same effect for "generate" if you use -numFetchers 1.
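The four settings quoted in this message can be written out as Hadoop-style property entries. The Python sketch below just generates that XML from the names and values given in the thread. Which file the result belongs in (for example an override file such as hadoop-site.xml) is an assumption here and is not stated in the message; the function name build_configuration is likewise made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Property names and values exactly as quoted in the thread.
LOCAL_SETTINGS = {
    "fs.default.name": "local",
    "mapred.job.tracker": "local",
    "mapred.map.tasks": "1",
    "mapred.reduce.tasks": "1",
}


def build_configuration(settings: dict) -> str:
    """Render settings as <configuration><property>...</property></configuration>.

    Hypothetical helper for illustration only.
    """
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = build_configuration(LOCAL_SETTINGS)

if __name__ == "__main__":
    print(xml_text)
```

The -numFetchers 1 option mentioned for "generate" is a command-line argument, so it is not part of this property list.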

Re: nutch 0.8 (+ hadoop 0.5) does not crawl reliably

2006-10-25 Thread Andrzej Bialecki
Teruhiko Kurosaka wrote: From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: 2006-10-24 18:27 There was a bug in some versions of 0.8, so that if you ran it with "local" FS & jobtracker it would generate too many parts of the fetchlist, and then process only one randomly selec

RE: nutch 0.8 (+ hadoop 0.5) does not crawl reliably

2006-10-25 Thread Teruhiko Kurosaka
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: 2006-10-24 18:27
> There was a bug in some versions of 0.8, so that if you ran it with "local" FS & jobtracker it would generate too many parts of the fetchlist, and then process only one randomly selected part. If that's th

Re: Preventing pages to be indexed based on content

2006-10-25 Thread Andrzej Bialecki
Eelco Lempsink wrote: Of course, for high volumes of data first indexing, and afterwards removing it, doesn't sound like a good option in my case where only a small part of the fetched data needs to be indexed. Has anyone solved this problem (elegantly)? I mainly wonder if it's feasible to d

Preventing pages to be indexed based on content

2006-10-25 Thread Eelco Lempsink
Hello, I'm looking for a solution to a problem typical of domain-specific search engines: a way to prevent certain pages from being indexed, based on their content, while keeping the outlinks of the page. When searching this mailing list I noticed this question (or something similar) being aske

Re: Fetching outside the domain ?

2006-10-25 Thread Tomi NA
2006/10/23, Andrzej Bialecki <[EMAIL PROTECTED]>:
Tomi NA wrote:
> 2006/10/18, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:
>
>> Btw we have some virtual local hosts, how does the db.ignore.external.links deal with that?
>
> Update:
> setting db.ignore.external.links to true in nutch-site (and l