Re: Recrawling (Tomi NA)

2006-09-07 Thread David Wallace
Just guessing, but could this be caused by session ids in the URL? Or some other unimportant piece of data? If this is the case, then every page would be added to the index when it's crawled, regardless of whether it's already in there, with a different session id. If this is what's causing
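If session ids are indeed the cause, one way to picture the fix is to normalize each URL before it is compared with what is already indexed. The following is only a sketch in Python, not Nutch code: the function name `strip_session_id` and the parameter names in `SESSION_PARAMS` are assumptions chosen for illustration.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that carry a session id (hypothetical).
SESSION_PARAMS = {"jsessionid", "phpsessid", "sid", "sessionid"}

def strip_session_id(url):
    """Return the URL with common session-id markers removed."""
    parts = urlsplit(url)
    # Remove a ";jsessionid=..." path parameter, as used by Java servlet containers.
    path = re.sub(r";jsessionid=[^;/?#]*", "", parts.path, flags=re.IGNORECASE)
    # Drop session-id query parameters and keep everything else in order.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment))
```

With this kind of normalization, two fetches of the same page under different session ids reduce to one URL, so the page would be recognized as already present.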

Re: help with creating a directory ie front page menu of common terms

2006-03-08 Thread David Wallace
This is true. What I do is I have Nutch log all the searches. Every few weeks, I grab the most common search terms out of the log and turn them into my common searches menu. Although having a manual process is not desirable, it does remove the possibility that a spammer will sabotage my
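The log-mining step described above could look roughly like the Python sketch below. It assumes a log with one search query per line, which may not match the actual log format, and `top_searches` is a hypothetical helper name.

```python
from collections import Counter

def top_searches(log_lines, n=10):
    """Return the n most frequent queries, ignoring case and blank lines.

    Assumes one query per log line (an assumption, not the real log layout).
    """
    counts = Counter(line.strip().lower() for line in log_lines if line.strip())
    return [term for term, _ in counts.most_common(n)]
```

The result is meant to be reviewed by hand before it goes into the menu, which keeps the manual check the message describes.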

RE: project vitality?

2006-03-05 Thread David Wallace
Hello all, I think Nutch is a fantastic product. I used 0.6 initially, then 0.7. My 0.7 installation is in production, and mostly works really well. I haven't made the move to 0.8 yet, because the direction that Nutch has gone for 0.8 is quite different from what my organisation requires from

Storing redirections in segment

2006-02-19 Thread David Wallace
Hi all, I'm running a fairly old build of 0.7, so please accept my apologies if what I'm describing has been changed in a later release. It seems that if a URL gets redirected during a crawl, then it's the original URL, not the redirected version, that gets stored in the segment and indexed.

Re: Dynamic merging of indices

2006-02-08 Thread David Wallace
Hi Ravi, This is almost exactly what I've done. I create a new NutchBean for each search, and point it at whichever of 9 subdirectories the user has selected; because I really don't want 511 (2^9-1) beans hanging around. The reason for the too many open files is that the NutchBean doesn't clean
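The figure of 511 is the number of non-empty selections a user can make from 9 subdirectories. A short Python check, with hypothetical directory names, makes the count concrete:

```python
from itertools import combinations

def nonempty_subsets(items):
    """List every non-empty combination of the given items."""
    return [combo
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]

# Hypothetical names standing in for the 9 index subdirectories.
subdirs = [f"index{i}" for i in range(1, 10)]
count = len(nonempty_subsets(subdirs))  # equals 2**9 - 1
```

Keeping one bean per combination would mean holding that many objects open, which is why a bean per search is created instead.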

Re: Nutch-general digest, Vol 1 #935 - 8 msgs

2006-02-07 Thread David Wallace
Hi Saravanaraj, For each URL, Nutch reads your filter file from top to bottom, until it finds a line (+ or -) that matches the URL. Then it stops reading. Therefore, any files inside E:/Index Samples/Index/ will be INCLUDED, because they match the line that says +^file:/E:/Index Samples/. I
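The first-match rule described above can be sketched in Python. This is an illustration, not Nutch's own code: `accepts` is a hypothetical name, the second rule line is an assumed example of an exclusion placed too late, and the sketch assumes that a URL matching no line is rejected.

```python
import re

def accepts(url, rules):
    """Apply +/- regex rules top to bottom; the first matching line decides."""
    for line in rules:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            return sign == "+"  # stop reading at the first match
    return False  # assumption: no matching line means the URL is rejected

rules = [
    "+^file:/E:/Index Samples/",
    "-^file:/E:/Index Samples/Index/",  # assumed exclusion; never reached
]
```

With the rules in this order, a file under E:/Index Samples/Index/ is included because the `+` line matches first. Moving the `-` line above the `+` line would exclude it.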

Re: Background color searched word

2006-01-11 Thread David Wallace
Hi Andy, I don't know which version of Nutch you're using, but in 0.7, it's in the Summary.Highlight class in org.apache.nutch.searcher. It should be clear how to change it. Regards, David Wallace. Date: Wed, 11 Jan 2006 15:20:05 -0500 From: Andy Morris [EMAIL PROTECTED] To: nutch-user

Re: java.io.IOException: already exists

2006-01-04 Thread David Wallace
I used to get this occasionally too, running Windows 2000. It looks to me like one of the tools can sometimes fail, for whatever reason, but leave some kind of OS-level locks on some of the files. This is usually the UpdateDatabaseTool. On the FOLLOWING run of the UpdateDatabaseTool, the files

Multiple anchors on same site - what's better than making these unique?

2005-12-19 Thread David Wallace
Hi all, I've been grubbing around with Nutch for a while now, although I'm still working with 0.7 code. I notice that when anchors are collected for a document, they're made unique by domain and by anchor text. I'm using Nutch for an intranet style search engine, on a single site, so I don't

Re: Multiple anchors on same site - what's better than making these unique?

2005-12-19 Thread David Wallace
. This is an effective way to limit the size of the link database, keeping only the highest quality links. </description> </property> ... setting to false? Stefan On 20.12.2005 at 00:49, David Wallace wrote: Hi all, I've been grubbing around with Nutch for a while now, although I'm still working