Just guessing, but could this be caused by session ids in the URL? Or
some other unimportant piece of data? If this is the case, then every
page would be added to the index when it's crawled, regardless of
whether it's already in there, with a different session id. If this is
what's causing
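
To illustrate the session id theory, here is a minimal sketch of stripping session ids from a URL before it is compared with what's already in the index. This is not Nutch code: the class name SessionIdStripper and the parameter names (jsessionid, PHPSESSID) are assumptions for the example, so substitute whatever your site actually puts in its URLs.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: removes session ids from URLs. */
final class SessionIdStripper {
    // Servlet-style path parameter, e.g. /page.jsp;jsessionid=ABC123
    private static final Pattern PATH_PARAM =
        Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);

    // Query-string parameter, e.g. ?PHPSESSID=abc or &jsessionid=abc
    private static final Pattern QUERY_PARAM =
        Pattern.compile("([?&])(?:jsessionid|phpsessid)=[^&#]*&?",
                        Pattern.CASE_INSENSITIVE);

    // A '?' or '&' left dangling at the end of the query string
    private static final Pattern DANGLING =
        Pattern.compile("[?&](?=#|$)");

    static String strip(String url) {
        String s = PATH_PARAM.matcher(url).replaceAll("");
        // Keep the leading '?' or '&' so any following parameters stay attached
        s = QUERY_PARAM.matcher(s).replaceAll("$1");
        return DANGLING.matcher(s).replaceAll("");
    }
}
```

With something like this in place, two fetches of the same page under different session ids reduce to the same URL, so the second one no longer looks like a new page.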
This is true. What I do is I have Nutch log all the searches. Every few
weeks, I grab the most common search terms out of the log and turn them into my
common searches menu. Although having a manual process is not desirable, it
does remove the possibility that a spammer will sabotage my
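
The "grab the most common search terms" step could look roughly like the sketch below. It assumes the log has already been reduced to one query string per entry, which is a guess at the format. The class and method names (TopSearches, topQueries) are made up for the example. The result is only a candidate list, so a person still reviews it before anything goes into the menu.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: ranks logged queries by frequency. */
final class TopSearches {
    static List<String> topQueries(List<String> loggedQueries, int limit) {
        // Count each query, ignoring case and surrounding whitespace
        Map<String, Integer> counts = new HashMap<>();
        for (String query : loggedQueries) {
            String key = query.trim().toLowerCase(Locale.ROOT);
            if (!key.isEmpty()) {
                counts.merge(key, 1, Integer::sum);
            }
        }

        // Most frequent first; ties broken alphabetically for a stable order
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```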
Hello all,
I think Nutch is a fantastic product. I used 0.6 initially, then 0.7.
My 0.7 installation is in production, and mostly works really well. I
haven't made the move to 0.8 yet, because the direction that Nutch has
gone for 0.8 is quite different from what my organisation requires from
Hi all,
I'm running a fairly old build of 0.7, so please accept my apologies if
what I'm describing has been changed in a later release.
It seems that if a URL gets redirected during a crawl, then it's the
original URL, not the redirected version, that gets stored in the
segment and indexed.
Hi Ravi,
This is almost exactly what I've done. I create a new NutchBean for
each search, and point it at whichever of 9 subdirectories the user has
selected, because I really don't want 511 (2^9-1) beans hanging around.
The reason for the too many open files is that the NutchBean doesn't
clean
Hi Saravanaraj,
For each URL, Nutch reads your filter file from top to bottom, until it
finds a line (+ or -) that matches the URL. Then it stops reading.
Therefore, any files inside E:/Index Samples/Index/ will be INCLUDED,
because they match the line that says +^file:/E:/Index Samples/.
I
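
The top-to-bottom, first-match-wins behaviour described above can be sketched as follows. This is a simplified illustration, not Nutch's own filter class. The name FirstMatchUrlFilter is made up. The "-" rule in the usage note is an assumed example of a later exclude line. It is also an assumption that a URL matching no line is rejected and that a pattern may match anywhere in the URL unless anchored with ^.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical first-match URL filter, for illustration only. */
final class FirstMatchUrlFilter {
    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, Pattern pattern) {
            this.include = include;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Each line is '+' or '-' followed by a regular expression. */
    FirstMatchUrlFilter(List<String> lines) {
        for (String line : lines) {
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue; // skip blank lines and comments
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** Reads the rules in order and stops at the first one that matches. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.include;
            }
        }
        return false; // assumed default when nothing matches
    }
}
```

With the rules "+^file:/E:/Index Samples/" followed by "-^file:/E:/Index Samples/Index/", a file under E:/Index Samples/Index/ is accepted, because the "+" line matches first and the "-" line is never reached. Swapping the two lines makes the more specific "-" line match first, so the file is rejected.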
Hi Andy,
I don't know which version of Nutch you're using, but in 0.7, it's in
the Summary.Highlight class in org.apache.nutch.searcher. It should be
clear how to change it.
Regards,
David Wallace.
Date: Wed, 11 Jan 2006 15:20:05 -0500
From: Andy Morris [EMAIL PROTECTED]
To: nutch-user
I used to get this occasionally too, running Windows 2000. It looks to
me like one of the tools can sometimes fail, for whatever reason, but
leave some kind of OS-level locks on some of the files. This is
usually the UpdateDatabaseTool. On the FOLLOWING run of the
UpdateDatabaseTool, the files
Hi all,
I've been grubbing around with Nutch for a while now, although I'm
still working with 0.7 code. I notice that when anchors are collected
for a document, they're made unique by domain and by anchor text.
I'm using Nutch for an intranet style search engine, on a single
site, so I don't
. This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>
... setting to false?
Stefan
On 20.12.2005 at 00:49, David Wallace wrote:
Hi all,
I've been grubbing around with Nutch for a while now, although I'm
still working