I'm having an interesting problem that I think revolves around the interplay
of a few settings that I'm not really clear on how they affect the crawl.
Currently I have:
content.limit = -1
fetcher.threads = 1000
fetcher.threads.per.host = 100
indexer.max.tokens = 75
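
For reference, here is a sketch of how those settings would be written in conf/nutch-site.xml, assuming they correspond to the standard Nutch 1.x property names (http.content.limit, fetcher.threads.fetch, fetcher.threads.per.host, indexer.max.tokens):

```xml
<!-- Sketch only: values taken from the settings above, property names
     assumed to be the standard Nutch 1.x ones. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>-1 disables truncation of downloaded content.</description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>1000</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>100</value>
</property>
<property>
  <name>indexer.max.tokens</name>
  <value>75</value>
  <description>Only the first 75 tokens of each document are indexed.</description>
</property>
```

Note that indexer.max.tokens caps how much of each page makes it into the index, independent of http.content.limit.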
I also increased the JAVA
This is strange. I manage the webservers for a large university library. On
our site we have a staff directory where each user has a page of
information. The URLs take the form of:
http://mydomain.edu/staff/userid
I've added the staff URL to the urls seed file. But even with a crawl set to
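
One thing worth checking when seed URLs under a path are skipped is conf/regex-urlfilter.txt, since the default rules can exclude URLs before they are ever generated for fetching. A quick sanity check of a hypothetical allow rule (the pattern and userid below are examples, not from the original post):

```shell
# Hypothetical allow rule, as it would appear in conf/regex-urlfilter.txt:
#   +^http://mydomain\.edu/staff/
# Sanity-check that a sample staff URL matches that pattern:
echo "http://mydomain.edu/staff/jdoe" \
  | grep -E '^http://mydomain\.edu/staff/' \
  && echo "URL passes the filter"
```

If a staff URL fails a check like this, the crawler drops it silently, which can look exactly like an unexplained gap in the fetch.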
Both good ideas. Unfortunately, the content for each user is the same. It's a
static PHP file that simply pulls each user's information out of our LDAP
directory.
It's very strange because I cannot see any difference between the user
files/directories that are fetched and those that aren't. In checking both
the crawl
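
When fetched and unfetched pages look identical on disk, one way to see how Nutch itself classified them is to compare their CrawlDb entries. A sketch using the Nutch 1.x CLI (the crawl directory path and userids here are placeholders):

```shell
# Dump the CrawlDb entry for one staff page that was fetched and one
# that wasn't (paths and userids are placeholders for this example):
bin/nutch readdb crawl/crawldb -url http://mydomain.edu/staff/fetched_user
bin/nutch readdb crawl/crawldb -url http://mydomain.edu/staff/skipped_user
# Comparing the Status lines (e.g. db_fetched vs. db_unfetched) shows
# whether the fetcher ever attempted the URL or the generator never
# selected it.
```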
I'm not sure what exactly changed that made all my nullpointer errors go
away, but I'm grateful for it, whatever it was.
So, +1 from me, not that I'm even sure I get a vote in the matter, but if
it's open to anyone on the list, I'm on board.
I have an old page on my site that Nutch is fetching. The results in the
Nutch web app look like this:
Site Map
... INSECT SYSTEMATIC RESOURCES Home : Site Map search Resources by
Scientific Name ... Common Name Select
NameAlderfliesAntsAntlionsAphidsBarkliceBeesBeetlesBookliceBristletailsBugs