Nutch tuning - speed improvements that worked for me

2006-12-20 Thread RP
Some tuning results: play with what you have and you might be surprised! A simple tweak, running Java with the "-server" switch, gave a ~13% improvement for a readdb, as noted below. The -server tweak did not help query results via Tomcat, but for basic Nutch DB work it did pretty well

Nutch 0.9 logging to catalina.out fails

2006-12-20 Thread RP
I made no changes to the logging configuration, which worked fine in 0.8, but in 0.9 I get this once I run a query (the query itself returns just fine): INFO: Server startup in 1947 ms log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: / (Is a directory) at java.io.FileOutputStream.open

Fun question for index merge

2006-12-20 Thread sdeck
Thanks for everyone's help so far with my postings. Here is another question. I am currently merging my crawls, but am wondering whether I can skip a few steps, and how. I inject a whole slew of URLs into a crawl each time, and then merge it with the previous crawl. The URLs injected

Re: 0.8 output\index versus output\indexes

2006-12-20 Thread liv
I found this message while looking for a way to update the subcollections field when reindexing. I had no explanation for my issue: I fetched/indexed some sites using subcollection.xml, then made changes to subcollection.xml and reindexed. While inspecting the db with Luke, or using the we

Re: Need help with deleteduplicates

2006-12-20 Thread Dennis Kubes
If I understand what you are asking: the getRecordReader method of the InputFormat inner class in DeleteDuplicates gets the hash score from the document. You could put your algorithm there and return some type of numeric value based on analysis of the document fields. You would n

Re: Web interface problems

2006-12-20 Thread Andrzej Bialecki
Robin Haswell wrote: On Wed, 2006-12-20 at 12:38 +0100, Andrzej Bialecki wrote: This is the problem - you need to increase the heap space in your Tomcat. Since you expanded your index, the bigger index won't fit in the same heap space as before ... especially when you run searches that touch

Re: Web interface problems

2006-12-20 Thread Robin Haswell
On Wed, 2006-12-20 at 12:38 +0100, Andrzej Bialecki wrote: > This is the problem - you need to increase the heap space in your > Tomcat. Since you expanded your index, the bigger index won't fit in the > same heap space as before ... especially when you run searches that > touch more of the index

Re: Web interface problems

2006-12-20 Thread Andrzej Bialecki
Robin Haswell wrote: Hey there, I'm having issues searching with my newly (vastly) expanded database. Could anyone shed any light on this? Basically, on a newly started server, I search for "test", and this appears in catalina.out: 2006-12-20 10:51:40,710 INFO NutchBean - creating new bean 2006

Web interface problems

2006-12-20 Thread Robin Haswell
Hey there, I'm having issues searching with my newly (vastly) expanded database. Could anyone shed any light on this? Basically, on a newly started server, I search for "test", and this appears in catalina.out: 2006-12-20 10:51:40,710 INFO NutchBean - creating new bean 2006-12-20 10:51:40,725 INF