Add new content on the fly!

2006-04-08 Thread Goldschmidt, Dave
Hello, Sorry if this topic has arisen before, but we're trying to enhance Nutch to accept on-the-fly injections of new content. In other words, we have a crawler that feeds page injection commands to an HTTP server - this server, in turn, adds the URL to the crawldb (if necessary), generates

RE: Scaling Nutch 0.8 via Map/Reduce

2006-01-06 Thread Goldschmidt, Dave
avoiding duplicates. CC- -Original Message- From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 04, 2006 4:10 PM To: nutch-user@lucene.apache.org Subject: RE: Scaling Nutch 0.8 via Map/Reduce Thanks for the response. Having a bunch of 50GB segments is more

RE: upgrade to version 0.8

2006-01-04 Thread Goldschmidt, Dave
If you're not worried about scaling up, I think version 0.7.x should be just fine. I've been working on migrating an 0.7.1 prototype (integrated with existing infrastructure) to 0.8 -- it's not trivial. :-) The file and folder structures have changed. DaveG -Original Message- From:

RE: Scaling Nutch 0.8 via Map/Reduce

2006-01-04 Thread Goldschmidt, Dave
that 50GB or so, and do fast recovery via a sync process and a check daemon. NOTE: At the time we built our solution NDFS was not production quality -- not sure where things stand now. -Original Message- From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 04, 2006 3

RE: Help! 0.7.1 segments don't work in 0.8

2005-12-23 Thread Goldschmidt, Dave
that transform things from one format to the other, but require some Java and Nutch skills. I suggest just starting from scratch. Stefan On 23.12.2005 at 16:51, Goldschmidt, Dave wrote: Hello, I know I've posted this question before, but I've had no success in using a 70GB 0.7.1 segment

RE: Help! 0.7.1 segments don't work in 0.8

2005-12-23 Thread Goldschmidt, Dave
Hello, yes, I think 0.7.x without NDFS and Map/Reduce should be fine for one-machine applications. You should be able to set up Nutch without NDFS or Map/Reduce by setting the fs.default.name and mapred.job.tracker properties to local (I think these may already be the default). HTH, DaveG

Re: java.net.ConnectException: Connection refused

2005-12-21 Thread Goldschmidt, Dave
Marko, thanks also from me! Just found this in the archives. DaveG Hi Mike, Exception in thread "main" java.lang.NullPointerException at org.apache.nutch.mapred.JobTrackerInfoServer.<init>(JobTrackerInfoServer.java:67) at This is a little bug. You must copy

RE: build instructions?

2005-12-19 Thread Goldschmidt, Dave
Hello, I ran into the same problem (which I think is fixed in future releases). For Nutch 0.7.1, just create the missing directories and run the ant script again. HTH, DaveG -Original Message- From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED] Sent: Monday, December 19, 2005 2:38 PM

RE: After mergesegs

2005-12-15 Thread Goldschmidt, Dave
it searchable. updatedb doesn't affect the segments. it just updates the webdb with the results of the fetch. so if you already ran updatedb on the pre-merged segments, you don't need to run it again. On 12/8/05, Goldschmidt, Dave [EMAIL PROTECTED] wrote: Hi, just wanted to be sure - after I

After mergesegs

2005-12-08 Thread Goldschmidt, Dave
Hi, just wanted to be sure - after I merge segments via the mergesegs tool, I need to use the updatedb tool before dropping the new indexes in, correct? And, as just posted, I need to shut down and restart Tomcat, too, yes? Thanks, DaveG

merge vs. updatedb

2005-12-06 Thread Goldschmidt, Dave
Hello, I'm currently aiming for a ~7M page index - but what exactly am I aiming for? :-) In other words, should I end up with only ONE Nutch segment when I'm done? Currently, I have 52. Will having one Nutch segment be quickest for searches? If so, should I perform the merge

RE: Speed of indexing

2005-12-06 Thread Goldschmidt, Dave
05.12.2005 at 22:24, Goldschmidt, Dave wrote: Hello, In searching for solutions, I found an old post from Doug on tuning these parameters -- but this old message applied to ~30,000 documents only: http://marc.theaimsgroup.com/?l=lucene-user&m=110235452004799&w=2 I've upped both the mergeFactor

Speed of indexing

2005-12-05 Thread Goldschmidt, Dave
Hello, I'm currently indexing ~50 segments, each ~2GB in size, for a total of only ~7,000,000 pages. From the log output, I see an index rate of ~72 records/second. Doing the math, this is over 24 hours of time to index these segments. Does this sound slow? If so, any suggestions as to
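The "doing the math" step in the post above can be checked with a short calculation. This is only a sketch using the figures quoted in the message (~7,000,000 pages at ~72 records/second); the function name `indexing_hours` is hypothetical and is not part of Nutch, and the estimate assumes the rate stays constant for the whole run.

```python
def indexing_hours(total_records: int, records_per_second: float) -> float:
    """Estimate wall-clock hours needed to index total_records at a steady rate.

    Hypothetical helper for illustration only; it is not a Nutch API.
    """
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    seconds = total_records / records_per_second
    return seconds / 3600.0


if __name__ == "__main__":
    # Figures quoted in the post: ~7,000,000 pages, ~72 records/second.
    hours = indexing_hours(7_000_000, 72)
    print(f"Estimated indexing time: {hours:.1f} hours")
```

At the quoted rate the estimate comes out at roughly 27 hours, which agrees with the "over 24 hours" figure in the message.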

RE: Speed of indexing

2005-12-05 Thread Goldschmidt, Dave
of your nutch-site settings yet? -byron --- Goldschmidt, Dave [EMAIL PROTECTED] wrote: Hello, I'm currently indexing ~50 segments, each ~2GB in size, for a total of only ~7,000,000 pages. From the log output, I see an index rate of ~72 records/second. Doing the math, this is over 24

RE: Map Reduce

2005-09-27 Thread Goldschmidt, Dave
Hello, MapReduce is described on Nutch's Wiki: http://wiki.apache.org/nutch/Presentations Specifically: http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/mapred.pdf Hope this helps, DaveG -Original Message- From: Gal Nitzan [mailto:[EMAIL PROTECTED] Sent: