On Wednesday 10 August 2011 13:43:14 jeffersonzhou wrote:
> Markus,
>
> I am using 1.3. For 1, I would like to modify or delete the contents
> already in crawldb, rather than those that haven't been injected into
> crawldb.

You can only use the updatedb job to mutate contents. You either add items
from a segment or filter out items using a URL filter. Modifying individual
records is very hard; there is no API to mutate all fields of one record.

> For 2, I am using Berkeley db to save interim results when parsing html
> pages. They sit as jar file, and I create, read and write to these
> databases on the fly
>
> For 3, memcached pool is something that I may consider. But I am not sure
> if I know it well enough.

Please be more specific. It's very easy. Fire up a few memcached nodes and
point your client at them. Easy get and set of objects. There should be a
Java client too, but I have never used it from Java.

> thanks
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Wednesday, August 10, 2011 5:50 PM
> To: [email protected]
> Cc: jeffersonzhou
> Subject: Re: Some questions regarding nutch in distributed computing
> environment
>
> On Wednesday 10 August 2011 10:17:18 jeffersonzhou wrote:
> > Hi,
> >
> > I have three questions and hope someone can help answer them.
> >
> > 1. Is there a way to update, add or delete contents in crawlDB? I
> > am more interested in knowing the answer in distributed computing
> > environment.
>
> Is this Nutch 2.x or 1.x? In 1.x you can set a urlfilter to remove
> unwanted entries and update the db.
>
> > 2. I have used Berkeley DB in standalone Nutch, and I want to use
> > Berkeley DB in distributed Nutch environment. How can I read from and
> > write to the Berkeley DB in HDFS?
>
> I don't follow? How do you use bdb in Nutch? Nutch 2.x with Gora doesn't
> support this either iirc.
>
> > 3. I have stored some frequently used data in memory, how can the
> > data be accessed by all the nutch instances?
>
> Shared memory? I'd go for a memcached pool, much easier and works in a
> cluster instead of just a single machine. Take care, your mappers and
> reducers are not idempotent anymore if you read/write during such a
> phase.
>
> > Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
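
To make the memcached pool suggestion above a bit more concrete, here is a
rough sketch of what a pool client does: it hashes each key to pick one node,
so every process configured with the same node list finds the same data. This
is only an illustration under assumptions. FakeNode and Pool are hypothetical
names, the nodes are in-memory stand-ins rather than real memcached servers,
and the host names are made up. A real setup would use an existing memcached
client library instead.

```python
import zlib


class FakeNode:
    """In-memory stand-in for one memcached server (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def set(self, key, value):
        self.store[key] = value
        return True

    def get(self, key):
        # Returns None on a miss, like a cache lookup that finds nothing.
        return self.store.get(key)


class Pool:
    """Routes each key to one node by hashing the key.

    Any client built with the same node list, in the same order, maps a
    given key to the same node.
    """

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("pool needs at least one node")
        self.nodes = list(nodes)

    def node_for(self, key):
        # crc32 is stable across processes, unlike Python's built-in hash().
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def set(self, key, value):
        return self.node_for(key).set(key, value)

    def get(self, key):
        return self.node_for(key).get(key)


# Made-up node addresses; both "clients" share the same node list.
nodes = [FakeNode("cache1:11211"), FakeNode("cache2:11211"), FakeNode("cache3:11211")]
writer = Pool(nodes)  # e.g. one task writing a value
reader = Pool(nodes)  # e.g. another task reading it back

writer.set("host:example.org", {"score": 1.5})
print(reader.get("host:example.org"))
print(reader.get("missing-key"))
```

Two caveats on this sketch. Plain modulo hashing remaps most keys whenever the
node list changes, which is why real clients often offer consistent hashing
instead. And regarding the idempotency warning: a plain set of the same key
and value can be repeated safely if a task is retried, while something like a
counter increment cannot.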

