On Wednesday 10 August 2011 10:17:18 jeffersonzhou wrote:
> Hi,
>
> I have three questions and hope someone can help answer them.
>
> 1. Is there a way to update, add or delete contents in crawlDB? I am
> more interested in knowing the answer in a distributed computing environment.

Is this Nutch 2.x or 1.x? In 1.x you can set a urlfilter to remove unwanted
entries and then update the db.

> 2. I have used Berkeley DB in standalone Nutch, and I want to use
> Berkeley DB in a distributed Nutch environment. How can I read from and write
> to the Berkeley DB in HDFS?

I don't follow. How do you use BDB in Nutch? Nutch 2.x with Gora doesn't
support this either, IIRC.

> 3. I have stored some frequently used data in memory, how can the
> data be accessed by all the nutch instances?

Shared memory? I'd go for a memcached pool: it is much easier and works across
a cluster instead of just a single machine. Take care: your mappers and
reducers are no longer idempotent if you read/write during such a phase.

> Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
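
To illustrate the idempotency caveat in the reply above, here is a minimal Python sketch, not Nutch or Hadoop code. A plain dict stands in for the memcached pool, and the function names (counting_mapper, keyed_mapper) and the example URLs are hypothetical. Hadoop may run the same task more than once (a retry after a failure, or speculative execution), so the sketch simply calls each mapper twice on the same input.

```python
def counting_mapper(records, store):
    """Non-idempotent: every run adds to a shared counter as a side effect."""
    for url in records:
        host = url.split("/")[2]  # "http://example.com/a" -> "example.com"
        store[host] = store.get(host, 0) + 1
    return [(url, 1) for url in records]


def keyed_mapper(records, store):
    """Idempotent variant: each record writes a fixed value under its own key."""
    for url in records:
        store["seen:" + url] = 1
    return [(url, 1) for url in records]


records = ["http://example.com/a", "http://example.com/b"]

# Assumption: the framework runs the same task twice on the same input.
counts = {}  # hypothetical stand-in for the shared memcached pool
counting_mapper(records, counts)
counting_mapper(records, counts)

seen = {}
keyed_mapper(records, seen)
keyed_mapper(records, seen)

print(counts)  # the counter reflects both attempts, not the two input records
print(seen)    # the same two keys, however many times the task ran
```

In the first mapper the shared state depends on how many times the task was attempted. In the second, rerunning the task overwrites the same keys with the same values, so a repeated attempt leaves the store unchanged.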

