Re: Some questions regarding nutch in distributed computing environment

Markus Jelsma Wed, 10 Aug 2011 02:49:22 -0700


On Wednesday 10 August 2011 10:17:18 jeffersonzhou wrote:
> Hi,
> 
> 
> 
> I have three questions and hope someone can help answer them.
> 
> 
> 
> 1.       Is there a way to update, add or delete contents in crawlDB? I am
> more interested in knowing the answer in distributed computing environment.


Is this Nutch 2.x or 1.x? In 1.x you can configure a URL filter to drop 
unwanted entries and then update the db.
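
To illustrate the filtering idea only, here is a minimal sketch. It is not Nutch code: the rule list, the accept() and filter_db() helpers and the sample entries are all hypothetical, and the dict merely stands in for the db. It assumes rules in the style of a regex URL filter, where each rule is a '+' or '-' sign plus a pattern, the first matching rule decides, and a URL that matches no rule is dropped.

```python
import re

# Hypothetical rules: ('+' keep / '-' drop, pattern). Order matters.
RULES = [
    ("-", re.compile(r"^https?://spam\.example\.com/")),  # drop one host
    ("-", re.compile(r"\.(gif|jpg|png)$")),               # drop images
    ("+", re.compile(r".")),                              # keep the rest
]


def accept(url, rules=RULES):
    """Return True if the first matching rule is '+'; reject if none match."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


def filter_db(db, rules=RULES):
    """Return a new dict holding only the entries whose URL is accepted."""
    return {url: status for url, status in db.items() if accept(url, rules)}


# Illustrative entries only; a plain dict stands in for the real db.
crawldb = {
    "http://example.com/": "db_fetched",
    "http://spam.example.com/page": "db_unfetched",
    "http://example.com/logo.png": "db_fetched",
}
filtered = filter_db(crawldb)
print(sorted(filtered))
```

The point of the sketch is that deletion is expressed as a filter rule rather than as a direct edit: entries disappear when the db is rewritten with the filter applied.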

> 
> 2.       I have used Berkeley DB in standalone Nutch, and I want to use
> Berkeley DB in distributed Nutch environment. How can I read from and write
> to the Berkeley DB in HDFS?

I don't follow. How do you use BDB in Nutch? Nutch 2.x with Gora doesn't 
support this either, iirc.

> 
> 3.       I have stored some frequently used data in memory, how can the
> data be accessed by all the nutch instances?

Shared memory? I'd go for a memcached pool: it is much easier and works across 
a cluster instead of just a single machine. Take care, though: your mappers and 
reducers are no longer idempotent if you read from or write to it during such a 
phase.
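
The idempotency caveat can be sketched as follows. This is a simulation under stated assumptions, not Hadoop or memcached code: a plain dict stands in for the shared cache, and run_with_retry() is a hypothetical helper that mimics a framework running the same map work twice (for example after a task failure or through speculative execution).

```python
def map_non_idempotent(cache, record):
    """Increment a shared counter: every re-run changes the result."""
    cache["seen"] = cache.get("seen", 0) + 1
    return (record, 1)


def map_idempotent(cache, record):
    """Set a per-record key: re-running overwrites the same value."""
    cache["seen:" + record] = 1
    return (record, 1)


def run_with_retry(mapper, cache, records, retried):
    """Run mapper over records, running those in `retried` a second time."""
    for record in records:
        mapper(cache, record)
        if record in retried:
            mapper(cache, record)  # simulated duplicate attempt


records = ["url-a", "url-b", "url-c"]

counter_cache = {}  # stand-in for the shared cache
run_with_retry(map_non_idempotent, counter_cache, records, retried={"url-b"})

key_cache = {}  # stand-in for the shared cache
run_with_retry(map_idempotent, key_cache, records, retried={"url-b"})

print(counter_cache["seen"], len(key_cache))
```

With three records and one duplicate attempt, the counter variant ends up overcounting, while the keyed variant still holds one entry per record. If shared state must be written from a task, writes that can be repeated safely are the less fragile choice.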

> 
> 
> 
> Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
