Markus,

I am using 1.3. For 1, I would like to modify or delete contents that are
already in the crawldb, rather than filter URLs that haven't been injected
into the crawldb yet.

For 2, I am using Berkeley DB to save interim results when parsing HTML
pages. The library ships as a jar file, and I create, read and write these
databases on the fly.

For 3, a memcached pool is something that I may consider, but I am not sure
I know it well enough. Could you be more specific?

Thanks

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Wednesday, August 10, 2011 5:50 PM
To: [email protected]
Cc: jeffersonzhou
Subject: Re: Some questions regarding nutch in distributed computing
environment



On Wednesday 10 August 2011 10:17:18 jeffersonzhou wrote:
> Hi,
> 
> 
> 
> I have three questions and hope some can help answer them.
> 
> 
> 
> 1.       Is there a way to update, add or delete contents in crawlDB? I am
> more interested in knowing the answer in a distributed computing
> environment.

Is this Nutch 2.x or 1.x? In 1.x you can set a URL filter to remove unwanted
entries and update the db.
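As a rough sketch of that approach (paths, URLs and the filtered-db name here
are assumptions, not from your setup): add exclusion rules to
conf/regex-urlfilter.txt, then rewrite the crawldb through the filter with
the mergedb tool.

```shell
# Hypothetical example: drop unwanted entries from an existing Nutch 1.x crawldb.
# 1. Add an exclusion rule to conf/regex-urlfilter.txt, e.g.:
#      -^http://example\.com/unwanted/
# 2. Rewrite the crawldb through the configured URL filters:
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter
# 3. Once verified, replace the old crawldb with the filtered copy.
```

The same -filter switch is applied during updatedb, so future updates will
also keep the excluded entries out.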

> 
> 2.       I have used Berkeley DB in standalone Nutch, and I want to use
> Berkeley DB in distributed Nutch environment. How can I read from and
> write to the Berkeley DB in HDFS?

I don't follow. How do you use BDB in Nutch? Nutch 2.x with Gora doesn't 
support this either, IIRC.

> 
> 3.       I have stored some frequently used data in memory, how can the
> data be accessed by all the nutch instances?

Shared memory? I'd go for a memcached pool: much easier, and it works in a 
cluster instead of just a single machine. Take care, though: your mappers and 
reducers are no longer idempotent if you read/write shared state during such 
a phase.
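A minimal sketch of such a pool (hostnames, ports and memory sizes below are
assumptions for illustration): run one memcached instance per node, then
point any memcached client at the full list of nodes.

```shell
# Hypothetical setup: one memcached daemon per cluster node.
# -d: daemonize, -m 512: 512 MB cache, -p 11211: listen port, -u: run-as user
memcached -d -m 512 -p 11211 -u nobody

# A Java client (e.g. spymemcached) would then be given the whole pool,
# so every Nutch task on any node sees the same shared cache:
#   new MemcachedClient(AddrUtil.getAddresses("node1:11211 node2:11211"));
```

The client hashes each key to one node in the pool, so all Nutch instances
read and write the same shared data without any HDFS involvement.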

> 
> 
> 
> Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
