On Wednesday 10 August 2011 13:43:14 jeffersonzhou wrote:
> Markus,
> 
> I am using 1.3. For 1, I would like to modify or delete the contents already
> in crawldb, rather than contents that haven't been injected into crawldb yet.

You can only use the updatedb job to mutate the contents. You either add items 
from a segment or filter out items using a URL filter. Modifying individual 
records is very hard; there is no API to mutate all fields of a single record.
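As a sketch of the filter route in 1.x (the regex, crawldb path, and segment path below are illustrative placeholders, not taken from this thread): add a reject rule to `conf/regex-urlfilter.txt`, then re-run the updatedb job with `-filter` so the URL filters are applied to the existing crawldb records as well as the segment input.

```shell
# Illustrative rule for conf/regex-urlfilter.txt: reject everything
# under example.com/old/ (pattern is a placeholder):
#
#   -^http://example\.com/old/

# Re-run updatedb with -filter so existing crawldb entries matching
# the reject rule are dropped during the update; the segment path
# here is an example.
bin/nutch updatedb crawl/crawldb crawl/segments/20110810000000 -filter
```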

> 
> For 2, I am using Berkeley DB to save interim results when parsing html
> pages. They sit as a jar file, and I create, read, and write to these
> databases on the fly.
> 
> For 3, a memcached pool is something that I may consider. But I am not sure
> if I know it well enough. Please be more specific.

It's very easy. Fire up a few memcached nodes and point your client at them. 
Easy get and set of objects. There should be a Java client too, but I never 
used it from Java.
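As a minimal sketch of that get/set pattern from Java, assuming the spymemcached client library (the host names, port, and key below are placeholders; any memcached client would work the same way):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.Arrays;

import net.spy.memcached.MemcachedClient;

public class SharedCacheSketch {
  public static void main(String[] args) throws IOException {
    // Point the client at the memcached pool; "cache1"/"cache2" are
    // placeholder host names for the memcached nodes.
    MemcachedClient client = new MemcachedClient(
        Arrays.asList(new InetSocketAddress("cache1", 11211),
                      new InetSocketAddress("cache2", 11211)));

    // set(key, expirySeconds, value) and get(key) are the basic
    // operations; values must be Serializable.
    client.set("host:example.com", 3600, "some-frequently-used-value");
    Object cached = client.get("host:example.com");

    client.shutdown();
  }
}
```

Any Nutch instance pointed at the same pool sees the same keys, which is what makes this work across a cluster where a single shared in-process memory region does not.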

> 
> thanks
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Wednesday, August 10, 2011 5:50 PM
> To: [email protected]
> Cc: jeffersonzhou
> Subject: Re: Some questions regarding nutch in distributed computing
> environment
> 
> On Wednesday 10 August 2011 10:17:18 jeffersonzhou wrote:
> > Hi,
> > 
> > 
> > 
> > I have three questions and hope some can help answer them.
> > 
> > 
> > 
> > 1.       Is there a way to update, add or delete contents in crawlDB? I
> > am more interested in knowing the answer in a distributed computing
> > environment.
> 
> Is this Nutch 2.x or 1.x? In 1.x you can set a urlfilter to remove unwanted
> entries and update the db.
> 
> > 2.       I have used Berkeley DB in standalone Nutch, and I want to use
> > Berkeley DB in a distributed Nutch environment. How can I read from and
> > write to the Berkeley DB in HDFS?
> 
> I don't follow. How do you use Berkeley DB in Nutch? Nutch 2.x with Gora
> doesn't support this either, IIRC.
> 
> > 3.       I have stored some frequently used data in memory, how can the
> > data be accessed by all the nutch instances?
> 
> Shared memory? I'd go for a memcached pool; it's much easier and works in a
> cluster instead of just on a single machine. Take care: your mappers and
> reducers are not idempotent anymore if you read/write during such a phase.
> 
> > Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
