RE: Nutch 1.X with alternative storage

Markus Jelsma Fri, 16 Jun 2017 12:10:45 -0700

Hello,

It should not be too hard to make it work. As far as I can tell, you need a 
custom InputFormat and OutputFormat for your database, and you need to modify 
Nutch's main job classes to use them.


Take care: this would mean that your database is read and rewritten 
entirely on updatedb, every cycle. Each segment and the CrawlDb would be 
separate tables or indexes, whatever you use. If segment data is all you care 
about, you would only need to implement that part, of course, and could keep 
the CrawlDb on disk.
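
To make the cost of the full read and rewrite concrete, here is a minimal 
sketch in plain Java. It does not use the Nutch or Hadoop APIs: the Table 
class, the updateDb method and the status strings are hypothetical stand-ins 
for a database-backed CrawlDb, a custom InputFormat/OutputFormat pair and the 
updatedb job. It only shows that a single changed URL still causes every row 
to be read and every row to be written again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, NOT Nutch or Hadoop API: models one updatedb cycle
// against a database-backed CrawlDb to show the full read and full rewrite.
public class UpdateDbSketch {

    /** Stand-in for one database table holding one version of the CrawlDb. */
    public static final class Table {
        private final Map<String, String> rows = new TreeMap<>();
        public int rowsRead = 0;
        public int rowsWritten = 0;

        /** Single-row write, as a custom OutputFormat would issue per record. */
        public void put(String url, String status) {
            rows.put(url, status);
            rowsWritten++;
        }

        /** Full table scan, as a custom InputFormat over the table would do. */
        public List<Map.Entry<String, String>> scan() {
            List<Map.Entry<String, String>> all = new ArrayList<>(rows.entrySet());
            rowsRead += all.size();
            return all;
        }

        public int size() {
            return rows.size();
        }

        public String get(String url) {
            return rows.get(url);
        }
    }

    /** Reads the old table completely, merges updates, writes a new table. */
    public static Table updateDb(Table oldDb, Map<String, String> segmentUpdates) {
        Map<String, String> merged = new TreeMap<>();
        for (Map.Entry<String, String> row : oldDb.scan()) {
            merged.put(row.getKey(), row.getValue());
        }
        // Newer status from the segment overrides the old entry.
        merged.putAll(segmentUpdates);

        Table newDb = new Table();
        for (Map.Entry<String, String> row : merged.entrySet()) {
            newDb.put(row.getKey(), row.getValue());
        }
        return newDb;
    }

    public static void main(String[] args) {
        Table crawlDb = new Table();
        crawlDb.put("http://a.example/", "db_unfetched");
        crawlDb.put("http://b.example/", "db_unfetched");
        crawlDb.put("http://c.example/", "db_unfetched");

        // Only one URL was fetched in this cycle.
        Map<String, String> updates = new TreeMap<>();
        updates.put("http://a.example/", "db_fetched");

        Table next = updateDb(crawlDb, updates);
        System.out.println("rows read: " + crawlDb.rowsRead
                + ", rows written: " + next.rowsWritten);
    }
}
```

In a real setup the scan and the writes would go over the network to the 
database on every cycle, which is why keeping the CrawlDb on disk and moving 
only the segment data can be the cheaper option.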

Markus

 
 
-----Original message-----
> From:Zoltán Zvara <[email protected]>
> Sent: Friday 16th June 2017 18:28
> To: [email protected]
> Subject: Nutch 1.X with alternative storage
> 
> Dear Nutch Community,
> 
> I'm working on a PoC with Nutch 1.X, and I am also aware of 2.X and its features. 
> I'd like to use Nutch 1.X with an alternative storage, for example Couchbase. 
> Parsed documents would be pre-processed at a Parser extension point, analyzed 
> and a specific JSON schema would be sent - for example to Couchbase. However, 
> the content should not be present in Nutch's segment table.
> 
> In other words, how to use an external storage engine with Apache Nutch 1.X 
> to bypass Gora altogether, add a custom pre-processing before ingesting data 
> into external storage, and to remove any duplicates from the segment table?
> 
> I appreciate your help, thanks!
> 
> Regards,
> Zoltán
