Thanks Markus,

As you mentioned, I only care about the segment data, and I want the raw
content and parsed content to be absent from the segments, stored only in my
external data storage. I've gone through the Nutch documentation and looked
into the code as well, but could not find the SegmentWriter - actually I did
find one, written 12 years ago by Doug Cutting, but it is no longer present.

Could you point me to the code that should be modified? Other concerns also
arise: for example, does Nutch use the content and parsed text for other
purposes when the segment is accessed again in the future?

Zoltán

On 2017-06-16 21:10:33, Markus Jelsma <[email protected]> wrote:
Hello,

It should not be too hard to make it work. All I can think of is that you need
a custom InputFormat and OutputFormat for your database, and to modify Nutch's
main job-starting classes to use your new InputFormat and OutputFormat.

Take care: this would mean that your database will be read and rewritten
entirely on updatedb, every cycle. Each segment file and the CrawlDb would be
separate tables or indexes, whatever you use. If segment data is all you care
about, you would only need to implement that, of course, keeping the CrawlDb on
disk.
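
As a rough illustration of that split, here is a minimal plain-Java sketch. It
is not Nutch or Hadoop code: ExternalStore, InMemoryStore, ParsedPage,
SegmentEntry and SplittingWriter are all hypothetical names, and the writer
only plays the role that the record-writing side of a custom OutputFormat
would have. Content goes to the external store as JSON, while the segment side
keeps nothing but per-URL bookkeeping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the external store (for example a Couchbase bucket).
interface ExternalStore {
    void put(String key, String json);
}

// In-memory store so the sketch runs without a database.
class InMemoryStore implements ExternalStore {
    final Map<String, String> docs = new HashMap<>();

    @Override
    public void put(String key, String json) {
        docs.put(key, json);
    }
}

// Hypothetical record for one fetched and parsed URL.
class ParsedPage {
    final String url;
    final String rawContent;
    final String parsedText;
    final int fetchStatus;

    ParsedPage(String url, String rawContent, String parsedText, int fetchStatus) {
        this.url = url;
        this.rawContent = rawContent;
        this.parsedText = parsedText;
        this.fetchStatus = fetchStatus;
    }
}

// Hypothetical segment entry: bookkeeping only, no content.
class SegmentEntry {
    final String url;
    final int fetchStatus;

    SegmentEntry(String url, int fetchStatus) {
        this.url = url;
        this.fetchStatus = fetchStatus;
    }
}

// Assumed role: the record-writing side of a custom OutputFormat.
class SplittingWriter {
    private final ExternalStore store;
    private final List<SegmentEntry> segment = new ArrayList<>();

    SplittingWriter(ExternalStore store) {
        this.store = store;
    }

    // Content goes to the external store; the segment keeps only url and status.
    void write(ParsedPage page) {
        store.put(page.url, toJson(page));
        segment.add(new SegmentEntry(page.url, page.fetchStatus));
    }

    List<SegmentEntry> segment() {
        return segment;
    }

    static String toJson(ParsedPage page) {
        return "{\"url\":\"" + escape(page.url) + "\","
            + "\"content\":\"" + escape(page.rawContent) + "\","
            + "\"text\":\"" + escape(page.parsedText) + "\"}";
    }

    // Minimal escaping for the sketch; a real implementation would use a JSON library.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

In a real setup the write step would sit behind Hadoop's OutputFormat and
RecordWriter interfaces, and whether later Nutch jobs still expect the content
in the segment would need to be checked separately.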

Markus



-----Original message-----
> From: Zoltán Zvara
> Sent: Friday 16th June 2017 18:28
> To: [email protected]
> Subject: Nutch 1.X with alternative storage
>
> Dear Nutch Community,
>
> I'm working on a PoC with Nutch 1.X, and I am also aware of 2.X and its
> features. I'd like to use Nutch 1.X with an alternative storage, for example
> Couchbase. Parsed documents would be pre-processed at a Parser extension
> point and analyzed, and a document following a specific JSON schema would be
> sent - for example to Couchbase. However, the content should not be present
> in Nutch's segment table.
>
> In other words, how can I use an external storage engine with Apache Nutch
> 1.X to bypass Gora altogether, add custom pre-processing before ingesting
> data into the external storage, and remove any duplicates from the segment
> table?
>
> I appreciate your help, thanks!
>
> Regards,
> Zoltán
