Thanks Markus,

As you mentioned, I only care about the segment data. I want to avoid keeping the raw content and parsed content in the segment, so that they exist only in my external data storage. I've gone through the Nutch documentation and looked into the code as well, but did not find a SegmentWriter. Actually, I did find one, written 12 years ago by Doug Cutting, but it is no longer present in the code.
Could you point me to the code that should be modified? Another concern also arises: does Nutch use the content and parsed text for other purposes when the segment is accessed again in the future?

Zoltán

On 2017-06-16 21:10:33, Markus Jelsma <[email protected]> wrote:

Hello,

It should not be too hard to make it work. All I can think of is that you need a custom InputFormat and OutputFormat for your database, and that you need to modify Nutch's main job-starting classes to use your new InputFormat and OutputFormat. Take care: this would mean that your database will be read and rewritten entirely on updatedb, every cycle. Each segment file and the CrawlDb would be separate tables or indexes, whatever you use.

If segment data is all you care about, you would only need to implement that of course, keeping the CrawlDb on disk.

Markus

-----Original message-----
> From: Zoltán Zvara
> Sent: Friday 16th June 2017 18:28
> To: [email protected]
> Subject: Nutch 1.X with alternative storage
>
> Dear Nutch Community,
>
> I'm working on a PoC with Nutch 1.X, and am also aware of 2.X and its features.
> I'd like to use Nutch 1.X with an alternative storage, for example Couchbase.
> Parsed documents would be pre-processed at a Parser extension point, analyzed,
> and a specific JSON schema would be sent, for example to Couchbase. However,
> the content should not be present in Nutch's segment table.
>
> In other words, how to use an external storage engine with Apache Nutch 1.X
> to bypass Gora altogether, add a custom pre-processing before ingesting data
> into external storage, and to remove any duplicates from the segment table?
>
> I appreciate your help, thanks!
>
> Regards,
> Zoltán

