Hey guys, thanks for the useful details.
I have looked at the implementation, and one minor review comment from my side: getJSONString() should be moved to a common util location. I could not find an appropriate utility class; perhaps a CommonUtility class at *apache-nutch-2.3.1/src/java/org/apache/nutch/util* could hold the common methods.

NUTCH-2132 seems to emit events letting Kafka know about the fetching/parsing status of the URLs; I did look at the code. I have been using version 2.3.1 and the fix seems to be done for 1.3, so I may have to port the JIRA if I need to use this feature.

Our requirement is a little different: I would expect the parsed contents to be sent to Kafka in a specific format which we can define in an Avro schema. I have been using Gobblin for ETL and have defined the schema for Kafka messaging; see http://gobblin.readthedocs.io/en/latest/Getting-Started/#other-example-jobs

I can see a couple of ways to handle it:

1) Consume the Kafka events which indicate that fetching is done; the consumer should then parse the URL, extract the content, and process it.

2) Modify the Kafka plugin so that the parsed contents are also published to Kafka; this way we avoid making multiple calls to the site being crawled. However, this has the tradeoff of consuming more network bandwidth, since the messages containing the contents will pass through the network.

Let me hear more from you.

Thanks,
Vicky

--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-to-send-data-to-Kafka-tp4312320p4312452.html
Sent from the Nutch - User mailing list archive at Nabble.com.
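P.S. On the utility-class point, here is a minimal sketch of what I had in mind. The class name CommonUtility, this particular getJSONString() signature, and the StringBuilder implementation are all just placeholders for illustration, not the actual plugin code:

```java
import java.util.Map;

/**
 * Hypothetical shared-helper class, intended to live under
 * apache-nutch-2.3.1/src/java/org/apache/nutch/util.
 * The plugin's real getJSONString() may look quite different;
 * this only sketches where a common method could move.
 */
public class CommonUtility {

    private CommonUtility() {
        // static helpers only
    }

    /**
     * Serializes a flat key/value map into a JSON object string,
     * e.g. {"url":"http://example.com","status":"fetched"}.
     */
    public static String getJSONString(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(",");
            }
            sb.append("\"").append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append("\"");
            first = false;
        }
        return sb.append("}").toString();
    }

    // Minimal escaping for backslashes and double quotes.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Any indexing plugin could then call the shared method instead of keeping its own copy.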

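P.P.S. To make the "specific format" idea concrete, this is roughly the kind of Avro schema I was thinking of for the parsed-content messages. The record name, namespace, and fields are only illustrative, not a decided format:

```json
{
  "type": "record",
  "name": "ParsedPage",
  "namespace": "org.apache.nutch.kafka",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "fetchTime", "type": "long"},
    {"name": "title", "type": ["null", "string"], "default": null},
    {"name": "content", "type": ["null", "string"], "default": null}
  ]
}
```

With a schema like this registered on the Gobblin side, the Kafka messages could be consumed directly by the ETL jobs without a separate transformation step.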
