Hi Srinivasan,

Comments inline.
Regards

----- Original Message -----
> From: "Srinivasan Ramaswamy" <[email protected]>
> To: [email protected]
> Sent: Thursday, June 15, 2017 4:40:44 AM
> Subject: [MASSMAIL]efficient way to create an index out of crawled documents from nutch
>
> Hi Everyone,
>
> I am crawling approximately 5-7 domains using nutch 1.x running in
> distributed mode. I have couple of questions
>
> 1. Should I enable indexer to run for each iteration or do it as a final
> step after all the crawling finishes ? Will there be significant
> performance gains by indexing in the end ?

You can index your documents as a final step without any problem, as long as you can tell which segments have not been indexed yet. That said, I don't think this approach improves performance.

> 2. I want to add a few fields to each document from a different data store.
> Currently, I am planning to write the set of parsed fields for each
> document as a JSON document using a nutch plugin. Later, I can join the
> documents with the external data source using a spark job and then index it
> to elasticsearch.
>
> 3. Is it a good idea to read the crawldb in HDFS using a spark job and then
> join with some external datasource and write the documents to
> elasticsearch. This is an alternate idea to the previous approach.

I suggest using the indexer-rabbit plugin to send the documents to a RabbitMQ server; you can then process them and add the fields you need in a consumer that you implement yourself. For the consumer I suggest Spring Boot, which has starters for RabbitMQ and Elasticsearch as well. The plugin will be available in Nutch 1.14, but you can already use it from here: https://github.com/apache/nutch/pull/168

> I am aware that nutch has plugins to write to elasticsearch. I am wondering
> whether decoupling indexing from nutch will give me more flexibility with
> the documents that i write to index. Also, curious which approach would be
> highly scalable and more performant.
>
> I am curious to hear what approach others took for similar use case.
>
> Thanks
> Srini
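
To make the consumer idea above a bit more concrete, here is a rough sketch of the enrichment step only, written in Python for brevity (the same logic would fit inside a Spring Boot listener). Everything in it is an assumption for illustration: the message is assumed to be a JSON object with an "id" field holding the URL, `EXTERNAL_FIELDS` stands in for your external data store, and `enrich` and `to_bulk_lines` are hypothetical helper names. The actual payload produced by the plugin may look different, and the RabbitMQ and Elasticsearch connections are left out on purpose.

```python
import json

# Hypothetical stand-in for the external data store, keyed by document URL.
# In practice this could be a database or a key-value service.
EXTERNAL_FIELDS = {
    "http://example.com/a": {"category": "news", "owner": "team-a"},
}


def enrich(message_body, lookup):
    """Merge extra fields from `lookup` into one crawled document.

    Assumes `message_body` is a JSON object with an "id" key holding the
    URL; the real message format may differ.
    """
    doc = json.loads(message_body)
    extra = lookup.get(doc["id"], {})
    # Crawled fields win on key clashes, so the external store cannot
    # overwrite content that was parsed from the page.
    return {**extra, **doc}


def to_bulk_lines(docs, index_name):
    """Render documents as newline-delimited JSON for the Elasticsearch
    _bulk endpoint: one action line followed by one source line per doc.

    Older Elasticsearch versions may also expect a "_type" in the action
    line; it is omitted here.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"
```

The consumer would call `enrich` for every message it receives, collect a batch, and send the output of `to_bulk_lines` to Elasticsearch. Keeping the join in a plain function like this makes it easy to test without a broker or a cluster running.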

