Hi Everyone, I am crawling approximately 5-7 domains using Nutch 1.x running in distributed mode. I have a couple of questions:
1. Should I run the indexer in each crawl iteration, or only once as a final step after all the crawling finishes? Will indexing at the end give a significant performance gain?

2. I want to add a few fields to each document from a different data store. Currently, I am planning to write the set of parsed fields for each document as a JSON document using a Nutch plugin. Later, I can join those documents with the external data source using a Spark job and then index the result to Elasticsearch (see the sketch at the end of this mail).

3. As an alternative to the previous approach, is it a good idea to read the crawldb in HDFS with a Spark job, join it with the external data source, and write the documents to Elasticsearch? I am aware that Nutch has plugins that write to Elasticsearch, but I am wondering whether decoupling indexing from Nutch would give me more flexibility over the documents that I write to the index. I am also curious which approach would be more scalable and performant.

I am curious to hear what approach others took for a similar use case.
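For concreteness, here is a rough sketch of the Spark job I have in mind for (2) and (3). The HDFS paths, the Elasticsearch host, the index name, and the "url" join key are just placeholders, and it assumes the Nutch plugin has already written one JSON document per parsed page to HDFS:

// Rough sketch (Scala, Spark, elasticsearch-hadoop connector).
// Paths, host, index name, and field names are made up for illustration.
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object EnrichAndIndex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nutch-enrich-index")
      .config("es.nodes", "es-host")   // assumed Elasticsearch host
      .config("es.port", "9200")
      .getOrCreate()

    // Parsed pages exported from Nutch as JSON, one object per page, keyed by "url"
    val parsed = spark.read.json("hdfs:///crawl/parsed-json/")    // hypothetical path

    // External data source with the extra fields to attach, also keyed by "url"
    val extra = spark.read.json("hdfs:///external/extra-fields/") // hypothetical path

    // Left join so pages without extra fields are still indexed
    val enriched = parsed.join(extra, Seq("url"), "left_outer")

    // Write the enriched documents to Elasticsearch ("webpages" is a made-up index name)
    enriched.saveToEs("webpages")

    spark.stop()
  }
}

The same join-and-index step would apply to approach (3); only the read side would change, since the crawldb is a sequence file of Nutch writables rather than JSON, so it would need a SequenceFile-based reader instead of spark.read.json.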
Thanks,
Srini
