Hi Everyone,

I am crawling approximately 5-7 domains using Nutch 1.x running in
distributed mode. I have a couple of questions:

1. Should I enable the indexer to run on each iteration, or run it as a
final step after all the crawling finishes? Will there be a significant
performance gain from indexing at the end?

2. I want to add a few fields to each document from a different data store.
Currently, I am planning to write the set of parsed fields for each
document out as a JSON document using a Nutch plugin. Later, I can join
those documents with the external data source using a Spark job and then
index them into Elasticsearch.

3. Is it a good idea to read the crawldb in HDFS using a Spark job, join it
with the external data source, and write the documents to Elasticsearch?
This is an alternative to the previous approach.
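For (3), a rough sketch of the read side, assuming the crawldb lives under
hdfs:///crawl/crawldb/ and that I read the (Text, CrawlDatum) entries of its
MapFiles directly as sequence files (the Nutch job jar would need to be on
the Spark classpath for CrawlDatum; paths are again placeholders):

import org.apache.hadoop.io.Text
import org.apache.nutch.crawl.CrawlDatum
import org.apache.spark.sql.SparkSession

object ReadCrawlDb {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-crawldb").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._

    // The crawldb is stored as MapFiles of (Text url, CrawlDatum), so Spark can
    // read the underlying data files as sequence files. Hadoop reuses Writable
    // objects, so convert them to plain types right away.
    val crawldb = sc
      .sequenceFile("hdfs:///crawl/crawldb/current/part-*/data",
        classOf[Text], classOf[CrawlDatum])
      .map { case (url, datum) =>
        (url.toString, datum.getStatus.toInt, datum.getFetchTime, datum.getScore)
      }
      .toDF("url", "status", "fetchTime", "score")

    // From here the join with the external source and the saveToEs call
    // would look the same as in the previous sketch.
    crawldb.show(10, truncate = false)

    spark.stop()
  }
}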

I am aware that Nutch has plugins to write to Elasticsearch. I am wondering
whether decoupling indexing from Nutch will give me more flexibility over
the documents that I write to the index. I am also curious which approach
would be more scalable and performant.

I am curious to hear what approach others have taken for a similar use case.

Thanks
Srini