Hi Srinivasan,

Comments inline.

Regards

----- Original Message -----
> From: "Srinivasan Ramaswamy" <[email protected]>
> To: [email protected]
> Sent: Thursday, June 15, 2017 4:40:44 AM
> Subject: [MASSMAIL]efficient way to create an index out of crawled documents 
> from nutch
> 
> Hi Everyone,
> 
> I am crawling approximately 5-7 domains using nutch 1.x running in
> distributed mode. I have couple of questions
> 
> 1. Should I enable indexer to run for each iteration or do it as a final
> step after all the crawling finishes ? Will there be significant
> performance gains by indexing in the end ?

You can index your documents as a final step without any problem, as long as 
you keep track of which segments have not been indexed yet. However, I don't 
think this approach by itself brings a performance improvement.
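
As a rough sketch of the bookkeeping, assuming you record the names of the 
segments you have already indexed somewhere (the function name and the sample 
segment names below are only illustrative, they are not part of Nutch):

```python
def pending_segments(all_segments, indexed_segments):
    """Return the segment names that have not been indexed yet, oldest first.

    Assumption: segment names are timestamp strings (yyyyMMddHHmmss),
    so sorting them as plain strings also sorts them chronologically.
    """
    indexed = set(indexed_segments)
    return sorted(name for name in all_segments if name not in indexed)


if __name__ == "__main__":
    # Hypothetical segment names, e.g. taken from a listing of crawl/segments.
    all_segments = ["20170614101500", "20170615044044", "20170613090000"]
    # Hypothetical record of what a previous indexing run already handled.
    already_indexed = ["20170613090000"]
    print(pending_segments(all_segments, already_indexed))
```

The resulting list is what you would pass to the indexing job in the final 
step.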

> 2. I want to add a few fields to each document from a different data store.
> Currently, I am planning to write the set of parsed fields for each
> document as a JSON document using a nutch plugin. Later, I can join the
> documents with the external data source using a spark job and then index it
> to elasticsearch.
> 
> 3. Is it a good idea to read the crawldb in HDFS using a spark job and then
> join with some external datasource and write the documents to
> elasticsearch. This is an alternate idea to the previous approach.

I suggest using the indexer-rabbit plugin to send the documents to a 
RabbitMQ server. You can then process them and add the fields you need 
in a consumer that you implement yourself.

For the consumer I suggest Spring Boot, which has starters for both RabbitMQ 
and Elasticsearch.
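
Whatever framework you choose, the core of the consumer is a small merge step. 
Here is a minimal sketch in Python of that step only, leaving out the RabbitMQ 
and Elasticsearch clients. It assumes that each message body is a JSON object 
with a "url" field and that the external data can be looked up by URL; both 
are assumptions for illustration, not the documented format of the plugin.

```python
import json


def enrich(message_body, external_fields, key="url"):
    """Merge externally stored fields into one crawled document.

    message_body    -- JSON text of a single document (assumed format)
    external_fields -- dict mapping a document key to a dict of extra fields
    key             -- field used for the join (assumed to be "url")
    """
    doc = json.loads(message_body)
    extra = external_fields.get(doc.get(key), {})
    # Design choice in this sketch: crawled fields win on a name conflict.
    merged = dict(extra)
    merged.update(doc)
    return merged


if __name__ == "__main__":
    # Hypothetical external data store, keyed by URL.
    external = {"http://example.com/a": {"category": "news"}}
    body = '{"url": "http://example.com/a", "title": "A page"}'
    print(json.dumps(enrich(body, external), sort_keys=True))
```

You would call a function like this from the message callback of your 
RabbitMQ client and send the result to Elasticsearch.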

The plugin will be available in Nutch 1.14, but until then you can get it from here: 
https://github.com/apache/nutch/pull/168

> I am aware that nutch has plugins to write to elasticsearch. I am wondering
> whether decoupling indexing from nutch will give me more flexibility with
> the documents that i write to index. Also, curious which approach would be
> highly scalable and more performant.
> 
> I am curious to hear what approach others took for similar use case.
> 
> Thanks
> Srini
> 