Hi people:

I'm considering (for now it's just an idea) integrating Nutch with a message 
queue service such as RabbitMQ, in order to do some "offline" processing of 
data crawled by Nutch and indexed into Solr. For example: I want to categorize 
the pages crawled by Nutch using machine learning techniques and store the 
category or categories in a Solr field. My main concern is that this is 
potentially time-consuming, so doing it in a parsing plugin would slow down 
the parse phase considerably. My approach would be to crawl the URL and, in 
the parsing phase, send the required data to RabbitMQ, then let some workers 
do the actual categorization of the text. For this I need to send RabbitMQ 
some sort of id, so that when the categorization is done the worker knows 
which document to update in Solr. Is it possible to get the id of a document 
sent to Solr by Nutch?
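
To make the idea concrete, here is a rough sketch of the message I have in 
mind and of the update a worker would send back. It rests on assumptions I 
have not verified: that the Solr unique key ("id") is the page URL, that the 
target field is called "category", and that the Solr version supports atomic 
updates (which, as far as I know, need the other fields to be stored). The 
function names are made up for illustration, and categorize() is only a 
stand-in for the real classifier. The RabbitMQ publish/consume calls are left 
out on purpose.

```python
import json


def build_task_message(url, text):
    """Serialize what a worker needs: the document id and the text to classify."""
    # Assumption: the Solr unique key ("id") is the page URL.
    return json.dumps({"id": url, "text": text}).encode("utf-8")


def categorize(text):
    """Hypothetical stand-in for the real (slow) ML classifier."""
    return ["sports"] if "football" in text.lower() else ["other"]


def build_solr_update(message, field="category"):
    """Turn a queued task into a Solr atomic-update payload (a list of docs)."""
    task = json.loads(message.decode("utf-8"))
    # {"set": ...} replaces only this field and leaves the rest of the document alone.
    return [{"id": task["id"], field: {"set": categorize(task["text"])}}]


if __name__ == "__main__":
    msg = build_task_message("http://example.com/a", "Football results")
    # A worker would POST this JSON to the Solr update handler.
    print(json.dumps(build_solr_update(msg)))
```

The parsing plugin would only do the cheap part (build_task_message and the 
publish), so the slow classification never blocks the crawl.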

If this is not possible, then I could go the other way around: take the 
documents directly from Solr and ignore how they got there (which would be 
through Nutch, of course).
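
For this second route, a worker could poll Solr for documents that do not 
have a category yet. A minimal sketch of building such a query follows. The 
field names "category" and "content", the core URL, and the function name are 
my assumptions and depend on the actual schema:

```python
from urllib.parse import urlencode


def uncategorized_query_url(solr_base, field="category", rows=100):
    """Build a Solr select URL for documents where `field` has no value."""
    params = {
        # Match everything, then exclude documents that have any value in `field`.
        "q": "*:* -%s:[* TO *]" % field,
        # Assumption: the page text is stored in a field named "content".
        "fl": "id,content",
        "rows": rows,
        "wt": "json",
    }
    return solr_base.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical Solr location; replace it with the real core URL.
    print(uncategorized_query_url("http://localhost:8983/solr"))
```

Each result already carries its id, so the worker knows which document to 
update without any help from Nutch.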

Any thoughts on this?

Greetings!
