Hi

Thank you for taking the time to reply to my email, I really appreciate it.

> I'm thinking (just a thought for now) about a possible integration
> between Nutch and some message queue service (like RabbitMQ). The idea is
> to do some "offline" processing of data crawled by Nutch (and indexed
> into Solr). Let's take an example: I want to categorize the pages crawled
> by Nutch using some Machine Learning techniques (and store the category
> or categories in a Solr field). The main concern is that this is a
> potentially time-consuming task, so doing it during parsing (I was
> thinking of implementing a plugin for the categorization task) could slow
> the whole parsing phase down a lot. My approach is to crawl the URL and,
> in the parsing phase, send the required data to RabbitMQ, then have some
> workers do the actual categorization of the text. For this I need to send
> some sort of id to RabbitMQ, so that when the categorization is done the
> worker knows which document to update in Solr. Is it possible to get the
> id of a document sent to Solr by Nutch?
>
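A minimal sketch of what the queue side of that idea could look like, in Python. Everything here is hypothetical: `build_message`, `categorize`, and `handle_message` are names I made up, the keyword check stands in for a real ML model, and the actual RabbitMQ delivery (e.g. via a client library such as pika) and the Solr HTTP call are omitted.

```python
import json

def build_message(doc_id, text):
    """Build the payload sent to the queue from the parsing phase.
    The Solr document id (in Nutch, typically the page URL) travels
    with the text so the worker knows which document to update."""
    return json.dumps({"id": doc_id, "content": text})

def categorize(text):
    """Placeholder for the real classifier (e.g. a trained model)."""
    return "sports" if "football" in text.lower() else "other"

def handle_message(body):
    """Worker callback: classify the text and produce the update
    that would then be sent to Solr for that document id."""
    msg = json.loads(body)
    return {"id": msg["id"], "category": categorize(msg["content"])}

update = handle_message(build_message("http://example.com/page1",
                                      "a football match report"))
print(update)
```

The key point is simply that the id must ride along with the text in every message, otherwise the worker has no way to address the Solr document afterwards.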

bear in mind that there are no partial updates in SOLR, only complete
re-writes of the documents. You'd therefore need to have all the fields
you want in the document in order to re-index it.

Yes, I know about this, but by sending the id and the necessary data to 
RabbitMQ, I could then retrieve all the stored fields from Solr and "update" 
the document with the new data. But again, that is pretty much the same as 
getting all the required data from Solr in the first place.
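To make the "complete re-write" constraint concrete, here is a small sketch (my own illustration, with a hypothetical `reindex_with_category` helper): the worker would have to fetch every stored field for the document and re-send the whole thing with the new field added, since Solr replaces the document rather than patching it.

```python
def reindex_with_category(stored_fields, category):
    """Merge the new category into a full copy of the stored document.
    The whole document must be re-sent, because Solr replaces the
    existing document entirely on re-index."""
    doc = dict(stored_fields)   # every stored field must be present
    doc["category"] = category
    return doc

# stored_fields would come from a Solr query on the document id
stored = {"id": "http://example.com/page1",
          "title": "Match report",
          "content": "a football match report"}
print(reindex_with_category(stored, "sports"))
```

Any field that was not stored in Solr (index-only fields) would be lost in this round-trip, which is exactly why all the fields need to be available.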


> I'm thinking that if this is not possible then I should go the other way
> around: take the document directly from Solr and forget about how it got
> there (which would be Nutch, of course).
>

Hmmm. That would put a lot of unnecessary load on your SOLR instances and
you'd need to store all the fields in order to re-send their values for
indexing.

I agree with you!

What you described sounds like a mix of Nutch and Storm (
http://storm-project.net/), which could be a good idea as you don't want to
reinvent the wheel with the queues. From my experience with Text
Classification, this can be done as an indexing filter, so that you simply
create a new field for the value returned by the model; it does not add
that much time to the whole process - depending of course on the
implementation you use. I used it on a multimillion-page crawl with a
liblinear model and the overhead was not large enough to justify a more
convoluted approach.
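The indexing-filter idea above can be illustrated outside Nutch itself. This is only a sketch of the principle: the toy keyword lookup stands in for a trained model such as liblinear, `index_filter` is a hypothetical name, and the actual Nutch plugin wiring (which is done in Java against the Nutch indexing API) is omitted.

```python
def classify(text):
    """Toy stand-in for a trained model: return one label per page."""
    keywords = {"football": "sports", "election": "politics"}
    for word, label in keywords.items():
        if word in text.lower():
            return label
    return "other"

def index_filter(doc):
    """Mimic an indexing filter: compute one extra field from the
    parsed text and pass the document through otherwise unchanged."""
    doc["category"] = classify(doc.get("content", ""))
    return doc

print(index_filter({"id": "1", "content": "Football scores tonight"}))
```

Because the classifier runs inline during indexing, the category field is written in the same pass as everything else, which is why no re-read of Solr and no external queue are needed in this approach.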

Well, based on your experience this could be just what I need, and from what 
I can see this method scales well if it can handle crawls in the 
multimillion-page range.

Another way of doing it would be to decouple the ML from the rest by using
Behemoth as a bridge, with its text classification module (
https://github.com/DigitalPebble/behemoth-textclassification) combined with
the SOLR module. Basically you convert your segments into Behemoth
documents, do the text classification, then index into SOLR. The advantage
is that this is decoupled from the crawl jobs and could run at the same
time on your cluster.

This is worth checking out. Are there any disadvantages or performance 
impacts in converting the segments into Behemoth documents?

Again, bearing in mind the famous quote about premature optimization, I
would not rush to make things more complicated than necessary until you are
sure that this is really required. Having the text classification done as
an indexing plugin could be fine.

I'm a true believer in the KISS philosophy, but as I said in the previous 
message, these are just some random ideas, maybe for the not-so-near future 
:-). But if I see it through, I'll need to handle a very large number of pages 
to crawl and process, and keeping in mind that classification is not the only 
thing I'll need to do with the crawled pages, I was trying to decouple the 
process a bit! Storm seems to be a pretty good option and should scale well, 
and the second option with Behemoth is worth looking into (if necessary, as 
you brilliantly pointed out).

Do you have any other thoughts or experiences you would like to share with a newbie?

Greetings!

George!


HTH

Julien



Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


10th ANNIVERSARY OF THE CREATION OF THE UNIVERSITY OF INFORMATICS 
SCIENCES...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci