Mike Klaas wrote:
On 11-Oct-07, at 4:34 PM, Ravish Bhagdev wrote:

Hi Mike,

Thanks for your reply :)

I am not an expert in either! But I understand that Nutch stores
contents, albeit in a separate data structure (the "segment" discussed
in this thread). What I meant was that this seems like a much more
efficient way of presenting summaries or snippets (of course, for apps
that need only these) than using a stored field, which is the only
option in Solr - not only resulting in a huge index but also reducing
retrieval speed because of this increase in size (this is admittedly a
guess, I would like to know if that is not the case).  Also, for
queries requesting only ids/urls, the segments would never be touched,
even for the first n results...

Let me add a few comments, as someone who is pretty familiar with Nutch.

Indeed, there is a strong separation of data stores in Nutch: in order to get the maximum possible performance, Lucene indexes are not used for data storage - they contain only the bare essentials needed to compute the score, plus an "id" of a data record stored elsewhere. Confusingly, this location is called a "segment", and it consists of a bunch of Hadoop MapFiles and SequenceFiles - among them data files named "content", "parse_data" and "parse_text".
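To make the split concrete, here is a toy sketch of the idea in plain Java - the class and field names are mine, and plain maps stand in for the real Lucene index and Hadoop MapFiles - showing that a query touches only the term/score index, never the segment data:

```java
import java.util.*;

// Toy model (NOT the real Nutch/Hadoop API) of the index/segment split:
// the "index" knows only term -> (docId, score), while the full text
// lives in a separate "segment" store keyed by docId.
public class IndexSegmentSplit {
    // Minimal posting: document id plus a precomputed score.
    static final class Posting {
        final int docId;
        final float score;
        Posting(int docId, float score) { this.docId = docId; this.score = score; }
    }

    // Inverted index: term -> postings. This is all a query needs to touch.
    static final Map<String, List<Posting>> index = new HashMap<>();

    // "Segment": docId -> parse_text (plain text), consulted only on demand.
    static final Map<Integer, String> parseText = new HashMap<>();

    static void add(int docId, String text) {
        parseText.put(docId, text);
        for (String term : text.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(term, t -> new ArrayList<>())
                 .add(new Posting(docId, 1.0f));
        }
    }

    // Querying returns only ids (and could return scores); parseText is
    // never read here, mirroring Nutch's cheap search path.
    static List<Integer> search(String term) {
        List<Integer> ids = new ArrayList<>();
        for (Posting p : index.getOrDefault(term.toLowerCase(), List.of())) {
            ids.add(p.docId);
        }
        return ids;
    }

    // Summary retrieval is a separate, on-demand lookup in the "segment".
    static String summary(int docId) {
        String text = parseText.get(docId);
        return text.length() <= 40 ? text : text.substring(0, 40) + "...";
    }
}
```

A query that only needs ids/urls never touches the parseText store at all, which is exactly the cheap case Ravish asked about.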

When results are returned to the client (in this case, the Nutch front-end machine), they contain only the score and this id (plus, optionally, some other data needed for online de-duplication). In other words, Nutch doesn't transmit the whole "document" to the client, only the parts needed to prepare the presentation of the requested portion of hits.

Nutch stores plain text versions of documents in segments, in the "parse_text" file, and retrieves this data on demand, i.e. when a client requests a summary to be presented. The Nutch front-end uses Hadoop RPC to communicate with back-end servers, and can retrieve one or several summaries in a single call, which reduces network traffic.
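The batching point can be illustrated with a hypothetical getSummaries call - in real Nutch this goes over Hadoop RPC to a back-end server, while here an in-memory map stands in for the "parse_text" file - the key idea being that the caller sends all wanted ids at once and gets every summary back in one round trip:

```java
import java.util.*;

// Hypothetical sketch of batched summary retrieval (names are mine,
// not the real Nutch API). One logical "RPC" serves many hits.
public class BatchedSummaries {
    // Stand-in for the back-end's parse_text store.
    static final Map<Integer, String> parseText = new HashMap<>();

    // The caller sends all wanted ids in one request; the server answers
    // with every summary in a single response instead of one call per hit.
    static Map<Integer, String> getSummaries(Collection<Integer> docIds) {
        Map<Integer, String> out = new LinkedHashMap<>();
        for (int id : docIds) {
            String text = parseText.getOrDefault(id, "");
            out.put(id, text.length() <= 30 ? text
                                            : text.substring(0, 30) + "...");
        }
        return out;
    }
}
```

For a page of ten hits this means one round trip instead of ten, which is where the traffic reduction comes from.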

In a similar way the original binary content of a document can be requested if needed, and it will be retrieved from the "content" MapFile in a "segment".

The advantage of this approach is that you can keep the index size to a minimum (it contains mostly unstored fields), and that you can associate arbitrary binary data with a Lucene document. The downside is the increased cost of managing many data files - but in Nutch this cost is largely hidden behind specialized *Reader facades.


It doesn't slow down querying, but it does slow down document retrieval (if you are never going to request the summaries for those documents, the extra stored data is pure overhead). That is the case I was referring to below.

This is the case for which the Nutch architecture is optimized.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
