Mike Klaas wrote:
On 11-Oct-07, at 4:34 PM, Ravish Bhagdev wrote:

Hi Mike,

Thanks for your reply :)

I am not an expert in either! But I understand that Nutch stores
contents, albeit in a separate data structure (the "segment" discussed
in this thread). What I meant was that this seems like a much more
efficient way of presenting summaries or snippets (of course, for apps
that need only these) than using a stored field, which is the only
option in Solr - not only resulting in a huge index but also reducing
retrieval speed because of this increase in size (this is admittedly a
guess, I would like to know if that is not the case).  Also, for
queries requesting only ids/urls, the segments would never be touched,
even for the first n results...

Let me add a few comments, as someone who is pretty familiar with Nutch.

Indeed, there is a strong separation of data stores in Nutch: in order to get the maximum possible performance, Lucene indexes are not used for data storage - they contain only the bare essentials needed to compute the score, plus an "id" of a data record stored elsewhere. Confusingly, this location is called a "segment", and it consists of a bunch of Hadoop MapFiles and SequenceFiles - among them data files named "content", "parse_data" and "parse_text".
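To make the split concrete, here is a toy sketch of the idea in plain Java - the class and field names are mine, and plain maps stand in for the real Lucene index and Hadoop MapFiles - showing that a query touches only the term/score index, never the segment data:

```java
import java.util.*;

// Toy model (NOT the real Nutch/Hadoop API) of the index/segment split:
// the "index" knows only term -> (docId, score), while the full text
// lives in a separate "segment" store keyed by docId.
public class IndexSegmentSplit {
    // Minimal posting: document id plus a precomputed score.
    static final class Posting {
        final int docId;
        final float score;
        Posting(int docId, float score) { this.docId = docId; this.score = score; }
    }

    // Inverted index: term -> postings. This is all a query needs to touch.
    static final Map<String, List<Posting>> index = new HashMap<>();

    // "Segment": docId -> parse_text (plain text), consulted only on demand.
    static final Map<Integer, String> parseText = new HashMap<>();

    static void add(int docId, String text) {
        parseText.put(docId, text);
        for (String term : text.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(term, t -> new ArrayList<>())
                 .add(new Posting(docId, 1.0f));
        }
    }

    // Querying returns only ids (and could return scores); parseText is
    // never read here, mirroring Nutch's cheap search path.
    static List<Integer> search(String term) {
        List<Integer> ids = new ArrayList<>();
        for (Posting p : index.getOrDefault(term.toLowerCase(), List.of())) {
            ids.add(p.docId);
        }
        return ids;
    }

    // Summary retrieval is a separate, on-demand lookup in the "segment".
    static String summary(int docId) {
        String text = parseText.get(docId);
        return text.length() <= 40 ? text : text.substring(0, 40) + "...";
    }
}
```

A query that only needs ids/urls never touches the parseText store at all, which is exactly the cheap case Ravish asked about.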

When results are returned to the client (in this case, the Nutch front-end machine), they contain only the score and this id (plus, optionally, some other data needed for online de-duplication). In other words, Nutch doesn't transmit the whole "document" to the client, only the parts needed to prepare the presentation of the requested portion of hits.

Nutch stores plain text versions of documents in segments, in the "parse_text" file, and retrieves this data on demand, i.e. when a client requests a summary to be presented. The Nutch front-end uses Hadoop RPC to communicate with back-end servers, and can retrieve one or several summaries in a single call, which reduces network traffic.
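The batching point can be illustrated with a hypothetical getSummaries call - in real Nutch this goes over Hadoop RPC to a back-end server, while here an in-memory map stands in for the "parse_text" file - the key idea being that the caller sends all wanted ids at once and gets every summary back in one round trip:

```java
import java.util.*;

// Hypothetical sketch of batched summary retrieval (names are mine,
// not the real Nutch API). One logical "RPC" serves many hits.
public class BatchedSummaries {
    // Stand-in for the back-end's parse_text store.
    static final Map<Integer, String> parseText = new HashMap<>();

    // The caller sends all wanted ids in one request; the server answers
    // with every summary in a single response instead of one call per hit.
    static Map<Integer, String> getSummaries(Collection<Integer> docIds) {
        Map<Integer, String> out = new LinkedHashMap<>();
        for (int id : docIds) {
            String text = parseText.getOrDefault(id, "");
            out.put(id, text.length() <= 30 ? text
                                            : text.substring(0, 30) + "...");
        }
        return out;
    }
}
```

For a page of ten hits this means one round trip instead of ten, which is where the traffic reduction comes from.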

In a similar way the original binary content of a document can be requested if needed, and it will be retrieved from the "content" MapFile in a "segment".

The advantage of this approach is that you can keep the index size to a minimum (it contains mostly unstored fields), and that you can associate arbitrary binary data with a Lucene document. The downside is the increased cost of managing many data files - but in Nutch this cost is largely hidden behind specialized *Reader facades.


It doesn't slow down querying, but it does slow down document retrieval (if you are never going to request the summaries for those documents, the extra stored data is pure overhead). That is the case I was referring to below.

This is the case for which the Nutch architecture is optimized.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
