Hey guys,

Check out this thread I opened on the Nutch mailing list. It looks like Solr could benefit from reusing Nutch's "segment"-based storage strategy to return snippets, summaries, etc. efficiently without using Lucene stored fields?
Was this considered before?

Ravish

---------- Forwarded message ----------
From: Dennis Kubes <[EMAIL PROTECTED]>
Date: Oct 11, 2007 11:27 PM
Subject: Re: snippets and stored field in nutch...
To: [EMAIL PROTECTED]

The reason it is stored in the segments instead of in the index is to allow summarizers to be run on the content of hits to produce the summaries that appear in the search results. Summarizers are pluggable, and the actual content used to produce the summary can change. Summaries can also be changed without re-fetching or re-indexing. If a summary were stored in the index, re-indexing would have to occur to make changes.

Also, the way the search process works, Nutch returns hits (basically document ids). These hits are then sorted and deduped, and the best x number (usually 10) are returned. Hit details (fields in the index) and summaries are retrieved for only these 10 best hits. So there is something to be said about the amount of data being pushed over the network.

Dennis Kubes

Ravish Bhagdev wrote:
> Ah, I see, didn't know that, Thanks!
>
> Interesting that nutch stores it in a different structure (segments)
> and doesn't reuse Lucene strategy of storing within index. Any
> particular reason why? Is there any other use of "Segments" data
> structure except to return snippets?
>
> Cheers,
> Ravish
>
> On 10/11/07, John H. Lee <[EMAIL PROTECTED]> wrote:
>> Hi Ravish.
>>
>> You are correct that Nutch does not store document content in the
>> Lucene index. The content *is* stored in the Nutch segment, which is
>> where snippets come from.
>>
>> Hope this helps.
>>
>> -J
>>
>>
>> On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:
>>
>>> Hey All,
>>>
>>> Am I right in believing that in Lucene/Nutch, to be able to return
>>> content or snippet to a search query, the field to be returned has to
>>> be stored?
>>>
>>> AFAIK, by default, Nutch does not store the document field, am I
>>> right? If so, how does it manage to return snippets? Wouldn't the
>>> index be quite huge if nutch were storing document field by default?
>>>
>>> I will appreciate any help/comments as I'm a bit lost with this.
>>>
>>> Ravi
>>
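
To make the flow Dennis describes a bit more concrete, here is a rough sketch in Python. It is not Nutch or Solr code. Every name in it (Hit, dedup_key, top_hits, render_results, first_match_summarizer, and the two dicts standing in for the index and the segment store) is made up for illustration. The idea is only that hits are sorted and deduped first, and that index fields and segment content are looked up for the best few hits alone, with the summarizer passed in as a pluggable function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hit:
    """A search hit: basically a document id plus a score."""
    doc_id: int
    score: float
    dedup_key: str  # hypothetical: e.g. a site name or content hash


# A summarizer takes (content, query) and returns a snippet.
Summarizer = Callable[[str, str], str]


def first_match_summarizer(content: str, query: str, width: int = 30) -> str:
    """Toy summarizer: a window of text around the first query match."""
    pos = content.lower().find(query.lower())
    if pos < 0:
        return content[:width]
    start = max(0, pos - width // 2)
    return content[start:start + width]


def top_hits(hits: List[Hit], k: int = 10) -> List[Hit]:
    """Sort by score, drop duplicates by dedup_key, keep the best k."""
    seen = set()
    best = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.dedup_key in seen:
            continue
        seen.add(hit.dedup_key)
        best.append(hit)
        if len(best) == k:
            break
    return best


def render_results(
    hits: List[Hit],
    index_fields: Dict[int, dict],     # stand-in for fields stored in the index
    segment_content: Dict[int, str],   # stand-in for content kept in segments
    query: str,
    summarizer: Summarizer,
    k: int = 10,
) -> List[dict]:
    """Fetch details and build summaries only for the best k hits."""
    results = []
    for hit in top_hits(hits, k):
        details = index_fields[hit.doc_id]
        summary = summarizer(segment_content[hit.doc_id], query)
        results.append({"doc_id": hit.doc_id, **details, "summary": summary})
    return results
```

Swapping in a different summarizer function changes the snippets without touching the index, which is the point Dennis makes about not having to re-index.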