On 01/12/2011 12:02 PM, Otis Gospodnetic wrote:
> Hello,
>
> I'm indexing some content (articles) whose text I cannot store in its original
> form for copyright reasons.  So I can index the content, but cannot store it.
> However, I need snippets and search term highlighting.
>
>
> Any way to accomplish this elegantly?  Or even not so elegantly?
>
> Here is one idea:
>
> * Create 2 indices: main index for indexing (but not storing) the original
> content, the secondary index for storing individual sentences from the
> original article.
How about storing the sentences in the same index, in a separate field,
but in random order? Would that be OK?

Tarjei
> * That is, before indexing an article, split it into sentences.  Then index
> the article in the main index, and index+store each sentence in the secondary
> index.  So for each doc in the main index there will be multiple docs in the
> secondary index with individual sentences.  Each sentence doc includes an ID
> of the "parent" document.
>
> * Then run queries against the main index, and pull individual sentences from 
> the secondary index for snippet+highlight purposes.
>
>
> The problem I see with this approach (and there may be other ones that I am
> not seeing yet) is with queries like foo AND bar.  In this case "foo" may be a
> match from sentence #1, and "bar" may be a match from sentence #7.  Or maybe
> "foo" is a match in sentence #1, and "bar" is a match in multiple sentences:
> #7 and #10 and #23.
>
> Regardless, when a query is run against the main index, you don't know where
> the match was, so you don't know which sentences to go get from the secondary
> index.
>
> Does anyone have any suggestions for how to handle this?
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
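
To make the random-ordering idea concrete, here is a rough sketch in plain
Python (no Lucene or Solr calls). It assumes the sentences are stored shuffled
next to the parent document, so there is no need to know in advance which
sentence produced the match: once the main query returns a document, its
stored sentences are scanned for the query terms. The function names, the
naive sentence splitter and the <em> markup are all made up for illustration,
and plain word matching stands in for whatever analysis the real index uses.

```python
import random
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def shuffled_sentences(text, seed=None):
    # Sentences in random order, so the stored field does not reproduce the article.
    sentences = split_sentences(text)
    random.Random(seed).shuffle(sentences)
    return sentences


def snippets(stored_sentences, terms, max_snippets=3):
    # Pick the stored sentences that contain any query term and highlight them.
    if not terms:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    scored = []
    for sentence in stored_sentences:
        matched = {m.group(0).lower() for m in pattern.finditer(sentence)}
        if matched:
            scored.append((len(matched), pattern.sub(r"<em>\1</em>", sentence)))
    # Sentences covering more distinct terms come first (the sort is stable).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [highlighted for _, highlighted in scored[:max_snippets]]
```

For a query like foo AND bar this returns the sentences holding either term,
with any sentence that holds both ranked first. Whether shuffled sentences
are acceptable from the copyright side is a separate question.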


-- 
Regards / Med vennlig hilsen
Tarjei Huse
Mobil: 920 63 413
