Re: Not storing, but highlighting from document sentences

Otis Gospodnetic Wed, 12 Jan 2011 08:23:14 -0800

Hi Steve,



----- Original Message ----
> From: Steven A Rowe <sar...@syr.edu>
> Subject: RE: Not storing, but highlighting from document sentences
> 
> I think you can get what you want by doing the first stage retrieval, and then
> in the second stage, add required constraint(s) to the query for the matching
> docid(s), and change the AND operators in the original query to OR.
> Coordination will cause the best snippet(s) to rise to the top, no?

Right, right.
So if the original query is: foo AND bar, I'd run it against the main index,
get the top N hits, say N=10.
Then I'd create another query:
+(foo OR bar) +articleID:(ORed list of top N article IDs from main results)
And then I'd use that to get enough "sentence docs" to have at least 1 of them 
for each hit from the main index.
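
To make the second-stage query concrete, here is a minimal Python sketch of how
that query string could be assembled. The function name and the sample article
IDs are made up for illustration; only the articleID field and the query shape
come from the description above.

```python
def build_sentence_query(terms, article_ids, id_field="articleID"):
    """Build the second-stage query for the sentence index.

    Hypothetical helper: at least one term must match, and the sentence
    must belong to one of the top N articles from the main index.
    """
    if not terms or not article_ids:
        raise ValueError("need at least one term and one article ID")
    term_clause = " OR ".join(terms)
    id_clause = " OR ".join(str(a) for a in article_ids)
    return f"+({term_clause}) +{id_field}:({id_clause})"


# Example: original query was "foo AND bar", top hits were articles 101, 205, 317.
query = build_sentence_query(["foo", "bar"], [101, 205, 317])
print(query)  # +(foo OR bar) +articleID:(101 OR 205 OR 317)
```

The resulting string would then be sent to the sentence index as the q
parameter; terms containing special query-parser characters would need escaping
first, which this sketch does not do.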

Hm, I wonder what happens when instead of simple foo AND bar you have a more 
complex query with more elaborate grouping and such...
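
One rough way to handle grouped queries might be to leave the structure alone
and only swap the operators. The Python sketch below is an assumption on my
part, not something tested against a real query parser: it replaces standalone
AND with OR and leaves double-quoted phrases untouched. It does not handle &&,
NOT, +/- prefixes, or an implicit default operator.

```python
import re

# Match either a double-quoted phrase (kept as-is) or a standalone AND operator.
_PHRASE_OR_AND = re.compile(r'"[^"]*"|\bAND\b')


def relax_and_to_or(query):
    """Replace AND operators with OR, skipping text inside quoted phrases."""
    return _PHRASE_OR_AND.sub(
        lambda m: "OR" if m.group(0) == "AND" else m.group(0), query
    )


relaxed = relax_and_to_or('(foo AND bar) AND "rock AND roll"')
print(relaxed)  # (foo OR bar) OR "rock AND roll"
```

The grouping parentheses survive unchanged, so the relaxed query can still be
wrapped as +(...) and combined with the articleID restriction.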


> Hmm, you'll want to run the second stage once for each hit from the first
> stage, though, unless you can afford to collect *all* hits and pull out each
> first stage's hit from the intermixed second stage results...

Wouldn't the above get me all sentences I need for top N hits from the main 
result in a single shot, assuming I use high enough rows=NNN to minimize the 
possibility of not getting even 1 sentence for any one of those top N hits?
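
For the single-shot variant, the intermixed sentence hits would have to be
bucketed by parent article afterwards. A minimal Python sketch of that step
follows. It assumes each sentence doc comes back as a dict in relevance order
with an articleID field and a stored "sentence" field (the latter name is
hypothetical), and it also reports which top N articles got no sentence at all,
so those could be re-queried individually.

```python
from collections import defaultdict


def group_sentences(top_article_ids, sentence_docs, per_article=1):
    """Bucket ranked sentence docs by parent article.

    Returns (snippets, missing): the best `per_article` sentences for each
    article, and the article IDs for which no sentence came back.
    Assumes per_article >= 1 and sentence_docs sorted by relevance.
    """
    wanted = set(top_article_ids)
    snippets = defaultdict(list)
    for doc in sentence_docs:
        aid = doc["articleID"]
        if aid in wanted and len(snippets[aid]) < per_article:
            snippets[aid].append(doc["sentence"])
    missing = [aid for aid in top_article_ids if aid not in snippets]
    return dict(snippets), missing


docs = [
    {"articleID": 1, "sentence": "foo here"},
    {"articleID": 1, "sentence": "bar there"},
    {"articleID": 2, "sentence": "foo and bar"},
]
snippets, missing = group_sentences([1, 2, 3], docs)
print(snippets)  # {1: ['foo here'], 2: ['foo and bar']}
print(missing)   # [3]
```

If missing is non-empty, either rows was too low or no sentence of that
article matched any term; a follow-up query restricted to just those article
IDs would tell the two cases apart.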

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/ 

> Steve
> 
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> > Sent: Wednesday, January 12, 2011 7:29 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Not storing, but highlighting from document sentences
> > 
> > Hi Stefan,
> > 
> > Yes, splitting in separate sentences (and storing them) is OK because with a
> > bunch of sentences you can't really reconstruct the original article unless you
> > know which order to put them in.
> >
> > Searching against the sentence won't work for queries like foo AND bar because
> > this should match original articles even if foo and bar are in different
> > sentences.
> > 
> > Otis
> > 
> > 
> > 
> > ----- Original Message ----
> > > From: Stefan Matheis <matheis.ste...@googlemail.com>
> > > To: solr-user@lucene.apache.org
> > > Sent: Wed, January 12, 2011 7:02:46 AM
> > > Subject: Re: Not storing, but highlighting from document sentences
> > >
> > > Otis,
> > >
> > > just interested in .. storing the full text is not allowed, but splitting up
> > > in separate sentences is okay?
> > >
> > > while you think about using the sentences only as secondary/additional
> > > source, maybe it would help to search in the sentences themselves, or would
> > > that give misleading results in your case?
> > >
> > > Stefan
> > >
> > > On Wed, Jan 12, 2011 at 12:02 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I'm indexing some content (articles) whose text I cannot store in its original
> > > > form for copyright reasons. So I can index the content, but cannot store it.
> > > > However, I need snippets and search term highlighting.
> > > >
> > > >
> > > > Any way to accomplish this elegantly? Or even not so elegantly?
> > > >
> > > > Here is one idea:
> > > >
> > > > * Create 2 indices: main index for indexing (but not storing) the original
> > > >   content, the secondary index for storing individual sentences from the
> > > >   original article.
> > > >
> > > > * That is, before indexing an article, split it into sentences. Then index the
> > > >   article in the main index, and index+store each sentence in the secondary
> > > >   index. So for each doc in the main index there will be multiple docs in the
> > > >   secondary index with individual sentences. Each sentence doc includes an ID
> > > >   of the "parent" document.
> > > >
> > > > * Then run queries against the main index, and pull individual sentences from
> > > >   the secondary index for snippet+highlight purposes.
> > > >
> > > >
> > > > The problem I see with this approach (and there may be other ones that I am not
> > > > seeing yet) is with queries like foo AND bar. In this case "foo" may be a match
> > > > from sentence #1, and "bar" may be a match from sentence #7. Or maybe "foo" is
> > > > a match in sentence #1, and "bar" is a match in multiple sentences: #7 and #10
> > > > and #23.
> > > >
> > > > Regardless, when a query is run against the main index, you don't know where the
> > > > match was, so you don't know which sentences to go get from the secondary
> > > > index.
> > > >
> > > > Does anyone have any suggestions for how to handle this?
> > > >
> > > > Thanks,
> > > > Otis
> > > > ----
> > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > > Lucene ecosystem search :: http://search-lucene.com/
> > > >
> > > >
> > >
>
