Re: highlighting/summarizing and solr

Mike Klaas Thu, 22 Jun 2006 15:33:17 -0700

On 6/22/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:

: It does seem like it would be easier for clients to parse document
: associated data if it is included directly in the <doc> element.

I actually like the idea that it's included separately ... it's really
not that much harder to get at than if it's in the individual documents,
and it makes it really easy to differentiate between "stored fields" of the
document and "highlighted" info about the document .. especially if
highlighting can be applied to non-stored fields using TermVectors.


I'm inclined to agree.  Note: term vectors are not sufficient in
themselves to produce a highlit fragment.  Highlighter does not have
the support.  It could be added, but as term vectors do not include
punctuation or whitespace, and the tokens they produce aren't always
aesthetically pleasing (e.g. they may be stemmed words), the summaries
may look a little strange.

More useful would be to emit a list of document offsets rather than
summaries; these can be used by an external application to extract
summaries.
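
To make the offsets idea concrete, here is a minimal sketch of what the
external application's side might look like.  Everything in it is
hypothetical (the OffsetSnippets class, the Span type, the extract()
signature); nothing here is existing Solr or Lucene API.  It assumes the
response carries character offsets into the original stored text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client-side helper; not part of Solr or Lucene. */
final class OffsetSnippets {

    /** A match span in the original text: start inclusive, end exclusive. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Builds one snippet per span by taking up to `context` characters on
     * each side of the match and wrapping the match in the given markers.
     */
    static List<String> extract(String text, List<Span> spans, int context,
                                String pre, String post) {
        List<String> out = new ArrayList<>();
        for (Span s : spans) {
            // Skip offsets that do not fit this text (e.g. a stale index).
            if (s.start < 0 || s.end > text.length() || s.start >= s.end) {
                continue;
            }
            int from = Math.max(0, s.start - context);
            int to = Math.min(text.length(), s.end + context);
            out.add(text.substring(from, s.start) + pre
                    + text.substring(s.start, s.end) + post
                    + text.substring(s.end, to));
        }
        return out;
    }
}
```

Since the client cuts fragments out of the original text, punctuation and
unstemmed words survive, which is the point of returning offsets instead
of rebuilding summaries from term vectors.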

It also allows the highlighting section of the response to include a lot
of extra data about the highlighted snippets, that would be cumbersome to
try and fit into the <doc>.  I started hypothesizing down this road in
this old message...
        http://www.nabble.com/Re%3A-highlighting-p3954083.html
...but didn't really get to some of the crazier things you could do with
it (like reporting back where in the document a snippet starts)


Something along these lines seems reasonable (that we came up with
near-identical schemas reinforces that).  I originally had a list per
field for multiple fragments as well, though I scrapped it for
simplicity.

Does breaking down the highlit segments give significantly more power
to the user over simply allowing a custom Formatter?

: I'm not sure if this is really the property of a field.
: Another possibility is using init params in the request handler
: defined in solrconfig.xml, with the possibility of overriding them in
: a request.

I agree with Yonik .. it might be useful if there was a "suggested
highlighter configuration" at the Field/FieldType level ...  but this
really seems like a RequestHandler configuration option to me (where the
RequestHandler can decide whether to have a query time option to override
its behavior).  That way you can have one instance of the
XyzRequestHandler which does highlighting on the "title" field, and
another instance with different init params that does highlighting on both
the "title" and "summary" fields, and another with different init params
that does summarizing/highlighting across the title/summary and body
fields only returning the most relevant snippets (where there can be
snippet weighting based on field importance or something)

those should all be up to the person configuring the way the queries work
-- not the guy designing the schema.


Not unreasonable.  Any objections to augmenting StandardRequestHandler
with the ability to store config-time param defaults (as DisMax does
currently)?
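
Roughly what I have in mind, as a sketch only: the handler keeps the
defaults it was given at init time and lets request params override them.
The HighlightParams class and the parameter names ("highlightFields",
"maxSnippets") are made up for illustration; this is not the current
StandardRequestHandler or DisMax code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a handler holding init-time defaults. */
final class HighlightParams {

    // Defaults captured once, when the handler instance is configured.
    private final Map<String, String> defaults;

    HighlightParams(Map<String, String> initDefaults) {
        this.defaults = new HashMap<>(initDefaults);
    }

    /** Request values win; anything absent falls back to the default. */
    Map<String, String> resolve(Map<String, String> request) {
        Map<String, String> merged = new HashMap<>(defaults);
        merged.putAll(request);
        return merged;
    }
}
```

Two handler instances with different init maps would then give the
"title only" and "title plus summary" behaviours described above, without
touching the schema.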

<>

Assuming that's an invariant, you could add an option to the request
handler to use a custom analyzer for the purposes of highlighting stored
fields (independent of the field type) ... that doesn't really help the
TermVectors situation, but assuming that invariant the only thing that
can help you here is using an indexing analyzer that doesn't produce
multiple tokens at the same position.


It's actually less of a problem with term vectors as their use by
Highlighter chooses only one token among the possibilities.  I'll see
if I can get that fixed in Lucene.

Should I submit a patch as a starting point?
-Mike
