On 6/22/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: It does seem like it would be easier for clients to parse document
: associated data if it is included directly in the <doc> element.
I actually like the idea that it's included separately ... it's really not that much harder to get at than if it's in the individual documents, and it makes it really easy to differentiate between "stored fields" of the document and "highlighted" info about the document .. especially if highlighting can be applied to non-stored fields using TermVectors.
I'm inclined to agree. Note: term vectors are not sufficient in themselves to produce a highlighted fragment. Highlighter does not have support for this. It could be added, but since term vectors do not include punctuation or whitespace, and the tokens they produce aren't always aesthetically pleasing (e.g. they may be stemmed words), the summaries may look a little strange. More useful would be to emit a list of document offsets rather than summaries; these can be used by an external application to extract summaries.
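To make the offsets idea concrete, here is a rough sketch of what an external application might do with such a list. This is not Highlighter's API: the OffsetSnippets and Span names, and the assumption that offsets are half-open character ranges into the stored text, are made up for illustration.

```java
import java.util.List;

// Hypothetical client-side helper; nothing here exists in Lucene or Solr.
final class OffsetSnippets {

    /** One match, as half-open [start, end) character offsets into the stored text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Cuts a snippet around one match, with up to `context` characters on each side. */
    static String snippet(String text, Span match, int context) {
        int from = Math.max(0, match.start - context);
        int to = Math.min(text.length(), match.end + context);
        return text.substring(from, to);
    }

    /** Wraps each match in pre/post markers; spans must be sorted and non-overlapping. */
    static String mark(String text, List<Span> matches, String pre, String post) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Span m : matches) {
            out.append(text, last, m.start)   // text between the previous match and this one
               .append(pre)
               .append(text, m.start, m.end)  // the matched term, taken from the original text
               .append(post);
            last = m.end;
        }
        out.append(text, last, text.length()); // trailing text after the final match
        return out.toString();
    }
}
```

Because the snippet is cut from the original stored text, punctuation, whitespace and unstemmed word forms are preserved, which is the point of returning offsets instead of reassembled tokens.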
It also allows the highlighting section of the response to include a lot of extra data about the highlighted snippets that would be cumbersome to try and fit into the <doc>. I started hypothesizing down this road in this old message... http://www.nabble.com/Re%3A-highlighting-p3954083.html ...but didn't really get to some of the crazier things you could do with it (like reporting back where in the document a snippet starts).
Something along these lines seems reasonable (that we came up with near-identical schemas reinforces that). I originally had a list per field for multiple fragments as well, though I scrapped it for simplicity. Does breaking down the highlighted segments give significantly more power to the user than simply allowing a custom Formatter?
: I'm not sure if this is really the property of a field.
: Another possibility is using init params in the request handler
: defined in solrconfig.xml, with the possibility of overriding them in
: a request.
I agree with Yonik .. it might be useful if there was a "suggested highlighter configuration" at the Field/FieldType level ... but this really seems like a RequestHandler config option to me (where the RequestHandler can decide whether to have a query-time option to override its behavior). That way you can have one instance of the XyzRequestHandler which does highlighting on the "title" field, another instance with different init params that does highlighting on both the "title" and "summary" fields, and another with different init params that does summarizing/highlighting across the title/summary and body fields, only returning the most relevant snippets (where there can be snippet weighting based on field importance or something). Those should all be up to the person configuring the way the queries work -- not the guy designing the schema.
Not unreasonable. Any objections to augmenting StandardRequestHandler with the ability to store config-time param defaults (as DisMax does currently)? <>
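Roughly what I have in mind, as a sketch only: this is not the actual StandardRequestHandler or DisMax code, and the HandlerParams class and the "highlightFields" parameter name used below are hypothetical. The handler keeps the defaults it was given at config time and consults them only when the request does not supply a value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of Solr.
final class HandlerParams {

    // Defaults captured once, when the handler instance is initialized from its config.
    private final Map<String, String> defaults;

    HandlerParams(Map<String, String> configDefaults) {
        this.defaults = new HashMap<>(configDefaults);
    }

    /** Returns the request's value if present, otherwise the config-time default (or null). */
    String get(Map<String, String> request, String name) {
        String value = request.get(name);
        return value != null ? value : defaults.get(name);
    }
}
```

One handler instance configured with highlightFields=title and another with highlightFields=title,summary would then behave differently by default, while a request could still override either by passing the parameter itself.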
Assuming that's an invariant, you could add an option to the request handler to use a custom analyzer for the purposes of highlighting stored fields (independent of the field type) ... that doesn't really help the TermVectors situation, but assuming that invariant, the only thing that can help you here is using an indexing analyzer that doesn't produce multiple tokens at the same position.
It's actually less of a problem with term vectors, since Highlighter chooses only one token among the possibilities when it uses them. I'll see if I can get that fixed in Lucene. Should I submit a patch as a starting point?
-Mike