Re: Document contents split among different Fields
On Sep 23, 2004, at 6:00 PM, Greg Langmead wrote:
> Doug Cutting wrote:
>> Do you need highlights from all fields? If so, then you can use:
>>
>>   TextFragment[] getBestTextFragments(TokenStream, ...);
>>
>> with a TokenStream for each field, then select the highest scoring
>> fragments across all fields. Would that work for you?
>
> Thanks for the reply. I can't find code like this in the lucene or
> lucene-demo packages -- is this something implemented, or did you mean
> it as an example?

TextFragment is part of the Highlighter contribution in the jakarta-lucene-sandbox CVS repository. Check it out and you'll have what Doug is speaking of.

Erik

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
RE: Document contents split among different Fields
Doug Cutting wrote:
> Do you need highlights from all fields? If so, then you can use:
>
>   TextFragment[] getBestTextFragments(TokenStream, ...);
>
> with a TokenStream for each field, then select the highest scoring
> fragments across all fields. Would that work for you?

Thanks for the reply. I can't find code like this in the lucene or lucene-demo packages -- is this something implemented, or did you mean it as an example? Once I get a text fragment, are you proposing using it to do a secondary search within the source document, to match the fragment?

I would like to do highlighting on content from either of my Fields, but I think that even if I didn't I'd have the same problem, because I'll have punched holes in the text Field and the positional data within the Field no longer reflects the position in the source.

I think that if I want to pick the document apart into pieces like this, then I need to do some work to restore global positional data, by squirreling away the size of the holes I punch (the size of the XML islands, from the text Field's point of view, and the size of the text runs, from the island Field's point of view). If I store a special textual escape within the Field data that records the length of each gap, then I can read those escapes when Tokenizing the Field and add the number stored therein to the Token offset, restoring the global positional data.

Does that make sense? I'm concerned this does violence to Lucene's model, which I've only been studying for a couple of weeks now.

Greg
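The gap-escape idea described in this message can be sketched in plain Java, without any Lucene classes. Everything named here is an assumption made for illustration: the escape format (U+0001, a decimal gap length, U+0002), the `GapOffsets` and `Tok` classes, and the `\w+` word rule that stands in for a real Analyzer. In Lucene the same arithmetic would presumably live in a custom Tokenizer, applied when each Token's start and end offsets are set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GapOffsets {
    /** Hypothetical token: a term plus its offsets in the ORIGINAL source document. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed escape format: U+0001, decimal gap length, U+0002.
    static final char OPEN = '\u0001';
    static final char CLOSE = '\u0002';
    static final Pattern PIECE = Pattern.compile(OPEN + "(\\d+)" + CLOSE + "|(\\w+)");

    /** Builds the escape that records a removed span of the given length. */
    static String gap(int length) {
        return "" + OPEN + length + CLOSE;
    }

    /** Tokenizes field data and reports offsets relative to the source, not the field. */
    static List<Tok> tokenize(String fieldData) {
        List<Tok> out = new ArrayList<>();
        Matcher m = PIECE.matcher(fieldData);
        int shift = 0; // (source offset) - (field offset) at the current position
        while (m.find()) {
            if (m.group(1) != null) {
                // The escape stands for a gap of N source characters, but it
                // occupies (end - start) characters of field data itself.
                shift += Integer.parseInt(m.group(1)) - (m.end() - m.start());
            } else {
                out.add(new Tok(m.group(2), m.start() + shift, m.end() + shift));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Source: "See <m>x+1</m> for details"; the 10-character island was removed.
        String textField = "See " + gap(10) + " for details";
        for (Tok t : tokenize(textField)) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ")");
        }
        // Prints: See [0,3), for [15,18), details [19,26)
    }
}
```

Note that the escape's own length has to be subtracted from the shift, since the escape takes up room in the stored Field data that the source never had.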
Re: Document contents split among different Fields
Greg Langmead wrote:
> Am I right in saying that the design of Token's support for highlighting
> really only supports having the entire document stored as one monolithic
> "contents" Field?

No, I don't think so.

> Has anyone tackled indexing multiple content Fields before that could
> shed some light?

Do you need highlights from all fields? If so, then you can use:

  TextFragment[] getBestTextFragments(TokenStream, ...);

with a TokenStream for each field, then select the highest scoring fragments across all fields. Would that work for you?

Doug
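The "select the highest scoring fragments across all fields" step might look roughly like the sketch below. It deliberately avoids the Highlighter classes: `ScoredFragment` is a hypothetical stand-in for whatever getBestTextFragments returns per field, and `best` is an invented helper name. It also assumes that fragment scores from different fields are comparable, which may not hold when the fields use different Analyzers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CrossFieldFragments {
    /** Hypothetical stand-in for a scored highlighter fragment; not Lucene's TextFragment. */
    static final class ScoredFragment {
        final String field;
        final String text;
        final float score;

        ScoredFragment(String field, String text, float score) {
            this.field = field;
            this.text = text;
            this.score = score;
        }
    }

    /** Pools the per-field fragment lists and returns the top `max` by score, best first. */
    static List<ScoredFragment> best(List<List<ScoredFragment>> perField, int max) {
        List<ScoredFragment> all = new ArrayList<>();
        for (List<ScoredFragment> fragments : perField) {
            all.addAll(fragments);
        }
        all.sort(Comparator.comparingDouble((ScoredFragment f) -> f.score).reversed());
        int keep = Math.max(0, Math.min(max, all.size()));
        return new ArrayList<>(all.subList(0, keep));
    }
}
```

Each inner list would be filled from one call per field, with that field's own TokenStream.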
Document contents split among different Fields
I am working on extending Lucene to support documents with special islands of an XML language, and I want to index the islands differently from the text. My current plan is to break the document's contents into two Fields, one with all the text and one with all the special islands, and use a different Analyzer on each Field.

In heading down this road, I realized that this approach breaks the whole model of Token as it supports highlighting. Token seems designed to store offsets within a given Field, so if you break a document up into pieces, the offsets are meaningless in terms of the original source document.

Am I right in saying that the design of Token's support for highlighting really only supports having the entire document stored as one monolithic "contents" Field? Has anyone tackled indexing multiple content Fields before that could shed some light?

Thanks,
Greg Langmead
Design Science, Inc., "How Science Communicates"
http://www.dessci.com
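A small plain-Java illustration of the offset problem described in this message, under invented assumptions: islands are marked with a hypothetical `<m>...</m>` tag, and `SplitFields.split` is a made-up helper that returns the two Field values as strings. It uses no Lucene classes; it only shows that a word's offset inside the text Field no longer matches its offset in the source once the islands are removed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitFields {
    // Hypothetical island markup: anything between <m> and </m>.
    static final Pattern ISLAND = Pattern.compile("<m>.*?</m>", Pattern.DOTALL);

    /** Returns {textField, islandField}: the text runs concatenated, and the islands joined by spaces. */
    static String[] split(String source) {
        StringBuilder text = new StringBuilder();
        StringBuilder islands = new StringBuilder();
        Matcher m = ISLAND.matcher(source);
        int last = 0;
        while (m.find()) {
            text.append(source, last, m.start()); // text run before this island
            if (islands.length() > 0) {
                islands.append(' ');
            }
            islands.append(m.group());
            last = m.end();
        }
        text.append(source, last, source.length()); // trailing text run
        return new String[] { text.toString(), islands.toString() };
    }

    public static void main(String[] args) {
        String source = "See <m>x+1</m> for details";
        String[] fields = split(source);
        System.out.println(fields[0].indexOf("details")); // offset within the text Field: 9
        System.out.println(source.indexOf("details"));    // offset within the source: 19
    }
}
```

The difference between the two offsets is exactly the length of the removed island, which is the quantity a highlighter working against the source document would need to recover.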