Re: Document contents split among different Fields

2004-09-23 Thread Erik Hatcher
On Sep 23, 2004, at 6:00 PM, Greg Langmead wrote:
> Doug Cutting wrote:
>> Do you need highlights from all fields?  If so, then you can use:
>>
>>    TextFragment[] getBestTextFragments(TokenStream, ...);
>>
>> with a TokenStream for each field, then select the highest scoring
>> fragments across all fields.  Would that work for you?
>
> Thanks for the reply.  I can't find code like this in the lucene or
> lucene-demo packages -- is this something implemented, or did you mean it as
> an example?

TextFragment is part of the Highlighter contribution in the 
jakarta-lucene-sandbox CVS repository.  Check it out and you'll have 
what Doug is speaking of.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Document contents split among different Fields

2004-09-23 Thread Greg Langmead
Doug Cutting wrote:
> Do you need highlights from all fields?  If so, then you can use:
> 
>    TextFragment[] getBestTextFragments(TokenStream, ...);
> 
> with a TokenStream for each field, then select the highest scoring 
> fragments across all fields.  Would that work for you?

Thanks for the reply.  I can't find code like this in the lucene or
lucene-demo packages -- is this something implemented, or did you mean it as
an example?

Once I get a text fragment, are you proposing using it to do a secondary
search within the source document, to match the fragment?

I would like to do highlighting on content from either of my Fields, but I
think that even if I didn't I'd have the same problem, because I'll have
punched holes in the text Field and the positional data within the Field no
longer reflects the position in the source.

I think that if I want to pick the document apart into pieces like this,
then I need to do some work to restore global positional data, by
squirreling away the size of the holes I punch (the size of the XML islands,
from the text Field's point of view, and the size of the text runs, from the
island Field's point of view).  If I store a special textual escape within
the Field data that records the length of each gap, then I can read those
escapes when Tokenizing the Field and add the number stored therein to the
Token offset, restoring the global positional data.  Does that make sense?
I'm concerned this does violence to Lucene's model, which I've only been
studying for a couple of weeks now.
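
A minimal sketch of the escape idea described above, in plain Java with no
Lucene classes. Everything named here is hypothetical: the "{gap:N}" escape
syntax, the GapAwareTokenizer class, and the Tok holder. It assumes the
escape sequence never occurs in ordinary text, and it only shows the offset
arithmetic, not a real Tokenizer implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GapAwareTokenizer {
    /** A token whose offsets are expressed in source-document coordinates. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Hypothetical escape: "{gap:N}" stands in for N characters that were
    // removed from the source when this Field's data was extracted.
    private static final Pattern PIECE = Pattern.compile("\\{gap:(\\d+)\\}|\\w+");

    static List<Tok> tokenize(String fieldData) {
        List<Tok> tokens = new ArrayList<>();
        Matcher m = PIECE.matcher(fieldData);
        int shift = 0; // field offset + shift = source offset
        while (m.find()) {
            if (m.group(1) != null) {
                // The escape occupies its own length in the field data but
                // represents N characters of the source document.
                int gap = Integer.parseInt(m.group(1));
                shift += gap - (m.end() - m.start());
            } else {
                tokens.add(new Tok(m.group(), m.start() + shift, m.end() + shift));
            }
        }
        return tokens;
    }
}
```

For the source text "Let <m>x+1</m> be positive." with the ten-character
island replaced by "{gap:10}", tokenizing "Let {gap:10} be positive." yields
"positive" at offsets 18 to 26, which is where the word sits in the source.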

Greg




Re: Document contents split among different Fields

2004-09-23 Thread Doug Cutting
Greg Langmead wrote:
> Am I right in saying that the design of Token's support for highlighting
> really only supports having the entire document stored as one monolithic
> "contents" Field?

No, I don't think so.

> Has anyone tackled indexing multiple content Fields
> before that could shed some light?

Do you need highlights from all fields?  If so, then you can use:
  TextFragment[] getBestTextFragments(TokenStream, ...);
with a TokenStream for each field, then select the highest scoring 
fragments across all fields.  Would that work for you?
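
The selection step suggested above can be sketched as follows. This is not
Highlighter code: Frag is a hypothetical stand-in for a scored fragment, and
BestAcrossFields is an illustrative name. It assumes each field has already
produced its own list of fragments and that scores from different fields are
directly comparable, which may not hold when the fields use different
Analyzers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BestAcrossFields {
    /** Hypothetical stand-in for a scored highlighter fragment. */
    static final class Frag {
        final String field;
        final String text;
        final float score;

        Frag(String field, String text, float score) {
            this.field = field;
            this.text = text;
            this.score = score;
        }
    }

    /** Pools the per-field results and keeps the maxFrags highest-scoring ones. */
    static List<Frag> best(List<List<Frag>> perField, int maxFrags) {
        List<Frag> all = new ArrayList<>();
        for (List<Frag> frags : perField) {
            all.addAll(frags);
        }
        // Highest score first.
        all.sort(Comparator.comparingDouble((Frag f) -> f.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(maxFrags, all.size())));
    }
}
```

Each Frag keeps its field name so the caller can still tell which Field a
winning fragment came from when rendering the highlight.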

Doug


Document contents split among different Fields

2004-09-23 Thread Greg Langmead
I am working on extending Lucene to support documents with special islands
of an XML language, and I want to index the islands differently from the
text.  My current plan is to break the document's contents into two Fields,
one with all the text and one with all the special islands, and use a
different Analyzer on each Field.

In heading down this road, I realized that this approach breaks the whole
model of Token as it supports highlighting.  Token seems designed to store
offsets within a given Field, so if you break a document up into pieces, the
offsets are meaningless in terms of the original source document.
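
A small illustration of the mismatch, in plain Java with no Lucene classes.
The "<m>...</m>" island syntax and the SplitOffsetsDemo class are assumptions
made up for this example; the point is only that an offset measured inside
the text-only Field differs from the offset of the same word in the source.

```java
import java.util.regex.Pattern;

final class SplitOffsetsDemo {
    // Assumed island syntax for this example: anything inside <m>...</m>.
    private static final Pattern ISLAND = Pattern.compile("<m>.*?</m>");

    /** Returns the source with the islands removed, as a text-only Field would hold it. */
    static String textOnly(String source) {
        return ISLAND.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "Let <m>x+1</m> be positive.";
        String field = textOnly(source);

        // Offset of the word in the original document.
        System.out.println(source.indexOf("positive")); // prints 18
        // Offset of the same word inside the text-only Field.
        System.out.println(field.indexOf("positive"));  // prints 8
    }
}
```

The difference of 10 is exactly the length of the removed island, so a
highlighter working from the Field offset would mark the wrong span of the
source.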

Am I right in saying that the design of Token's support for highlighting
really only supports having the entire document stored as one monolithic
"contents" Field?  Has anyone tackled indexing multiple content Fields
before that could shed some light?

Thanks,
Greg Langmead
Design Science, Inc., "How Science Communicates"
http://www.dessci.com
