On Nov 17, 2007 2:18 PM, Tricia Williams <[EMAIL PROTECTED]> wrote:
>     I was wondering how Solr people feel about the inclusion of Payload
> functionality in the Solr codebase?

All for it... depending on what one means by "payload functionality" of course.
We should probably hold off on adding a new lucene version to Solr
until the Payload API has stabilized (it will most likely be changing
very soon).

>     From a recent message to the [EMAIL PROTECTED] mailing list:
> >   I'm working on the issue
> > https://issues.apache.org/jira/browse/SOLR-380 which is a feature
> > request that allows one to index a "Structured Document" which is
> > anything that can be represented by XML in order to provide more
> > context to hits in the result set.  This allows us to do things like
> > query the index for "Canada" and be able to not only say that that
> > query matched a document titled "Some Nonsense" but also that the
> > query term appeared on page 7 of chapter 1.  We can then take this one
> > step further and markup/highlight the image of this page based on our
> > OCR and position hit.
> > For example:
> >
> > <book title='Some Nonsense'><chapter title='One'><page name='1'>Some
> > text from page one of a book.</page><page name='7'>Some more text from
> > page seven of a book. Oh and I'm from Canada.</page></chapter></book>
> >
> >   I accomplished this by creating a custom Tokenizer which strips the
> > xml elements and stores them as a Payload at each of the Tokens
> > created from the character data in the input.  The payload is the
> > string that describes the XPath at that location.  So for <Canada> the
> > payload is "/book[title='Some
> > Nonsense']/chapter[title='One']/page[name='7']"

That's a lot of data to associate with every token... I wonder how
others have accomplished this?
One could compress it with a dictionary somewhere.
I wonder if one could index special begin_tag and end_tag tokens, and
somehow use span queries?
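To make the dictionary idea concrete, here is a minimal, self-contained sketch (plain JDK, no Lucene — the class and field names are illustrative, not the SOLR-380 patch): it walks the example XML with a StAX reader, tracks the element path on a stack, emits whitespace tokens, and interns each XPath-like string into a dictionary so every token carries only a small int instead of the full path.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class XPathPayloadDemo {
    // The path string is stored once here; each token carries only its int id.
    static final Map<String, Integer> pathDict = new LinkedHashMap<>();
    static final List<String[]> tokens = new ArrayList<>(); // {term, pathId}

    static int internPath(String path) {
        return pathDict.computeIfAbsent(path, p -> pathDict.size());
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book title='Some Nonsense'><chapter title='One'>"
            + "<page name='1'>Some text from page one of a book.</page>"
            + "<page name='7'>Some more text from page seven of a book. "
            + "Oh and I'm from Canada.</page></chapter></book>";

        XMLInputFactory f = XMLInputFactory.newInstance();
        f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));

        Deque<String> path = new ArrayDeque<>();
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    // Build one path step, e.g. page[name='7'] (first attribute only).
                    StringBuilder step = new StringBuilder(r.getLocalName());
                    if (r.getAttributeCount() > 0) {
                        step.append("[").append(r.getAttributeLocalName(0))
                            .append("='").append(r.getAttributeValue(0)).append("']");
                    }
                    path.addLast(step.toString());
                    break;
                }
                case XMLStreamConstants.END_ELEMENT: {
                    path.removeLast();
                    break;
                }
                case XMLStreamConstants.CHARACTERS: {
                    int id = internPath("/" + String.join("/", path));
                    for (String term : r.getText().trim().split("\\s+")) {
                        tokens.add(new String[]{term, Integer.toString(id)});
                    }
                    break;
                }
            }
        }

        // Show the path the token "Canada." would carry as its payload.
        for (String[] t : tokens) {
            if (t[0].equals("Canada.")) {
                for (Map.Entry<String, Integer> e : pathDict.entrySet()) {
                    if (e.getValue() == Integer.parseInt(t[1])) {
                        System.out.println(e.getKey());
                    }
                }
            }
        }
    }
}
```

For the example document, only two distinct paths exist, so the per-token payload shrinks from a ~60-byte string to a single small integer (which Lucene could store as a VInt payload).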

>     Using Payloads requires me to include lucene-core-2.3-dev.jar  which
> might be a barrier.  Also, using my Tokenizer with Solr-specific
> TokenFilter(s) loses the Payload at modified tokens.

Yes, this will be an issue for many custom token filters that don't yet
know about payloads but that create new tokens.  It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.

-Yonik
