2009/7/23 Grant Ingersoll <gsing...@apache.org>:
>
> On Jul 21, 2009, at 11:57 AM, JCodina wrote:
>
>>
>> Hello, Grant,
>> there are two ways to implement this: one is payloads, and the other is
>> multiple tokens at the same position. Each of them can be useful; let me
>> explain the way I think they can be used.
>> Payloads: every token has extra information that can be used in the
>> processing. For example, if I can add part-of-speech tags, then I can
>> develop tokenizers that take the POS into account (I can generate bigrams
>> of Noun Adjective or Noun prep Noun, or I can have a better stopwords
>> algorithm...).
>>
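>> For illustration, an untested sketch of the "better stopwords" idea
>> against the Lucene 2.9 attribute API: a filter that drops tokens whose
>> payload marks them as determiners. It assumes an earlier filter already
>> attached the POS tag as a payload; the class name and tag are made up.
>>
>>   public final class PosStopFilter extends TokenFilter {
>>     private final PayloadAttribute payAtt =
>>         addAttribute(PayloadAttribute.class);
>>
>>     public PosStopFilter(TokenStream input) { super(input); }
>>
>>     public boolean incrementToken() throws IOException {
>>       while (input.incrementToken()) {
>>         Payload p = payAtt.getPayload();
>>         String tag = p == null ? "" : new String(p.getData());
>>         if (!"DT".equals(tag)) return true; // keep all but determiners
>>       }
>>       return false;
>>     }
>>   }
>>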
>> Multiple tokens in one position: if I can have different tokens at the
>> same place, I can store different kinds of information, like "was #verb
>> _be", so I can do a search for "you _be #adjective" to find all the
>> sentences that talk about "you", for example "you were clever", "you are
>> tall"...
>
> This was one of the use cases for payloads as well, but it likely needs
> more Query support at the moment, as the BoostingTermQuery would only let
> you boost terms where the payload marks a verb, not include/exclude them.
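>
> (To make that concrete: the usual workaround is a custom Similarity whose
> scorePayload() returns 0 for unwanted tags, but a zero score is still a
> match, so it only approximates exclusion. An untested sketch against the
> 2.4-era signature, which changes across Lucene versions:)
>
>   Similarity sim = new DefaultSimilarity() {
>     public float scorePayload(String field, byte[] payload,
>                               int offset, int length) {
>       String tag = new String(payload, offset, length);
>       return "VB".equals(tag) ? 1.0f : 0.0f; // keep verbs, zero the rest
>     }
>   };
>   // searcher.setSimilarity(sim); then query with BoostingTermQuery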
>
>>
>>
>> I have not understood how the DelimitedPayloadTokenFilterFactory works in
>> Solr; what is the input format?
>
> the DPTFF (nice acronym, eh?) allows you to send in your normal Solr XML,
> but with payloads encoded in the text.  For instance:
>
> <field name="foo">the quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ
> brown|JJ dogs|NN</field>
>
> The DPTFF will take the value before the delimiter as the Token and the
> value after the delimiter as the payload.  This then allows Solr to add
> Payloads without modifying a single thing in Solr, at least on the indexing
> side.
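>
> (Roughly what that does at the Lucene level; an untested sketch using the
> identity encoder, which stores the raw tag bytes as the payload:)
>
>   TokenStream ts = new WhitespaceTokenizer(
>       new StringReader("quick|JJ red|JJ fox|NN"));
>   ts = new DelimitedPayloadTokenFilter(ts, '|', new IdentityEncoder());
>   // emits tokens "quick", "red", "fox", each carrying its tag
>   // ("JJ", "JJ", "NN") as a payload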
>
>>
>> so I was thinking of generating an XML where for each token a single
>> string is generated, like "was#verb#be", and then a token filter splits
>> each whitespace-separated string by #, in this case into three words, and
>> adds the trailing character that allows searching for the right semantic
>> info, but gives them the same position increment. Of course the full
>> processing chain must be aware of this. But I still must think about
>> multiword tokens.
>>
>
> We could likely make a generic TokenFilter that can capture both multiple
> tokens and payloads at the same time, simply by allowing it to have two
> attributes:
> 1. token delimiter (#)
> 2. payload delimiter (|)
>
> Then, you could do something like:
> was#be|verb
> or
> was#be|0.3
>
> where "was" and "be" are both tokens at the same position and "verb" or
> "0.3" are payloads on those tokens.  This is a nearly trivial variation of
> the DelimitedPayloadTokenFilter
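>
> An untested sketch of such a filter against the Lucene 2.9 attribute API
> (delimiters hard-coded, error handling omitted; the class name is made up):
>
>   import java.io.IOException;
>   import java.util.LinkedList;
>   import org.apache.lucene.analysis.TokenFilter;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.tokenattributes.*;
>   import org.apache.lucene.index.Payload;
>
>   public final class TokenAndPayloadFilter extends TokenFilter {
>     private final TermAttribute termAtt = addAttribute(TermAttribute.class);
>     private final PositionIncrementAttribute posAtt =
>         addAttribute(PositionIncrementAttribute.class);
>     private final PayloadAttribute payAtt =
>         addAttribute(PayloadAttribute.class);
>     private final LinkedList<String> stacked = new LinkedList<String>();
>
>     public TokenAndPayloadFilter(TokenStream input) { super(input); }
>
>     public boolean incrementToken() throws IOException {
>       if (!stacked.isEmpty()) {
>         emit(stacked.removeFirst());
>         posAtt.setPositionIncrement(0); // same position as previous token
>         return true;
>       }
>       if (!input.incrementToken()) return false;
>       String[] parts = termAtt.term().split("#");
>       for (int i = 1; i < parts.length; i++) stacked.add(parts[i]);
>       emit(parts[0]); // keeps the original position increment
>       return true;
>     }
>
>     // split "be|verb" into term "be" and payload "verb"
>     private void emit(String part) {
>       int bar = part.indexOf('|');
>       payAtt.setPayload(bar < 0 ? null
>           : new Payload(part.substring(bar + 1).getBytes()));
>       termAtt.setTermBuffer(bar < 0 ? part : part.substring(0, bar));
>     }
>
>     public void reset() throws IOException {
>       super.reset();
>       stacked.clear();
>     }
>   }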
>

Hi.

Apologies if I'm hijacking the thread... I for one would very much like
this behaviour when indexing XML documents. I have a requirement to get
the matching field's XPath location in the document. I currently generate
the index like this:

some_field: {{ payload "//p[1]" }} actual text content of first p element

Then I strip the "payload" part with a custom filter (before the other,
"normal" filters), but store the text with the "payload" part included.
The client side then gets the XPath, and the user can choose to fetch the
matched part from the found document. The user of course sees the actual
text with highlighting, with the "payload" part removed. I think Lucene's
payload mechanism would be a better fit for this, but not being too
competent with Java I developed this hack. It does make client-side
parsing that much more difficult...

Of course the payload would need to find its way into Solr's query
response XML somehow.
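
On the retrieval side I imagine something like this (an untested sketch;
"reader" is an open IndexReader and the field and term come from my
example above):

  TermPositions tp = reader.termPositions(new Term("some_field", "actual"));
  if (tp.next()) {
    tp.nextPosition();                     // must advance before the payload
    if (tp.isPayloadAvailable()) {
      byte[] buf = tp.getPayload(new byte[tp.getPayloadLength()], 0);
      System.out.println(new String(buf)); // -> //p[1]
    }
  }
  tp.close();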

Thank you.


Jussi Arpalahti

>
>>
>> Grant Ingersoll-6 wrote:
>>>
>>>
>>> On Jul 20, 2009, at 6:43 AM, JCodina wrote:
>>>
>>>> D: Break things down. The CAS would only produce XML that Solr can
>>>> process. Then different Tokenizers can be used to deal with the data in
>>>> the CAS. The main point is that the XML has the doc and field labels of
>>>> Solr.
>>>
>>> I just committed the DelimitedPayloadTokenFilterFactory, I suspect
>>> this is along the lines of what you are thinking, but I haven't done
>>> all that much with UIMA.
>>>
>>> I also suspect the Tee/Sink capabilities of Lucene could be helpful,
>>> but they aren't available in Solr yet.
>>>
>>>
>>>
>>>
>>>> E: The set of capabilities to process the XML is defined in XML,
>>>> similar to Lucas defining the output, and in the Solr schema defining
>>>> how this is processed.
>>>>
>>>>
>>>> I want to use it in order to index something that is common, but I
>>>> can't get any tool to do it with Solr: indexing a word and coding the
>>>> syntactic and semantic information at the same position. I know that in
>>>> Lucene this is evolving and it will be possible to include metadata,
>>>> but for the moment
>>>
>>> What does Lucas do with Lucene?  Is it putting multiple tokens at the
>>> same position or using Payloads?
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>>>
>>
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
