On Jul 21, 2009, at 11:57 AM, JCodina wrote:


Hello, Grant,
there are two ways to implement this: one is payloads, and the other is
multiple tokens at the same position.
Each of them can be useful; let me explain the way I think they can be used.
Payloads: every token carries extra information that can be used in
processing. For example, if I can add part-of-speech tags, then I can develop tokenizers that take the POS into account (for example, I can generate bigrams of Noun Adjective or Noun prep Noun, or I can have a better stopwords
algorithm...)
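
For example (only a sketch against the Lucene 2.9 attribute API; posFor() is a made-up placeholder for a real tagger, e.g. a UIMA annotator), a filter could stamp each token's POS tag on it as a payload:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

public final class PosPayloadFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);

  public PosPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    // Attach the tag as the token's payload; it is stored in the index
    // alongside the position and can be read back at scoring time.
    payAtt.setPayload(new Payload(posFor(termAtt.term()).getBytes()));
    return true;
  }

  // Placeholder: a real implementation would ask a POS tagger.
  private String posFor(String term) {
    return "NN";
  }
}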

Multiple tokens in one position: if I can have different tokens at the same place, I can store different kinds of information, like "was #verb _be", so I can do a search for "you _be #adjective" to find all the sentences that talk about
"you", for example "you were clever", "you are tall", ...

This was one of the use cases for payloads as well, but it likely needs more Query support at the moment, as the BoostingTermQuery would only allow you to boost matches where the payload says it's a verb, not include/exclude them.
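
For instance, about the most you can do on the query side today is something along these lines (org.apache.lucene.search.payloads.BoostingTermQuery; the field name "text" is just for illustration):

// Matches like a regular TermQuery on text:be, but each hit's score is
// multiplied by whatever Similarity.scorePayload(...) returns for that
// occurrence's payload. It boosts; it cannot include/exclude matches.
Query q = new BoostingTermQuery(new Term("text", "be"));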



I have not understood how the DelimitedPayloadTokenFilterFactory
works in Solr; what is the input format?

The DPTFF (nice acronym, eh?) allows you to send in your normal Solr XML, but with payloads encoded in the text. For instance:

<field name="foo">the quick|JJ red|JJ fox|NN jumped|VB over the lazy| JJ brown|JJ dogs|NN</field>

The DPTFF will take the value before the delimiter as the token and the value after the delimiter as the payload. This lets you index payloads without modifying a single thing in Solr, at least on the indexing side.
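
On the schema side you wire it up with something like this (I'm going from memory on the attribute names, so treat it as a sketch; encoder also takes "float" or "integer" for numeric payloads):

<fieldType name="text_payloads" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="|" encoder="identity"/>
  </analyzer>
</fieldType>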


So I was thinking of generating an XML where, for each token, a single string
is generated, like "was#verb#be",
and then a token filter splits each whitespace-separated string by "#" (in this case into three words) and keeps the marker character that
allows searching for the right semantic info, but gives them the same
position increment. Of course, the full processing chain must be aware of this.
But I still must think about multi-word tokens.


We could likely make a generic TokenFilter that captures both multiple tokens and payloads at the same time, simply by allowing it to have two attributes:
1. token delimiter (#)
2. payload delimiter (|)

Then, you could do something like:
was#be|verb
or
was#be|0.3

where "was" and "be" are both tokens at the same position and "verb" or "0.3" are payloads on those tokens. This is a nearly trivial variation of the DelimitedPayloadTokenFilter







Grant Ingersoll wrote:


On Jul 20, 2009, at 6:43 AM, JCodina wrote:

D: Break things down. The CAS would only produce XML that Solr can
process. Then different Tokenizers can be used to deal with the data in the
CAS. The main point is that the XML has the doc and field labels of Solr.

I just committed the DelimitedPayloadTokenFilterFactory, I suspect
this is along the lines of what you are thinking, but I haven't done
all that much with UIMA.

I also suspect the Tee/Sink capabilities of Lucene could be helpful,
but they aren't available in Solr yet.




E: The set of capabilities to process the XML is defined in XML,
similar to Lucas for defining the output, and in the Solr schema for
defining how this is processed.


I want to use it in order to index something that is common, but I can't get
any tool to do that with Solr: indexing a word and coding, at the same
position, the syntactic and semantic information. I know that in Lucene this
is evolving and it will be possible to include metadata, but for the moment

What does Lucas do with Lucene?  Is it putting multiple tokens at the
same position or using Payloads?







--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
