Answers inline, below:
David Buttler <[EMAIL PROTECTED]> wrote on 05/08/2008 06:59:31 PM:
>
> I wrote a tool similar to this, but with a bit less functionality, so I
> think this type of tool is very useful and I would be interested in
> contributing. The key features that I would look for are:
> 1) it is fast
I don't have any hard numbers to share at the moment, but performance
is very good, even with very large dictionaries.
> 2) it can handle very large dictionaries without slowing down. For
> example, you might want to load UMLS into a dictionary (assuming you
> had sufficient memory)
It can. Or so I say :)
> You mentioned that you support 10K entries -- is the runtime dependent
> on the number of entries in the dictionary or on the number of token
> matches? Is the internal data structure some type of state machine?
>
My comment about 10K entries was just an example. You're limited only
by your available memory. Someone I know has used the annotator with
millions of entries with no problems.
The internal data structure is simply a map keyed by a head word,
pointing to the potential matches starting with that head word (ordered
by length, to facilitate longest-match). When order-independent lookup
is enabled (yes, this is a dangerous thing, but it can be useful in some
domains), each token of each entry is used as a key, which does blow
up memory usage a bit.
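To sketch the idea in rough Java (an illustration only, not the actual
ConceptMapper code; all names here are made up):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration: head word -> candidate entries, kept longest-first so
// the first candidate that matches completely is the longest match.
public class HeadWordIndex {
    private final Map<String, List<String[]>> index = new HashMap<>();

    public void add(String[] entryTokens) {
        List<String[]> bucket =
            index.computeIfAbsent(entryTokens[0], k -> new ArrayList<>());
        bucket.add(entryTokens);
        bucket.sort(Comparator.comparingInt((String[] e) -> e.length).reversed());
        // For order-independent lookup, every token of the entry would
        // be added as a key, which is why that mode uses more memory.
    }

    public List<String[]> candidatesFor(String headWord) {
        return index.getOrDefault(headWord, new ArrayList<>());
    }
}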
> It wasn't clear to me if you supported boolean operators, but perhaps
> this is the type of functionality that you would put in a post filter?
> e.g. you match 'colon' and 'rectum' separately and only produce
> results when both matches are made, but not when 'colonoscopy' is
> present.
>
That would probably be done with some post-processing. Matching is
strictly done as string matching, the only exceptions being case
insensitivity, stemming, and token skipping (either via a stop word
list or based on particular feature values, as I described [or tried
to]). One other possibility might be to run the annotator twice: once
marking all tokens in the presence of 'colonoscopy' with some marker,
then skipping all tokens with said marker in the second pass. That's
not too efficient, but might be suitable in certain circumstances.
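For instance, the kind of check you describe could be as simple as this
(a hypothetical, standalone sketch; this is not ConceptMapper's actual
filter interface):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical post-filter: keep results only when both 'colon' and
// 'rectum' matched and 'colonoscopy' did not.
public class BothTermsCheck {
    public boolean accept(List<String> matchedTexts) {
        Set<String> seen = new HashSet<String>();
        for (String m : matchedTexts) {
            seen.add(m.toLowerCase());
        }
        return seen.contains("colon") && seen.contains("rectum")
            && !seen.contains("colonoscopy");
    }
}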
> So, if you could skip tokens, would it be possible for an entire
> document to match, assuming the dictionary contained 'A B' and the
> first token in the document is 'A' and the last token is 'B'? Or do
> you limit the match to a window of some type? If it is a window, is
> the window defined by the data (e.g. paragraph markers) or by the
> dictionary (e.g. N tokens)?
As I said in my original post: "Input tokens are processed one span at
a time, where both the token and span (usually a sentence) annotation
type are configurable." So, you could specify DocumentAnnotation as
the span, but I have usually used a sentence. In any case, the span to
use is an annotation, and the type of annotation is specified in the
descriptor file used for running the annotator.
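In rough terms, the processing loop looks like this (a sketch against
the standard UIMA CAS API; in the real annotator both type names come
from the descriptor, they are not hard-coded):

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.cas.text.AnnotationIndex;

// Sketch: walk each span (e.g. a sentence), then only the tokens that
// span covers. Both type names are configuration, shown as parameters.
public class SpanWalker {
    public void process(CAS cas, String spanTypeName, String tokenTypeName) {
        Type spanType = cas.getTypeSystem().getType(spanTypeName);
        Type tokenType = cas.getTypeSystem().getType(tokenTypeName);
        AnnotationIndex<AnnotationFS> tokens = cas.getAnnotationIndex(tokenType);
        for (AnnotationFS span : cas.getAnnotationIndex(spanType)) {
            FSIterator<AnnotationFS> it = tokens.subiterator(span);
            while (it.hasNext()) {
                AnnotationFS tok = it.next();
                // ... dictionary lookup starting at tok ...
            }
        }
    }
}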
>
> Another feature that seems useful is token-based regular expressions
> (e.g. matching 'run*' or '199?'). This feature really killed
> performance when I added it to my tool; perhaps you have a better way
> of approaching that requirement.
Nope, this is not supported at this point. Some have suggested adding
it, but it was never deemed necessary in any of my projects, and it
would likely be difficult to implement efficiently, as you found. It
would certainly be a nice thing to add in the next release, if done
well...
>
> In any case, it seems very interesting.
> Dave
>
> Michael A Tanenblatt wrote:
> > My group would like to offer the following UIMA component,
> > ConceptMapper, as an open source offering into the UIMA sandbox,
> > assuming there is interest from the community:
> >
> > ConceptMapper is a token-based dictionary lookup UIMA component. It
> > was designed specifically to allow any external tokenizer that is a
> > UIMA component to be used to tokenize its dictionary. Using the same
> > tokenizer both on the dictionary and for subsequent text processing
> > prevents situations where a particular dictionary entry, though it
> > exists, is not found because it was tokenized differently than the
> > text being processed.
> >
> > ConceptMapper is highly configurable, in terms of:
> > * the way dictionary entries are mapped to resultant annotations
> > * the way input documents are processed
> > * the availability of multiple lookup strategies
> > * its various output options
> >
> > Additionally, a set of post-processing filters is supplied, as well
> > as an interface for easily creating new filters. This allows for
> > overgenerating results during the lookup phase, if so desired, then
> > reducing the result set according to particular rules.
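Purely to illustrate the filtering idea, one plausible shape for such
an interface (the actual one shipped with ConceptMapper may differ):

import java.util.List;

// Illustrative only: reduce an overgenerated candidate list.
public interface ResultFilter<T> {
    List<T> filter(List<T> candidates);
}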
> >
> > More details:
> >
> > The structure of the dictionary itself is quite flexible. Entries
> > can have any number of variants (synonyms), and arbitrary features
> > can be associated with dictionary entries. Individual variants
> > inherit features from the parent token (i.e., the canonical form),
> > but can override them or add additional features. In the following
> > sample dictionary entry, there are 5 variants of the canonical form,
> > and as described earlier, each inherits the SemClass and POS
> > attributes from the canonical form, with the exception of the
> > variant "mesenteric fibromatosis (c48.1)", which overrides the value
> > of the SemClass attribute (this is somewhat of a contrived example,
> > just to make that point):
> >
> > <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
> > <variant base="abdominal fibromatosis" />
> > <variant base="abdominal desmoid" />
> > <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
> > <variant base="mesenteric fibromatosis" />
> > <variant base="retroperitoneal fibromatosis" />
> > </token>
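The inheritance rule amounts to overlaying a variant's own attributes
on top of the canonical form's, roughly like this (sketch only, not the
actual code):

import java.util.HashMap;
import java.util.Map;

// Sketch: start from the canonical form's features, then let the
// variant override existing values or add its own.
public class VariantFeatures {
    public static Map<String, String> merge(Map<String, String> canonical,
                                            Map<String, String> overrides) {
        Map<String, String> merged = new HashMap<>(canonical);
        merged.putAll(overrides);
        return merged;
    }
}

For the sample entry above, merging {SemClass=Diagnosis, POS=NN} with
{SemClass=Diagnosis-Site} yields {SemClass=Diagnosis-Site, POS=NN} for
the "(c48.1)" variant.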
> >
> > Input tokens are processed one span at a time, where both the token
> > and span (usually a sentence) annotation type are configurable.
> > Additionally, the particular feature of the token annotation to use
> > for lookups can be specified; otherwise its covered text is used.
> > Other input configuration settings are whether to use case sensitive
> > matching, an optional class name of a stemmer to apply to the
> > tokens, and a list of stop words to ignore during lookup. One
> > additional input control mechanism is the ability to skip tokens
> > during lookups based on particular feature values. In this way, it
> > is easy to skip, for example, all tokens with particular part of
> > speech tags, or with some previously computed semantic class.
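The skip test itself is conceptually simple, something along these
lines (illustrative only):

import java.util.Set;

// Illustration: skip a token if it is a stop word, or if its configured
// feature (e.g. its POS tag) carries one of the excluded values.
public class TokenSkipRule {
    public boolean skip(String tokenText, String featureValue,
                        Set<String> stopWords, Set<String> excludedValues) {
        return stopWords.contains(tokenText.toLowerCase())
            || (featureValue != null && excludedValues.contains(featureValue));
    }
}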
> >
> > Output is in the form of new annotations, and the type of resulting
> > annotations can be specified in a descriptor file. The mapping from
> > dictionary entry attributes to the result annotation features can
> > also be specified. Additionally, a string containing the matched
> > text, a list of matched tokens, and the span enclosing the match can
> > be specified to be set in the result annotations. It is also
> > possible to indicate dictionary attributes to write back into each
> > of the matched tokens.
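In CAS terms, writing a result boils down to something like the
following (a sketch against the standard UIMA API; the mapping itself
would come from the descriptor, and the names here are placeholders):

import java.util.Map;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

// Sketch: create a result annotation over the matched span and copy
// the configured dictionary attributes onto its features.
public class ResultWriter {
    public void write(CAS cas, Type resultType, int begin, int end,
                      Map<String, String> attrToFeature,    // attr -> feature name
                      Map<String, String> entryAttributes) { // from the dictionary
        AnnotationFS result = cas.createAnnotation(resultType, begin, end);
        for (Map.Entry<String, String> m : attrToFeature.entrySet()) {
            Feature f = resultType.getFeatureByBaseName(m.getValue());
            result.setStringValue(f, entryAttributes.get(m.getKey()));
        }
        cas.addFsToIndexes(result);
    }
}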
> >
> > Dictionary lookup is controlled by three parameters in the
> > descriptor: one allows for order-independent lookup (i.e., A B ==
> > B A), another toggles between finding only the longest match vs.
> > finding all possible matches, and the final parameter specifies the
> > search strategy, of which there are three. The default search
> > strategy only considers contiguous tokens (not including tokens from
> > the stop word list or otherwise skipped tokens), and then begins the
> > subsequent search after the longest match. The second strategy
> > allows for ignoring non-matching tokens, allowing for disjoint
> > matches, so that a dictionary entry of
> >
> > A C
> >
> > would match against the text
> >
> > A B C
> >
> > As with the default search strategy, the subsequent search begins
> > after the longest match. The final search strategy is identical to
> > the previous one, except that subsequent searches begin one token
> > ahead, instead of after the previous match. This enables overlapped
> > matching.
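The difference between the three strategies is easiest to see in toy
form (again, just an illustration, with dictionary entries reduced to
plain token sequences):

import java.util.Arrays;
import java.util.List;

// Toy illustration of the three search strategies; not ConceptMapper code.
public class StrategyDemo {
    enum Strategy { CONTIGUOUS, SKIP_NONMATCHING, SKIP_AND_OVERLAP }

    // Try to match 'entry' starting at token i; returns the index just
    // past the last matched token, or -1 if no match is possible there.
    static int match(List<String> tokens, List<String> entry, int i,
                     boolean allowGaps) {
        if (!tokens.get(i).equals(entry.get(0))) return -1; // anchor on head word
        int e = 1, last = i;
        for (int t = i + 1; t < tokens.size() && e < entry.size(); t++) {
            if (tokens.get(t).equals(entry.get(e))) { e++; last = t; }
            else if (!allowGaps) break; // contiguous: stop at first mismatch
        }
        return e == entry.size() ? last + 1 : -1;
    }

    static void scan(List<String> tokens, List<String> entry, Strategy s) {
        boolean gaps = (s != Strategy.CONTIGUOUS);
        for (int i = 0; i < tokens.size(); ) {
            int end = match(tokens, entry, i, gaps);
            if (end < 0) { i++; continue; }
            System.out.println(s + ": match starting at token " + i);
            // Overlapping strategy restarts one token ahead; the others
            // restart after the end of the match just found.
            i = (s == Strategy.SKIP_AND_OVERLAP) ? i + 1 : end;
        }
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("A", "B", "C");
        scan(text, Arrays.asList("A", "C"), Strategy.SKIP_NONMATCHING); // matches
        scan(text, Arrays.asList("A", "C"), Strategy.CONTIGUOUS);       // no match
    }
}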
> >
> >
> > --
> > Michael Tanenblatt
> > IBM T.J. Watson Research Center
> > 19 Skyline Drive
> > P.O. Box 704
> > Hawthorne, NY 10532
> > USA
> > Tel: +1 (914) 784 7030 t/l 863 7030
> > Fax: +1 (914) 784 6054
> > [EMAIL PROTECTED]
> >
> >
> >
>