Re: Any interest in this as an open source project?

Michael A Tanenblatt Thu, 08 May 2008 13:47:22 -0700

OK, I tried to answer your questions inline, below. If I missed anything,
I can certainly try again:


Marshall Schor <[EMAIL PROTECTED]> wrote on 05/08/2008 03:18:39 PM:

>
> Sounds interesting; see below for some questions:
>
> Michael A Tanenblatt wrote:
> > My group would like to offer the following UIMA component, ConceptMapper,
> > as an open source offering into the UIMA sandbox, assuming there is
> > interest from the community:
> > ConceptMapper is a token-based dictionary lookup UIMA component. It was
> > designed specifically to allow any external tokenizer that is a UIMA
> > component to be used to tokenize its dictionary. Using the same tokenizer
> > on both the dictionary and for subsequent text processing prevents
> > situations where a particular dictionary entry is not found, though it
> > exists, because it was tokenized differently than the text being processed.
> >
> Is the idea that the tokenizer for the dictionary is run during some
> kind of "build" process which occurs, maybe once, before the "run" process?

It depends on the size of the dictionary and how patient you are. The
dictionary is loaded as a UIMA resource, and the loading/tokenization can
be done at resource loading time, or it could be precompiled (Java object
serialization) and then loaded in that form. For dictionaries on the order
of 10K entries and running on a modern laptop, the loading doesn't take
more than a couple of seconds at most.
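
To illustrate why tokenizing the dictionary and the text with the same
tokenizer matters, and what "precompiled" loading might look like, here is a
minimal Python sketch. It is not ConceptMapper code: the tokenize() and
build_index() helpers are hypothetical, and pickle merely stands in for the
Java object serialization mentioned above.

```python
import pickle
import re


def tokenize(text):
    # Hypothetical stand-in for a real UIMA tokenizer: lowercase, then
    # split into runs of word characters.
    return tuple(re.findall(r"\w+", text.lower()))


def build_index(entries):
    # Key every dictionary entry by its token sequence, so lookups compare
    # token sequences produced by the SAME tokenizer as the input text.
    return {tokenize(entry): entry for entry in entries}


# Tokenize at load time...
index = build_index(["colon", "rectum", "rectum colon"])

# ...or "precompile" once and reload the already tokenized form later.
blob = pickle.dumps(index)
restored = pickle.loads(blob)

# Extra whitespace and different casing do not matter, because the text
# goes through the same tokenizer as the dictionary did.
hit = restored.get(tokenize("Rectum   Colon"))
print(hit)
```

If the text were tokenized differently from the dictionary (for example,
without lowercasing), the final lookup would miss even though the entry
exists.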


> > ConceptMapper is highly configurable, in terms of:
> >  * the way dictionary entries are mapped to resultant annotations
> >  * the way input documents are processed
> >  * the availability of multiple lookup strategies
> >  * its various output options.
> >
> > Additionally, a set of post-processing filters is supplied, as well as an
> > interface to easily create new filters. This allows for overgenerating
> > results during the lookup phase, if so desired, then reducing the result
> > set according to particular rules.
> >
> Can you give some examples of "overgenerating" and "filtering" in the
> context of looking things up?

Here is an example from the domain of colon pathology. Given the text:

      colon, rectum

and dictionary entries of

      colon
      rectum
      rectum colon

one could argue that one, two, or all three entries should be found. In
fact, finding all three was required for a recent project in which I was
involved. It is easy to configure ConceptMapper to find all three, but
perhaps one would want to then eliminate one of the results based on some
particular domain-specific rules. Another example from the same domain is
the text:

      carcinoma in adenomatous polyp

and the dictionary contains both:

      carcinoma
      carcinoma in adenomatous polyp

If you were only looking for longest matches, the second item would be
found and you would be done. But if you allow for overlapping results, both
would be identified. Again, in the same recent project, this was a
requirement. The only exceptions were for "generic" terms like "carcinoma".
So we would generate both, then filter out the generic terms that are
subsumed by a longer entry. If they are not subsumed by a longer entry,
they would not be filtered out.
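
The overgenerate-then-filter idea can be sketched roughly as follows, in
Python. This is only an illustration under my own assumptions: Match,
find_all() and drop_subsumed_generics() are hypothetical names, not part of
the ConceptMapper filter interface, and the lookup here is a naive
contiguous scan.

```python
from typing import NamedTuple


class Match(NamedTuple):
    begin: int   # index of the first matched token
    end: int     # index one past the last matched token
    entry: str   # dictionary entry that matched


def find_all(tokens, entries):
    """Overgenerate: report every contiguous token run that is an entry."""
    matches = []
    for begin in range(len(tokens)):
        for end in range(begin + 1, len(tokens) + 1):
            candidate = " ".join(tokens[begin:end])
            if candidate in entries:
                matches.append(Match(begin, end, candidate))
    return matches


def drop_subsumed_generics(matches, generic):
    """Filter: drop a generic term only when a longer match covers it."""
    def subsumed(m):
        return any(
            o.begin <= m.begin and m.end <= o.end
            and (o.end - o.begin) > (m.end - m.begin)
            for o in matches
        )
    return [m for m in matches if not (m.entry in generic and subsumed(m))]


entries = {"carcinoma", "carcinoma in adenomatous polyp"}
generic = {"carcinoma"}

raw = find_all("carcinoma in adenomatous polyp".split(), entries)
kept = drop_subsumed_generics(raw, generic)

# A generic term standing on its own is not subsumed, so it survives.
alone = drop_subsumed_generics(find_all(["carcinoma"], entries), generic)

print([m.entry for m in raw])
print([m.entry for m in kept])
print([m.entry for m in alone])
```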

> > More details:
> >
> >
> Is the dictionary an external xml file?  Do you pre-process this into
> some run-time form, or load and tokenize the external dictionary every
> time this component is initialized?
>
> What does this component presume regarding memory footprint - does it
> work with large, external dictionaries without taking up very much
> "in-Ram" storage, or does it load the whole dictionary into memory in
> some internal format, for the duration of the run?

All in memory. Of course, this makes lookups pretty fast...



> > The structure of the dictionary itself is quite flexible. Entries can have
> > any number of variants (synonyms), and arbitrary features can be associated
> > with dictionary entries. Individual variants inherit features from the parent
> > token (i.e., the canonical form), but can override them or add additional
> > features. In the following sample dictionary entry, there are 5 variants of
> > the canonical form, and as described earlier, each inherits the SemClass
> > and POS attributes from the canonical form, with the exception of the
> > variant "mesenteric fibromatosis (c48.1)", which overrides the value of the
> > SemClass attribute (this is somewhat of a contrived example, just to make
> > that point):
> >
> >
> Is this the format of the external form of the dictionary?  Are the xml
> tags and attributes predefined, or is it up to the user to define them?

This is indeed the format of a single entry in the dictionary. The set of
attributes is not predefined, and the mapping from attributes to resultant
annotations is configurable in the AE descriptor.
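
To make the inheritance rule concrete, here is a small Python sketch that
reads the sample entry quoted below and computes the effective features of
each variant. It is an illustration only: expand_variants() is a
hypothetical helper, and treating "canonical" as just another inherited
attribute is my assumption, not a statement about ConceptMapper internals.

```python
import xml.etree.ElementTree as ET

ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
   <variant base="abdominal fibromatosis" />
   <variant base="abdominal desmoid" />
   <variant base="mesenteric fibromatosis (c48.1)"
            SemClass="Diagnosis-Site" />
   <variant base="mesenteric fibromatosis" />
   <variant base="retroperitoneal fibromatosis" />
</token>
"""


def expand_variants(token_xml):
    """Map each variant's text to its effective feature dictionary."""
    token = ET.fromstring(token_xml.strip())
    inherited = dict(token.attrib)  # attributes of the canonical form
    result = {}
    for variant in token.findall("variant"):
        features = dict(inherited)  # start from the parent's values
        # Anything other than "base" on the variant overrides or extends
        # what was inherited from the parent token.
        features.update(
            {k: v for k, v in variant.attrib.items() if k != "base"}
        )
        result[variant.get("base")] = features
    return result


features = expand_variants(ENTRY)
for base, feats in features.items():
    print(base, "->", feats["SemClass"], feats["POS"])
```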


> > <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
> >    <variant base="abdominal fibromatosis" />
> >    <variant base="abdominal desmoid" />
> >    <variant base="mesenteric fibromatosis (c48.1)"
> > SemClass="Diagnosis-Site" />
> >    <variant base="mesenteric fibromatosis" />
> >    <variant base="retroperitoneal fibromatosis" />
> > </token>
> >
> > Input tokens are processed one span at a time, where both the token and
> > span (usually a sentence) annotation type are configurable. Additionally,
> > the particular feature of the token annotation to use for lookups can be
> > specified, otherwise its covered text is used. Other input configuration
> > settings are whether to use case sensitive matching, an optional class name
> > of a stemmer to apply to the tokens, and a list of stop words to ignore
> > during lookup. One additional input control mechanism is the ability to
> > skip tokens during lookups based on particular feature values. In this way,
> > it is easy to skip, for example, all tokens with particular part of speech
> > tags, or with some previously computed semantic class.
> >
> > Output is in the form of new annotations, and the type of resulting
> > annotations can be specified in a descriptor file. The mapping from
> > dictionary entry attributes to the result annotation features can also be
> > specified. Additionally, a string containing the matched text, a list of
> > matched tokens, and the span enclosing the match can be specified to be set
> > in the result annotations. It is also possible to indicate dictionary
> > attributes to write back into each of the matched tokens.
> >
> > Dictionary lookup is controlled by three parameters in the descriptor, one
> > of which allows for order-independent lookup (i.e., A B == B A); another
> > toggles between finding only the longest match vs. finding all possible
> > matches.
> This seems to imply that the dictionary items can be multi-token things
> (as opposed to just single token lookups), with different kinds of
> matching of the input against these; is that right?


If I understand your question, the answer is yes. In the example dictionary
entry above, the "base" attribute of each variant element contains the text
to match against, and they are all made up of multiple tokens (assuming the
tokenizer breaks on whitespace).


> > The final parameter specifies the search strategy, of which there
> > are three. The default search strategy only considers contiguous tokens
> > (not including tokens from the stop word list or otherwise skipped tokens),
> > and then begins the subsequent search after the longest match. The second
> > strategy allows for ignoring non-matching tokens, allowing for disjoint
> > matches, so that a dictionary entry of
> >
> >     A C
> >
> > would match against the text
> >
> >     A B C
> >
> > As with the default search strategy, the subsequent search begins after the
> > longest match. The final search strategy is identical to the previous,
> > except that subsequent searches begin one token ahead, instead of after the
> > previous match. This enables overlapped matching.
> >
> >
> > --
> > Michael Tanenblatt
> > IBM T.J. Watson Research Center
> > 19 Skyline Drive
> > P.O. Box 704
> > Hawthorne, NY 10532
> > USA
> > Tel: +1 (914) 784 7030 t/l 863 7030
> > Fax: +1 (914) 784 6054
> > [EMAIL PROTECTED]
> >
> Thanks.  -Marshall
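
For readers who want to see how the three search strategies quoted above
differ, here is a rough Python sketch. It is a simplification under my own
assumptions, not the ConceptMapper implementation: all function names are
hypothetical, "longest match" is taken to mean the entry with the most
tokens, and stop words, span boundaries and order-independent lookup are
left out.

```python
def match_contiguous(tokens, start, entry):
    """Return the end index if entry matches tokens[start:] contiguously."""
    end = start + len(entry)
    return end if tuple(tokens[start:end]) == entry else None


def match_skipping(tokens, start, entry):
    """Like match_contiguous, but non-matching tokens may be skipped."""
    if tokens[start] != entry[0]:
        return None
    pos = start
    for wanted in entry:
        while pos < len(tokens) and tokens[pos] != wanted:
            pos += 1  # ignore a non-matching token
        if pos == len(tokens):
            return None
        pos += 1
    return pos


def lookup(tokens, entries, matcher, overlap=False):
    found, start = [], 0
    while start < len(tokens):
        best = None
        for entry in entries:
            end = matcher(tokens, start, entry)
            if end is not None and (best is None or len(entry) > len(best[0])):
                best = (entry, end)
        if best is None:
            start += 1
            continue
        found.append((" ".join(best[0]), start, best[1]))
        # Overlapped matching restarts one token ahead; otherwise the
        # next search begins after the match just found.
        start = start + 1 if overlap else best[1]
    return found


tokens = ["A", "B", "C"]
entries = [("A", "C"), ("B",)]

default = lookup(tokens, entries, match_contiguous)
skipping = lookup(tokens, entries, match_skipping)
overlapped = lookup(tokens, entries, match_skipping, overlap=True)
print(default, skipping, overlapped)
```

With the text "A B C", the default strategy finds only "B", the second
strategy finds only "A C", and the third finds both because the search
restarts at "B" after matching "A C".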
