Re: [jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component

Michael Tanenblatt Tue, 17 Jun 2008 17:06:33 -0700

As Thilo mentioned in an email from May 19, 2008, I forgot to include the source for uima.tt.TokenAnnotation, but otherwise the code should be fine.

Additionally, the problem you are seeing is with OffsetTokenizer, which is just a sample tokenizer--if you have another, more robust tokenizer, you don't need this OffsetTokenizer.



On Jun 17, 2008, at 6:23 PM, Ahmed Abdeen Hamed wrote:

I think I found the problem. In the class
.....support.tokenizer.OffsetTokenizer.java the following code needs to
replace the existing code:

     TokenAnnotation returnVal = new TokenAnnotation(jcas);
     // System.out.println("token = " + token.toString() + " fold =" +
     //     foldCase(token.toString()));
     returnVal.setText(stem(foldCase(token.toString())));
     returnVal.setBegin(start);
     returnVal.setEnd(offset);
     return returnVal;

Then you need to regenerate the TokenAnnotation TypeSystem classes.

Can someone confirm the correctness of this?
A quick question: what is the uima.tt package for? And is there a reason
for not giving it a name similar to the other packages?


Thanks!

Ahmed
On Tue, Jun 17, 2008 at 4:49 PM, Ahmed Abdeen Hamed <[EMAIL PROTECTED]>
wrote:
I happened to be in need of this feature when the email came out. I
downloaded the source and created an Eclipse project for it. However,
the project is not compiling. All the errors are in the
...../tokenizer package.
I would appreciate help getting this to compile.
Thanks,
Ahmed


On Tue, Jun 17, 2008 at 4:13 PM, Marshall Schor (JIRA) <
[email protected]> wrote:
  [
https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605709#action_12605709]
Marshall Schor commented on UIMA-1033:
--------------------------------------

Software grant for this has been received and recorded.
ConceptMapper--a highly configurable, token-based dictionary lookup UIMA
component
----------------------------------------------------------------------------------
               Key: UIMA-1033
               URL: https://issues.apache.org/jira/browse/UIMA-1033
           Project: UIMA
        Issue Type: New Feature
        Components: Sandbox
       Environment: Java 5
          Reporter: Michael Tanenblatt
          Priority: Minor
       Attachments: conceptMapper.zip, conceptMapper.zip.md5

 Original Estimate: 24h
Remaining Estimate: 24h
ConceptMapper is a token-based dictionary lookup UIMA component. It was
designed specifically to allow any external tokenizer that is a UIMA
component to be used to tokenize its dictionary. Using the same
tokenizer for both the dictionary and subsequent text processing
prevents situations where a particular dictionary entry is not found,
though it exists, because it was tokenized differently than the text
being processed.
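To make the tokenization point concrete, here is a minimal sketch of how
two different tokenizers split the same dictionary string. The class and
method names are hypothetical and are not part of ConceptMapper; both
tokenizers are toy stand-ins written only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not ConceptMapper code.
final class TokenizerConsistency {

    /** Toy tokenizer that splits on whitespace only. */
    static List<String> whitespaceTokens(String s) {
        return Arrays.asList(s.trim().split("\\s+"));
    }

    /** Toy tokenizer that also turns every punctuation character into its own token. */
    static List<String> punctuationTokens(String s) {
        // Surround each character that is neither alphanumeric nor whitespace with spaces.
        String spaced = s.replaceAll("([^\\p{Alnum}\\s])", " $1 ");
        return Arrays.asList(spaced.trim().split("\\s+"));
    }
}
```

If the dictionary were tokenized with the first method and the document
text with the second, an entry containing "(c48.1)" would be stored as a
single token but would appear in the text as five tokens, so a
token-by-token comparison could never match it.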
ConceptMapper is highly configurable, in terms of:
* the way dictionary entries are mapped to resultant annotations
* the way input documents are processed
* the availability of multiple lookup strategies
* its various output options.
Additionally, a set of post-processing filters is supplied, as well as
an interface for easily creating new filters. This allows for
overgenerating results during the lookup phase, if so desired, then
reducing the result set according to particular rules.
More details:
The structure of the dictionary itself is quite flexible. Entries can
have any number of variants (synonyms), and arbitrary features can be
associated with dictionary entries. Individual variants inherit features
from the parent token (i.e., the canonical form), but can override them
or add additional features. In the following sample dictionary entry,
there are 5 variants of the canonical form, and as described earlier,
each inherits the SemClass and POS attributes from the canonical form,
with the exception of the variant "mesenteric fibromatosis (c48.1)",
which overrides the value of the SemClass attribute (this is somewhat of
a contrived example, just to make that point):
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
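The inheritance rule for a dictionary entry like this one can be
sketched roughly as follows. This is not ConceptMapper's own dictionary
loader: the class and method names are hypothetical, and it uses only
the JDK's DOM parser to compute the effective attributes of each
variant.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not ConceptMapper code.
final class VariantAttributes {

    /** Maps each variant's base text to its effective attributes. */
    static Map<String, Map<String, String>> resolve(String tokenXml) {
        Element token;
        try {
            token = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(tokenXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed dictionary entry", e);
        }

        // Attributes of the canonical form, inherited by every variant.
        Map<String, String> inherited = new LinkedHashMap<>();
        copyAttributes(token, "canonical", inherited);

        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        NodeList variants = token.getElementsByTagName("variant");
        for (int i = 0; i < variants.getLength(); i++) {
            Element variant = (Element) variants.item(i);
            Map<String, String> effective = new LinkedHashMap<>(inherited);
            // A variant's own attributes override inherited ones or add new ones.
            copyAttributes(variant, "base", effective);
            result.put(variant.getAttribute("base"), effective);
        }
        return result;
    }

    /** Copies every attribute of the element into target, except the one named skip. */
    private static void copyAttributes(Element element, String skip, Map<String, String> target) {
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            String name = attrs.item(i).getNodeName();
            if (!name.equals(skip)) {
                target.put(name, attrs.item(i).getNodeValue());
            }
        }
    }
}
```

Applied to the entry above, the variant "abdominal desmoid" would end up
with SemClass "Diagnosis" and POS "NN", while "mesenteric fibromatosis
(c48.1)" would keep POS "NN" but carry SemClass "Diagnosis-Site".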
Input tokens are processed one span at a time, where both the token and
span (usually a sentence) annotation types are configurable.
Additionally, the particular feature of the token annotation to use for
lookups can be specified; otherwise its covered text is used. Other
input configuration settings are whether to use case-sensitive matching,
an optional class name of a stemmer to apply to the tokens, and a list
of stop words to ignore during lookup. One additional input control
mechanism is the ability to skip tokens during lookups based on
particular feature values. In this way, it is easy to skip, for example,
all tokens with particular part-of-speech tags, or with some previously
computed semantic class.
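The input controls described above could be sketched along these lines.
Everything here is hypothetical (the Token class, the method name, and
the parameters are not ConceptMapper's API), stemming is left out, and
the sketch assumes that stop words are compared after case folding,
which the description does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not ConceptMapper code.
final class LookupInputFilter {

    /** Hypothetical token: covered text plus a part-of-speech tag. */
    static final class Token {
        final String text;
        final String pos;

        Token(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }
    }

    /** Returns the lookup keys of the tokens in one span that survive filtering, in order. */
    static List<String> lookupKeys(List<Token> span, boolean caseSensitive,
                                   Set<String> stopWords, Set<String> skippedPos) {
        List<String> keys = new ArrayList<>();
        for (Token token : span) {
            if (skippedPos.contains(token.pos)) {
                continue; // skipped because of a feature value (here: the POS tag)
            }
            // Assumption: case folding happens before the stop word check.
            String key = caseSensitive ? token.text : token.text.toLowerCase(Locale.ROOT);
            if (stopWords.contains(key)) {
                continue; // stop words are ignored during lookup
            }
            keys.add(key);
        }
        return keys;
    }
}
```

For the span "Fibromatosis of the Abdomen" with case-insensitive
matching, the stop word "the", and tokens tagged "IN" skipped, only
"fibromatosis" and "abdomen" would take part in the lookup.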
Output is in the form of new annotations, and the type of resulting
annotations can be specified in a descriptor file. The mapping from
dictionary entry attributes to the result annotation features can also
be specified. Additionally, a string containing the matched text, a list
of matched tokens, and the span enclosing the match can be specified to
be set in the result annotations. It is also possible to indicate
dictionary attributes to write back into each of the matched tokens.
Dictionary lookup is controlled by three parameters in the descriptor.
One allows for order-independent lookup (i.e., A B == B A); another
toggles between finding only the longest match and finding all possible
matches. The final parameter specifies the search strategy, of which
there are three. The default search strategy only considers contiguous
tokens (not including tokens from the stop word list or otherwise
skipped tokens), and then begins the subsequent search after the longest
match. The second strategy allows for ignoring non-matching tokens,
allowing for disjoint matches, so that a dictionary entry of
   A C
would match against the text
   A B C
As with the default search strategy, the subsequent search begins after
the longest match. The final search strategy is identical to the
previous one, except that subsequent searches begin one token ahead,
instead of after the previous match. This enables overlapped matching.
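The difference between the three search strategies can be sketched as
follows. This is a simplified illustration, not ConceptMapper's lookup
code: the class, enum, and method names are hypothetical, only
longest-match mode is shown, "longest" is taken to mean the match that
ends furthest to the right, and "one token ahead" is read as one token
after the start of the previous match.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not ConceptMapper code.
final class SearchStrategySketch {

    enum Strategy { CONTIGUOUS, SKIP_NON_MATCHING, SKIP_NON_MATCHING_OVERLAP }

    /** Returns the exclusive end index of a match of entry beginning at start, or -1. */
    static int matchAt(List<String> text, int start, List<String> entry, boolean allowGaps) {
        if (entry.isEmpty()) {
            return -1;
        }
        int pos = start;
        for (int i = 0; i < entry.size(); i++) {
            // Only entry tokens after the first may be reached by skipping text tokens.
            while (allowGaps && i > 0 && pos < text.size()
                    && !text.get(pos).equals(entry.get(i))) {
                pos++;
            }
            if (pos >= text.size() || !text.get(pos).equals(entry.get(i))) {
                return -1;
            }
            pos++;
        }
        return pos;
    }

    /** Returns matches as {begin, end} token index pairs, with end exclusive. */
    static List<int[]> search(List<String> text, List<List<String>> dictionary,
                              Strategy strategy) {
        boolean allowGaps = strategy != Strategy.CONTIGUOUS;
        List<int[]> matches = new ArrayList<>();
        int start = 0;
        while (start < text.size()) {
            int bestEnd = -1;
            for (List<String> entry : dictionary) {
                bestEnd = Math.max(bestEnd, matchAt(text, start, entry, allowGaps));
            }
            if (bestEnd < 0) {
                start++; // nothing matches here, try the next token
                continue;
            }
            matches.add(new int[] {start, bestEnd});
            // Overlap strategy restarts one token later; the others restart after the match.
            start = (strategy == Strategy.SKIP_NON_MATCHING_OVERLAP) ? start + 1 : bestEnd;
        }
        return matches;
    }
}
```

With the single dictionary entry "A C", the text "A B C" yields no match
under the contiguous strategy and one match under the second strategy.
For the text "A B A C", the second strategy reports one match covering
all four tokens, while the overlapping strategy additionally reports the
match that starts at the second "A".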
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
