I happened to need this feature just as the email came out. I downloaded
the source and created an Eclipse project for it. However, the project does
not compile; all the errors are in the ...../tokenizer package.
I would appreciate any help getting this to compile.
Thanks,
Ahmed

On Tue, Jun 17, 2008 at 4:13 PM, Marshall Schor (JIRA) <
[email protected]> wrote:

>
>    [
> https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605709#action_12605709]
>
> Marshall Schor commented on UIMA-1033:
> --------------------------------------
>
> Software grant for this has been received and recorded.
>
> > ConceptMapper--a highly configurable, token-based dictionary lookup UIMA
> component
> >
> ----------------------------------------------------------------------------------
> >
> >                 Key: UIMA-1033
> >                 URL: https://issues.apache.org/jira/browse/UIMA-1033
> >             Project: UIMA
> >          Issue Type: New Feature
> >          Components: Sandbox
> >         Environment: Java 5
> >            Reporter: Michael Tanenblatt
> >            Priority: Minor
> >         Attachments: conceptMapper.zip, conceptMapper.zip.md5
> >
> >   Original Estimate: 24h
> >  Remaining Estimate: 24h
> >
> > ConceptMapper is a token-based dictionary lookup UIMA component. It was
> > designed specifically to allow any external tokenizer that is a UIMA
> > component to be used to tokenize its dictionary. Using the same tokenizer
> > on both the dictionary and for subsequent text processing prevents
> > situations where a particular dictionary entry is not found, though it
> > exists, because it was tokenized differently than the text being
> processed.
> > ConceptMapper is highly configurable, in terms of:
> >  * the way dictionary entries are mapped to resultant annotations
> >  * the way input documents are processed
> >  * the availability of multiple lookup strategies
> >  * its various output options.
> > Additionally, a set of post-processing filters is supplied, as well as
> an
> > interface to easily create new filters. This allows for overgenerating
> > results during the lookup phase, if so desired, then reducing the result
> > set according to particular rules.
> > More details:
> > The structure of the dictionary itself is quite flexible. Entries can
> have
> > any number of variants (synonyms), and arbitrary features can be
> associated
> > with dictionary entries. Individual variants inherit features from the parent
> > token (i.e., the canonical form), but can override them or add additional
> > features. In the following sample dictionary entry, there are 5 variants
> of
> > the canonical form, and as described earlier, each inherits the SemClass
> > and POS attributes from the canonical form, with the exception of the
> > variant "mesenteric fibromatosis (c48.1)", which overrides the value of
> the
> > SemClass attribute (this is somewhat of a contrived example, just to make
> > that point):
> > <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
> >    <variant base="abdominal fibromatosis" />
> >    <variant base="abdominal desmoid" />
> >    <variant base="mesenteric fibromatosis (c48.1)"
> > SemClass="Diagnosis-Site" />
> >    <variant base="mesenteric fibromatosis" />
> >    <variant base="retroperitoneal fibromatosis" />
> > </token>
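
To make the inheritance rule above concrete, here is a minimal Python sketch that resolves the effective features of each variant in the sample entry. It is only an illustration of the rule as described, not ConceptMapper's own code (which is Java); the function name `resolve_variants` and the choice to carry the canonical form along as a `canonical` feature are assumptions of mine.

```python
import xml.etree.ElementTree as ET

# The sample dictionary entry from the description above.
ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
   <variant base="abdominal fibromatosis" />
   <variant base="abdominal desmoid" />
   <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
   <variant base="mesenteric fibromatosis" />
   <variant base="retroperitoneal fibromatosis" />
</token>
"""


def resolve_variants(entry_xml):
    """Map each variant's base text to its effective features.

    Hypothetical helper: features set on <token> are inherited by every
    variant, and attributes set on a <variant> override or extend them.
    """
    token = ET.fromstring(entry_xml.strip())
    # Everything on <token> except the canonical form counts as a feature.
    inherited = {k: v for k, v in token.attrib.items() if k != "canonical"}
    resolved = {}
    for variant in token.findall("variant"):
        features = dict(inherited)
        # Variant-level attributes win over inherited ones.
        features.update(
            {k: v for k, v in variant.attrib.items() if k != "base"}
        )
        features["canonical"] = token.get("canonical")
        resolved[variant.get("base")] = features
    return resolved


if __name__ == "__main__":
    for base, features in resolve_variants(ENTRY).items():
        print(base, features)
```

With this sketch, only "mesenteric fibromatosis (c48.1)" ends up with a SemClass other than "Diagnosis", and every variant keeps the inherited POS value.
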
> > Input tokens are processed one span at a time, where both the token and
> > span (usually a sentence) annotation type are configurable. Additionally,
> > the particular feature of the token annotation to use for lookups can be
> > specified, otherwise its covered text is used. Other input configuration
> > settings are whether to use case sensitive matching, an optional class
> name
> > of a stemmer to apply to the tokens, and a list of stop words to
> ignore
> > during lookup. One additional input control mechanism is the ability to
> > skip tokens during lookups based on particular feature values. In this
> way,
> > it is easy to skip, for example, all tokens with particular part of
> speech
> > tags, or with some previously computed semantic class.
> > Output is in the form of new annotations, and the type of resulting
> > annotations can be specified in a descriptor file. The mapping from
> > dictionary entry attributes to the result annotation features can also be
> > specified. Additionally, a string containing the matched text, a list of
> > matched tokens, and the span enclosing the match can be specified to be
> set
> > in the result annotations. It is also possible to indicate dictionary
> > attributes to write back into each of the matched tokens.
> > Dictionary lookup is controlled by three parameters in the descriptor,
> one
> > of which allows for order-independent lookup (i.e., A B == B A), another
> > toggles between finding only the longest match vs. finding all possible
> > matches. The final parameter specifies the search strategy, of which
> there
> > are three. The default search strategy only considers contiguous tokens
> > (not including tokens from the stop word list or otherwise skipped
> tokens),
> > and then begins the subsequent search after the longest match. The second
> > strategy allows for ignoring non-matching tokens, allowing for disjoint
> > matches, so that a dictionary entry of
> >     A C
> > would match against the text
> >     A B C
> > As with the default search strategy, the subsequent search begins after
> the
> > longest match. The final search strategy is identical to the previous,
> > except that subsequent searches begin one token ahead, instead of after
> the
> > previous match. This enables overlapped matching.
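
As I read the three search strategies described above, they could be sketched roughly as follows in Python. This is a simplified illustration under my own assumptions, not the component's actual algorithm: the names `match_at` and `lookup` and the flags `allow_gaps` and `overlap` are hypothetical, dictionary entries are assumed to be non-empty tuples of tokens, "longest" is taken to mean the entry with the most tokens, and stop words, stemming, and order-independent lookup are left out.

```python
def match_at(tokens, start, entry, allow_gaps):
    """Return the index of the last matched token, or None if no match.

    The entry must begin at `start`. With allow_gaps, non-matching tokens
    between the entry's tokens are skipped (disjoint matching).
    """
    if tokens[start] != entry[0]:
        return None
    last = start
    for word in entry[1:]:
        pos = last + 1
        if allow_gaps:
            # Skip over tokens that do not match the next entry token.
            while pos < len(tokens) and tokens[pos] != word:
                pos += 1
        if pos >= len(tokens) or tokens[pos] != word:
            return None
        last = pos
    return last


def lookup(tokens, dictionary, allow_gaps=False, overlap=False):
    """Return (start, end, entry) for the longest match at each position.

    Defaults: contiguous tokens only, next search resumes after the match.
    allow_gaps=True: second strategy (disjoint matches).
    allow_gaps=True, overlap=True: third strategy (resume one token ahead).
    """
    matches = []
    i = 0
    while i < len(tokens):
        best = None
        for entry in dictionary:
            end = match_at(tokens, i, entry, allow_gaps)
            if end is not None and (best is None or len(entry) > len(best[0])):
                best = (entry, end)
        if best is None:
            i += 1
            continue
        matches.append((i, best[1], best[0]))
        # Overlapped matching resumes one token ahead of the match start.
        i = i + 1 if overlap else best[1] + 1
    return matches


if __name__ == "__main__":
    text = "A B C".split()
    print(lookup(text, [("A", "C")]))                   # contiguous only
    print(lookup(text, [("A", "C")], allow_gaps=True))  # disjoint allowed
```

On the text "A B C", the entry "A C" is found only when gaps are allowed. Adding a second entry "B C" and setting `overlap=True` yields both matches, since the search for "B C" then starts inside the span already covered by "A C".
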
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
