I happened to need this feature when the email came out. I downloaded the source and created an Eclipse project for it. However, the project does not compile. All the errors are in the ...../tokenizer package. I would appreciate any help getting this to compile.

Thanks,
Ahmed
On Tue, Jun 17, 2008 at 4:13 PM, Marshall Schor (JIRA) <[email protected]> wrote:

> [ https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605709#action_12605709 ]
>
> Marshall Schor commented on UIMA-1033:
> --------------------------------------
>
> Software grant for this has been received and recorded.
>
> > ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component
> > ----------------------------------------------------------------------------------
> >
> >                 Key: UIMA-1033
> >                 URL: https://issues.apache.org/jira/browse/UIMA-1033
> >             Project: UIMA
> >          Issue Type: New Feature
> >          Components: Sandbox
> >         Environment: Java 5
> >            Reporter: Michael Tanenblatt
> >            Priority: Minor
> >         Attachments: conceptMapper.zip, conceptMapper.zip.md5
> >
> >   Original Estimate: 24h
> >  Remaining Estimate: 24h
> >
> > ConceptMapper is a token-based dictionary lookup UIMA component. It was
> > designed specifically to allow any external tokenizer that is a UIMA
> > component to be used to tokenize its dictionary. Using the same tokenizer
> > for both the dictionary and subsequent text processing prevents
> > situations where a particular dictionary entry is not found, though it
> > exists, because it was tokenized differently than the text being
> > processed.
> >
> > ConceptMapper is highly configurable, in terms of:
> > * the way dictionary entries are mapped to resultant annotations
> > * the way input documents are processed
> > * the availability of multiple lookup strategies
> > * its various output options.
> >
> > Additionally, a set of post-processing filters is supplied, as well as an
> > interface to easily create new filters. This allows for overgenerating
> > results during the lookup phase, if so desired, then reducing the result
> > set according to particular rules.
> >
> > More details:
> >
> > The structure of the dictionary itself is quite flexible. Entries can have
> > any number of variants (synonyms), and arbitrary features can be
> > associated with dictionary entries. Individual variants inherit features
> > from their parent token (i.e., the canonical form), but can override them
> > or add additional features. In the following sample dictionary entry,
> > there are 5 variants of the canonical form, and as described earlier, each
> > inherits the SemClass and POS attributes from the canonical form, with the
> > exception of the variant "mesenteric fibromatosis (c48.1)", which
> > overrides the value of the SemClass attribute (this is somewhat of a
> > contrived example, just to make that point):
> >
> > <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
> >    <variant base="abdominal fibromatosis" />
> >    <variant base="abdominal desmoid" />
> >    <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
> >    <variant base="mesenteric fibromatosis" />
> >    <variant base="retroperitoneal fibromatosis" />
> > </token>
> >
> > Input tokens are processed one span at a time, where both the token and
> > span (usually a sentence) annotation type are configurable. Additionally,
> > the particular feature of the token annotation to use for lookups can be
> > specified; otherwise, its covered text is used. Other input configuration
> > settings are whether to use case-sensitive matching, an optional class
> > name of a stemmer to apply to the tokens, and a list of stop words to
> > ignore during lookup. One additional input control mechanism is the
> > ability to skip tokens during lookups based on particular feature values.
> > In this way, it is easy to skip, for example, all tokens with particular
> > part-of-speech tags, or with some previously computed semantic class.
> >
> > Output is in the form of new annotations, and the type of resulting
> > annotations can be specified in a descriptor file. The mapping from
> > dictionary entry attributes to the result annotation features can also be
> > specified. Additionally, a string containing the matched text, a list of
> > matched tokens, and the span enclosing the match can be specified to be
> > set in the result annotations. It is also possible to indicate dictionary
> > attributes to write back into each of the matched tokens.
> >
> > Dictionary lookup is controlled by three parameters in the descriptor.
> > One allows for order-independent lookup (i.e., A B == B A); another
> > toggles between finding only the longest match vs. finding all possible
> > matches. The final parameter specifies the search strategy, of which
> > there are three. The default search strategy only considers contiguous
> > tokens (not including tokens from the stop word list or otherwise skipped
> > tokens), and then begins the subsequent search after the longest match.
> > The second strategy allows for ignoring non-matching tokens, allowing for
> > disjoint matches, so that a dictionary entry of
> >    A C
> > would match against the text
> >    A B C
> > As with the default search strategy, the subsequent search begins after
> > the longest match. The final search strategy is identical to the
> > previous, except that subsequent searches begin one token ahead, instead
> > of after the previous match. This enables overlapped matching.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
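
For readers trying to picture the three search strategies described in the quoted issue, here is a minimal sketch in plain Java (8 or later). It is not ConceptMapper's code: the class LookupSketch, the Strategy constants, and both method names are hypothetical, a dictionary entry is assumed to be a non-empty list of tokens, the whole token list is treated as one span, and "longest match" is taken to mean the match that ends furthest to the right. Stop words, stemming, and order-independent lookup are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from ConceptMapper.
class LookupSketch {
    enum Strategy { CONTIGUOUS, SKIP_ANY, SKIP_ANY_OVERLAP }

    // Returns the index just past the last matched token, or -1 if the entry
    // does not match beginning exactly at 'start'. Assumes a non-empty entry.
    static int matchEnd(List<String> text, int start, List<String> entry, boolean allowGaps) {
        if (!text.get(start).equals(entry.get(0))) {
            return -1; // a match must begin at the current token
        }
        int pos = start + 1;
        for (int i = 1; i < entry.size(); i++) {
            if (allowGaps) {
                // Skip over non-matching tokens, which permits disjoint matches.
                while (pos < text.size() && !text.get(pos).equals(entry.get(i))) {
                    pos++;
                }
            }
            if (pos >= text.size() || !text.get(pos).equals(entry.get(i))) {
                return -1;
            }
            pos++;
        }
        return pos;
    }

    // Reports the longest match at each search position as "entry@startIndex".
    static List<String> lookup(List<String> text, List<List<String>> dict, Strategy strategy) {
        List<String> found = new ArrayList<>();
        boolean allowGaps = strategy != Strategy.CONTIGUOUS;
        int start = 0;
        while (start < text.size()) {
            int bestEnd = -1;
            List<String> best = null;
            for (List<String> entry : dict) {
                int end = matchEnd(text, start, entry, allowGaps);
                if (end > bestEnd) {
                    bestEnd = end;
                    best = entry;
                }
            }
            if (best != null) {
                found.add(String.join(" ", best) + "@" + start);
            }
            // The first two strategies resume after the match; the overlap
            // strategy always advances by a single token.
            if (best != null && strategy != Strategy.SKIP_ANY_OVERLAP) {
                start = bestEnd;
            } else {
                start++;
            }
        }
        return found;
    }
}
```

With the text A B C D and the two entries "A C" and "B D", the contiguous strategy in this sketch finds nothing, the skip strategy finds only "A C" (the search resumes after C, so B is never tried as a starting point), and the overlap strategy finds both.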
