I think I found the problem. In the class
.....support.tokenizer.OffsetTokenizer.java the following code needs to
replace the existing code:

      TokenAnnotation returnVal = new TokenAnnotation(jcas);
      // System.out.println("token = " + token.toString() + " fold = " + foldCase(token.toString()));
      returnVal.setText(stem(foldCase(token.toString())));
      returnVal.setBegin(start);
      returnVal.setEnd(offset);

      return returnVal;

Then you need to regenerate the TokenAnnotation TypeSystem classes.

Can someone confirm the correctness of this?
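
To make clear what I expect the replacement code to do, here is a minimal standalone sketch that runs without UIMA. Everything in it is an assumption for illustration: the Token class is a hypothetical stand-in for the generated TokenAnnotation type, foldCase is assumed to lower-case the token, and stem is a toy placeholder rather than whatever stemmer ConceptMapper is actually configured with.

```java
import java.util.Locale;

public class TokenNormalizationSketch {

    // Hypothetical stand-in for the generated TokenAnnotation JCas type.
    static final class Token {
        final String text;
        final int begin;
        final int end;

        Token(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }
    }

    // Assumed behaviour of foldCase: lower-case the token text.
    static String foldCase(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Toy placeholder stemmer (assumption): strips one trailing "s".
    static String stem(String s) {
        return (s.length() > 3 && s.endsWith("s")) ? s.substring(0, s.length() - 1) : s;
    }

    // Mirrors the order of calls in the fix: fold case first, then stem,
    // and keep the original character offsets on the token.
    static Token makeToken(CharSequence token, int start, int offset) {
        return new Token(stem(foldCase(token.toString())), start, offset);
    }

    public static void main(String[] args) {
        Token t = makeToken("Tumors", 10, 16);
        System.out.println(t.text + " [" + t.begin + ", " + t.end + ")");
    }
}
```

The point is only that the stored text is the normalized form while begin and end still refer to the original span in the document.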


A quick question: what is the uima.tt package for? And, is there a reason
for not giving it a name similar to the other packages?


Thanks!

Ahmed




On Tue, Jun 17, 2008 at 4:49 PM, Ahmed Abdeen Hamed <[EMAIL PROTECTED]>
wrote:

> I happen to be in need of this feature when the email came out. I
> downloaded the source and created an eclipse project for it. However, the
> project is not compiling. All the errors are in the ...../tokenizer package.
> I would appreciate getting this to compile.
> Thanks,
> Ahmed
>
>
> On Tue, Jun 17, 2008 at 4:13 PM, Marshall Schor (JIRA) <
> [email protected]> wrote:
>
>>
>>    [
>> https://issues.apache.org/jira/browse/UIMA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605709#action_12605709]
>>
>> Marshall Schor commented on UIMA-1033:
>> --------------------------------------
>>
>> Software grant for this has been received and recorded.
>>
>> > ConceptMapper--a highly configurable, token-based dictionary lookup UIMA
>> component
>> >
>> ----------------------------------------------------------------------------------
>> >
>> >                 Key: UIMA-1033
>> >                 URL: https://issues.apache.org/jira/browse/UIMA-1033
>> >             Project: UIMA
>> >          Issue Type: New Feature
>> >          Components: Sandbox
>> >         Environment: Java 5
>> >            Reporter: Michael Tanenblatt
>> >            Priority: Minor
>> >         Attachments: conceptMapper.zip, conceptMapper.zip.md5
>> >
>> >   Original Estimate: 24h
>> >  Remaining Estimate: 24h
>> >
>> > ConceptMapper is a token-based dictionary lookup UIMA component. It was
>> > designed specifically to allow any external tokenizer that is a UIMA
>> > component to be used to tokenize its dictionary. Using the same
>> tokenizer
>> > on both the dictionary and for subsequent text processing prevents
>> > situations where a particular dictionary entry is not found, though it
>> > exists, because it was tokenized differently than the text being
>> processed.
>> > ConceptMapper is highly configurable, in terms of:
>> >  * the way dictionary entries are mapped to resultant annotations
>> >  * the way input documents are processed
>> >  * the availability of multiple lookup strategies
>> >  * its various output options.
>> > Additionally, a set of post-processing filters are supplied, as well as
>> an
>> > interface to easily create new filters. This allows for overgenerating
>> > results during the lookup phase, if so desired, then reducing the result
>> > set according to particular rules.
>> > More details:
>> > The structure of the dictionary itself is quite flexible. Entries can
>> have
>> > any number of variants (synonyms), and arbitrary features can be
>> associated
>> > with dictionary entries. Individual variants inherit features from
>> parent
>> > token (i.e., the canonical form), but can override them or add
>> additional
>> > features. In the following sample dictionary entry, there are 5 variants
>> of
>> > the canonical form, and as described earlier, each inherits the SemClass
>> > and POS attributes from the canonical form, with the exception of the
>> > variant "mesenteric fibromatosis (c48.1)", which overrides the value of
>> the
>> > SemClass attribute (this is somewhat of a contrived example, just to
>> make
>> > that point):
>> > <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
>> >    <variant base="abdominal fibromatosis" />
>> >    <variant base="abdominal desmoid" />
>> >    <variant base="mesenteric fibromatosis (c48.1)"
>> > SemClass="Diagnosis-Site" />
>> >    <variant base="mesenteric fibromatosis" />
>> >    <variant base="retroperitoneal fibromatosis" />
>> > </token>
>> > Input tokens are processed one span at a time, where both the token and
>> > span (usually a sentence) annotation type are configurable.
>> Additionally,
>> > the particular feature of the token annotation to use for lookups can be
>> > specified, otherwise its covered text is used. Other input configuration
>> > settings are whether to use case sensitive matching, an optional class
>> name
>> > of a stemmer to apply to the tokens, and a list of stop words to
>> ignore
>> > during lookup. One additional input control mechanism is the ability to
>> > skip tokens during lookups based on particular feature values. In this
>> way,
>> > it is easy to skip, for example, all tokens with particular part of
>> speech
>> > tags, or with some previously computed semantic class.
>> > Output is in the form of new annotations, and the type of resulting
>> > annotations can be specified in a descriptor file. The mapping from
>> > dictionary entry attributes to the result annotation features can also
>> be
>> > specified. Additionally, a string containing the matched text, a list of
>> > matched tokens, and the span enclosing the match can be specified to be
>> set
>> > in the result annotations. It is also possible to indicate dictionary
>> > attributes to write back into each of the matched tokens.
>> > Dictionary lookup is controlled by three parameters in the descriptor,
>> one
>> > of which allows for order-independent lookup (i.e., A B == B A), another
>> > toggles between finding only the longest match vs. finding all possible
>> > matches. The final parameter specifies the search strategy, of which
>> there
>> > are three. The default search strategy only considers contiguous tokens
>> > (not including tokens from the stop word list or otherwise skipped
>> tokens),
>> > and then begins the subsequent search after the longest match. The
>> second
>> > strategy allows for ignoring non-matching tokens, allowing for disjoint
>> > matches, so that a dictionary entry of
>> >     A C
>> > would match against the text
>> >     A B C
>> > As with the default search strategy, the subsequent search begins after
>> the
>> > longest match. The final search strategy is identical to the previous,
>> > except that subsequent searches begin one token ahead, instead of after
>> the
>> > previous match. This enables overlapped matching.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>
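
The difference between the contiguous and the skipping search strategies described in the quoted issue can be sketched roughly as follows. This is my own simplified reading, not ConceptMapper's actual code: the class and method names are hypothetical, a dictionary entry is reduced to a non-empty list of token strings, and stop words, spans, and order-independent lookup are ignored.

```java
import java.util.Arrays;
import java.util.List;

public class LookupStrategySketch {

    // Contiguous matching: the entry tokens must appear back to back,
    // starting at 'start'. Returns the index just past the match, or -1.
    static int matchContiguous(List<String> text, int start, List<String> entry) {
        if (start + entry.size() > text.size()) {
            return -1;
        }
        for (int i = 0; i < entry.size(); i++) {
            if (!text.get(start + i).equals(entry.get(i))) {
                return -1;
            }
        }
        return start + entry.size();
    }

    // Skipping matching: the first entry token must match at 'start';
    // after that, non-matching text tokens are ignored, so the match may
    // be disjoint. Returns the index just past the last matched token, or -1.
    static int matchSkipping(List<String> text, int start, List<String> entry) {
        int matched = 0;
        int end = -1;
        for (int i = start; i < text.size() && matched < entry.size(); i++) {
            if (text.get(i).equals(entry.get(matched))) {
                matched++;
                end = i + 1;
            } else if (matched == 0) {
                return -1; // the match has to begin at 'start'
            }
        }
        return matched == entry.size() ? end : -1;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("A", "B", "C");
        List<String> entry = Arrays.asList("A", "C");
        System.out.println("contiguous: " + matchContiguous(text, 0, entry));
        System.out.println("skipping:   " + matchSkipping(text, 0, entry));
    }
}
```

Under this reading, the second and third strategies would share the skipping matcher and differ only in where the next search starts: at the returned end index for the second, and at start + 1 for the third, which is what permits overlapping matches.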
