[ 
https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735089#action_12735089
 ] 

Thilo Goetz commented on UIMA-1447:
-----------------------------------

That would probably be the only place in the UIMA code where we handle 
surrogates correctly.  I wouldn't bother.  All our processing (like the 
"character" offsets) is done in terms of 16-bit code units, not code points 
(i.e., characters).  If Java ever switches to 32-bit code units, we'll have to 
make that move, too, and that should automatically make things work more 
correctly.  I don't think that's in the cards for the mid-term future, though.  
Too many things are riding on 16-bit code units.
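
To make the code unit vs. code point distinction concrete, here is a minimal 
sketch using only standard java.lang.String methods.  The class and helper 
names (CodeUnitDemo, codeUnits, codePoints, codeUnitOffset) are made up for 
illustration and are not part of UIMA.

```java
public class CodeUnitDemo {

    // Number of 16-bit code units, which is what String.length() reports.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Converts a code point index into the corresponding code unit offset.
    public static int codeUnitOffset(String s, int codePointIndex) {
        return s.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        // 'a', U+1D11E (encoded as the surrogate pair D834 DD1E), 'b'
        String text = "a\uD834\uDD1Eb";
        System.out.println("code units:  " + codeUnits(text));
        System.out.println("code points: " + codePoints(text));
        System.out.println("offset of 3rd code point: " + codeUnitOffset(text, 2));
    }
}
```

Offsets counted in code units, as described above, would put the 'b' in this 
string at 3, whereas a count in code points would put it at 2.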


> Tabulations are annotated as tokens after a space
> -------------------------------------------------
>
>                 Key: UIMA-1447
>                 URL: https://issues.apache.org/jira/browse/UIMA-1447
>             Project: UIMA
>          Issue Type: Bug
>          Components: Sandbox-WhitespaceTokenizer
>    Affects Versions: 2.3S
>         Environment: Unix (ubuntu 8.04), Eclipse Galileo 3.5
>            Reporter: Jérôme Rocheteau
>         Attachments: patch-an-wst.txt
>
>
> This is a test-text for the Whitespace Tokenizer in the UIMA Sandbox. 
> It behaves as follows:        i.e. a '\t' character after a space is 
> annotated as a token and its covered text is set to the empty string ""! 
> I suppose it shouldn't be the case, am I wrong?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
