[ 
https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735031#action_12735031
 ] 

Jörn Kottmann commented on UIMA-1447:
-------------------------------------

I never really understood which isWhitespace should be called. There is one
overload that takes a char and one that takes an int as parameter.
The int overload was added in Java 1.5, and the javadoc says that
it must be used to also support supplementary characters.

Do we have to support supplementary characters in our text processing code?

If so, then we must first find out whether the 16-bit char is a high surrogate
code unit and, depending on that, pass either one or two code units
(as a 32-bit int), right?
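
To make the question concrete, here is a minimal sketch of what I think the
surrogate-aware loop would look like. The class and method names
(WhitespaceScan, countWhitespace) are made up for illustration and are not
taken from the WhitespaceTokenizer code. Character.codePointAt already does the
high/low surrogate check for us, and Character.charCount tells us how many
code units to skip.

```java
public final class WhitespaceScan {
    private WhitespaceScan() {}

    /**
     * Counts whitespace code points in the text. A valid surrogate pair is
     * read as one code point, so it is tested once instead of once per char.
     * Hypothetical helper, only meant to illustrate the iteration pattern.
     */
    public static int countWhitespace(CharSequence text) {
        int count = 0;
        int i = 0;
        while (i < text.length()) {
            // Returns the full code point if a surrogate pair starts at i,
            // otherwise just the char value at i.
            int codePoint = Character.codePointAt(text, i);
            if (Character.isWhitespace(codePoint)) {
                count++;
            }
            // Advance by 1 for BMP characters, by 2 for supplementary ones.
            i += Character.charCount(codePoint);
        }
        return count;
    }
}
```

For text that contains only BMP characters this should behave the same as
calling isWhitespace(char) on every char, so the change would only matter if
we decide that supplementary characters have to be supported.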

> Tabulations are annotated as tokens after a space
> -------------------------------------------------
>
>                 Key: UIMA-1447
>                 URL: https://issues.apache.org/jira/browse/UIMA-1447
>             Project: UIMA
>          Issue Type: Bug
>          Components: Sandbox-WhitespaceTokenizer
>    Affects Versions: 2.3S
>         Environment: Unix (ubuntu 8.04), Eclipse Galileo 3.5
>            Reporter: Jérôme Rocheteau
>         Attachments: patch-an-wst.txt
>
>
> This is a test-text for the Whitespace Tokenizer in the UIMA Sandbox. 
> It behaves as follows:        i.e. a '\t' character after a space is 
> annotated as a token and its covered text is set to the empty string ""! 
> I suppose it shouldn't be the case, am I wrong?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
