[ 
https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734918#action_12734918
 ] 

Thilo Goetz commented on UIMA-1447:
-----------------------------------

Marshall, I don't think this is something we need to worry about.  If people 
have code working around these issues, that code will simply no longer be 
called.  For example, people might have code to check for and skip tokens that 
contain only whitespace.  I have written such code myself in the past; other 
tokenizers have similar issues.  I'm +1 for Joern's solution, and it should 
be the default as well.  I wouldn't support the old behavior at all, not even 
as an option: it would just make the code more complicated for no good reason.  
If somebody really needs the old behavior, they can use an old version of the 
tokenizer.
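
For illustration, the kind of workaround mentioned above might look like the 
following minimal sketch.  It is not taken from the tokenizer or from any 
existing UIMA code: the class and method names are hypothetical, and plain 
strings stand in for each token's covered text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of UIMA: drops tokens whose covered text
// is empty or consists only of whitespace (spaces, tabs, newlines).
class BlankTokenFilter {

    // True if the text is empty or every character is whitespace.
    static boolean isBlank(String coveredText) {
        for (int i = 0; i < coveredText.length(); i++) {
            if (!Character.isWhitespace(coveredText.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Returns a new list containing only the non-blank tokens, in order.
    static List<String> skipBlankTokens(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!isBlank(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Once the tokenizer stops emitting empty or whitespace-only tokens, a filter 
like this keeps running but no longer removes anything, so it becomes 
harmless dead weight.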

> Tabulations are annotated as tokens after a space
> -------------------------------------------------
>
>                 Key: UIMA-1447
>                 URL: https://issues.apache.org/jira/browse/UIMA-1447
>             Project: UIMA
>          Issue Type: Bug
>          Components: Sandbox-WhitespaceTokenizer
>    Affects Versions: 2.3S
>         Environment: Unix (ubuntu 8.04), Eclipse Galileo 3.5
>            Reporter: Jérôme Rocheteau
>         Attachments: patch-an-wst.txt
>
>
> This is a test-text for the Whitespace Tokenizer in the UIMA Sandbox. 
> It behaves as follows:        i.e. a '\t' character after a space is 
> annotated as a token and its covered text is set to the empty string ""! 
> I suppose it shouldn't be the case, am I wrong?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
