[ https://issues.apache.org/jira/browse/UIMA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734918#action_12734918 ]
Thilo Goetz commented on UIMA-1447:
-----------------------------------

Marshall,

I don't think this is something we need to worry about. If people have code working around these issues, that code will simply no longer be called. For example, people might have code to check and skip tokens that just contain whitespace. I have written such code myself in the past; other tokenizers have similar issues.

I'm +1 for Joern's solution, and that should be the default as well. I wouldn't even support the old behavior, not even with an option. It'll just make the code more complicated for no good reason. If somebody really desperately needs the old behavior, they can use an old version of the tokenizer.

> Tabulations are annotated as tokens after a space
> -------------------------------------------------
>
>                 Key: UIMA-1447
>                 URL: https://issues.apache.org/jira/browse/UIMA-1447
>             Project: UIMA
>          Issue Type: Bug
>          Components: Sandbox-WhitespaceTokenizer
>    Affects Versions: 2.3S
>         Environment: Unix (ubuntu 8.04), Eclipse Galileo 3.5
>            Reporter: Jérôme Rocheteau
>         Attachments: patch-an-wst.txt
>
>
> This is a test-text for the Whitespace Tokenizer in the UIMA Sandbox.
> It behaves as follows: i.e. a '\t' character after a space is
> annotated as a token and its covered text is set to the empty string ""!
> I suppose it shouldn't be the case, am I wrong?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
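
For illustration only, the kind of workaround mentioned in the comment above (checking for and skipping tokens that contain only whitespace) might look roughly like the sketch below. It is not taken from the attached patch or from the WhitespaceTokenizer sources. The class `BlankTokenFilter` and its methods are hypothetical names, and plain strings stand in for the covered text of token annotations so that the example runs without UIMA on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of UIMA: plain strings stand in for
// the covered text of token annotations.
final class BlankTokenFilter {

    // True if the covered text is empty or consists only of whitespace characters.
    static boolean isBlank(String coveredText) {
        for (int i = 0; i < coveredText.length(); i++) {
            if (!Character.isWhitespace(coveredText.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Keeps only the tokens whose covered text has at least one non-whitespace character.
    static List<String> skipBlank(List<String> coveredTexts) {
        List<String> kept = new ArrayList<>();
        for (String text : coveredTexts) {
            if (!isBlank(text)) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "" models the empty token reported for a tab that follows a space.
        List<String> tokens = Arrays.asList("This", "", "is", "\t", "a", "test");
        System.out.println(skipBlank(tokens)); // prints [This, is, a, test]
    }
}
```

Once the tokenizer itself stops emitting such tokens, a filter like this simply never finds anything to skip, which is the point made in the comment: existing workarounds keep compiling and just stop being triggered.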