[ https://issues.apache.org/jira/browse/OPENNLP-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607527#comment-13607527 ]
Joern Kottmann edited comment on OPENNLP-546 at 1/11/17 12:32 PM:
------------------------------------------------------------------

The current implementation of the cTAKES PTB tokenizer outputs newline tokens, but the OpenNLP tokenizers don't support this yet.
There are two ways of supporting this:
- Only output the tokens without newline tokens and add the newline tokens in a second run, e.g. by a UIMA AE
- Extend the OpenNLP tokenizer a bit and support layout tags (e.g. <NEWLINE>, or a span with this as the type)

was (Author: joern):
The current implementation of the PTB tokenizer outputs newline tokens, but the OpenNLP tokenizers don't support this yet.
There are two ways of supporting this:
- Only output the tokens without newline tokens and add the newline tokens in a second run, e.g. by a UIMA AE
- Extend the OpenNLP tokenizer a bit and support layout tags (e.g. <NEWLINE>, or a span with this as the type)

> Add TokenizerPTB
> ----------------
>
>                 Key: OPENNLP-546
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-546
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Tokenizer
>            Reporter: Pei Chen
>            Priority: Minor
>              Labels: help-wanted
>
> Add Tokenizer based on Penn Tree Bank rules.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
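
The first option in the comment above (tokenize without newline tokens, then add them in a second run) could look roughly like the sketch below. It is only an illustration, not the cTAKES or OpenNLP implementation: the `TypedSpan` class, the `NEWLINE` type label and the `addNewlineTokens` method are hypothetical names, and the sketch assumes the input token spans are sorted by start offset and never cover a newline character.

```java
import java.util.ArrayList;
import java.util.List;

/** Second-pass sketch: merges newline tokens into an existing list of token spans. */
class NewlineTokenPass {

    /** Hypothetical stand-in for a typed span: [start, end) character offsets plus a type label. */
    static final class TypedSpan {
        final int start;
        final int end;
        final String type;

        TypedSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Assumed type label for layout tokens; the comment suggests something like <NEWLINE>.
    static final String NEWLINE_TYPE = "NEWLINE";

    /**
     * Returns a new list containing the original tokens plus one NEWLINE span
     * for every '\n' in the text, in document order.
     * Assumes tokens are sorted by start offset and do not contain '\n'.
     */
    static List<TypedSpan> addNewlineTokens(String text, List<TypedSpan> tokens) {
        List<TypedSpan> result = new ArrayList<>();
        int next = 0; // index of the next original token to copy

        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '\n') {
                continue;
            }
            // Copy every token that starts before this newline.
            while (next < tokens.size() && tokens.get(next).start < i) {
                result.add(tokens.get(next));
                next++;
            }
            result.add(new TypedSpan(i, i + 1, NEWLINE_TYPE));
        }

        // Copy the tokens that follow the last newline.
        while (next < tokens.size()) {
            result.add(tokens.get(next));
            next++;
        }
        return result;
    }
}
```

In a UIMA pipeline this second pass would presumably run as its own annotator after the tokenizer. The second option from the comment would instead have the tokenizer itself emit spans carrying such a type.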