[jira] [Comment Edited] (OPENNLP-546) Add TokenizerPTB

Joern Kottmann (JIRA) Wed, 11 Jan 2017 04:33:12 -0800

    [ https://issues.apache.org/jira/browse/OPENNLP-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607527#comment-13607527 ]


Joern Kottmann edited comment on OPENNLP-546 at 1/11/17 12:32 PM:
------------------------------------------------------------------

The current implementation of the cTAKES PTB tokenizer outputs newline tokens, 
but the OpenNLP tokenizers don't support this yet.

There are two ways of supporting this:
- Output only the regular tokens (no newline tokens) and add the newline 
tokens in a second pass, e.g. with a UIMA analysis engine (AE)
- Extend the OpenNLP tokenizer slightly to support layout tags (e.g. <NEWLINE>, 
or a span with this as its type) 
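
A minimal sketch of the first option, with the result represented as typed 
spans as in the second. It is only an illustration: TypedSpan is a 
hypothetical stand-in (not the OpenNLP Span class), the "NEWLINE" and "TOKEN" 
type names are assumptions, and the input token spans are assumed to be sorted 
by offset and to contain no newline characters.

```java
import java.util.ArrayList;
import java.util.List;

public class NewlineSpanSketch {

    /** Hypothetical typed character span [start, end); not the OpenNLP Span class. */
    public static final class TypedSpan {
        public final int start;
        public final int end;
        public final String type;

        public TypedSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /**
     * Second pass: scans the gaps between existing token spans and inserts
     * a NEWLINE span for every '\n' found there. Token spans are assumed to
     * be sorted by start offset and to contain no newline characters.
     */
    public static List<TypedSpan> addNewlineSpans(String text, List<TypedSpan> tokens) {
        List<TypedSpan> out = new ArrayList<>();
        int pos = 0;
        for (TypedSpan token : tokens) {
            addNewlinesInRange(text, pos, token.start, out);
            out.add(token);
            pos = token.end;
        }
        // Newlines after the last token.
        addNewlinesInRange(text, pos, text.length(), out);
        return out;
    }

    private static void addNewlinesInRange(String text, int from, int to, List<TypedSpan> out) {
        for (int i = from; i < to; i++) {
            if (text.charAt(i) == '\n') {
                out.add(new TypedSpan(i, i + 1, "NEWLINE"));
            }
        }
    }

    public static void main(String[] args) {
        String text = "a b\nc";
        List<TypedSpan> tokens = new ArrayList<>();
        tokens.add(new TypedSpan(0, 1, "TOKEN"));
        tokens.add(new TypedSpan(2, 3, "TOKEN"));
        tokens.add(new TypedSpan(4, 5, "TOKEN"));
        System.out.println(addNewlineSpans(text, tokens));
    }
}
```

Keeping the newline handling in a separate pass leaves the tokenizer itself 
unchanged; the same typed spans could instead be emitted directly by an 
extended tokenizer.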


was (Author: joern):
The current implementation of the PTB tokenizer outputs newline tokens, but the 
OpenNLP tokenizers don't support this yet.

There are two ways of supporting this:
- Only output the tokens without newline tokens and add the newline tokens in a 
second run, e.g. by a UIMA AE
- Extend the OpenNLP tokenizer a bit and support layout tags (e.g. <NEWLINE>, 
or a span with this as the type) 

> Add TokenizerPTB
> ----------------
>
>                 Key: OPENNLP-546
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-546
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Tokenizer
>            Reporter: Pei Chen
>            Priority: Minor
>              Labels: help-wanted
>
> Add Tokenizer based on Penn Tree Bank rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
