The POS tagger and constituency parser use OpenNLP implementations; however, 
none of the stock OpenNLP models are used in cTAKES. The cTAKES models are 
trained on a combination of Penn Treebank, GENIA, and clinical data (the 
clinical portion is about 500K words). Our experiments show the best 
performance across all three corpora when the combined data is used.
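For illustration, models trained this way are loaded and applied through the 
standard OpenNLP API. Here is a minimal sketch; the model file name below is a 
placeholder, not the actual cTAKES resource name:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;

    public class PosTagDemo {
        public static void main(String[] args) throws Exception {
            // Load a serialized OpenNLP POS model (path is a placeholder).
            try (InputStream in = new FileInputStream("pos-model.bin")) {
                POSModel model = new POSModel(in);
                POSTaggerME tagger = new POSTaggerME(model);
                String[] tokens = {"Patient", "denies", "chest", "pain", "."};
                String[] tags = tagger.tag(tokens); // one PTB tag per token
                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + "/" + tags[i]);
                }
            }
        }
    }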

The Penn Treebank annotation guidelines were extended to the clinical domain 
to capture the specifics of clinical language. That work was done in 
collaboration with the LDC. The extended guidelines are available here:
http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf

Hope this helps!
--Guergana

From: Masanz, James J. [mailto:[email protected]]
Sent: Friday, October 31, 2014 10:31 AM
To: [email protected]
Subject: RE: cTakes chunking problem.

Some domain-specific data was already used in creating the POS and chunking 
models.

For info on the chunker, see
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+-+Chunker
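If you want to experiment with a chunking model outside the full pipeline, you 
can exercise it directly with the OpenNLP chunker API that cTAKES builds on. A 
minimal sketch follows; the model file name is a placeholder:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.chunker.ChunkerME;
    import opennlp.tools.chunker.ChunkerModel;

    public class ChunkDemo {
        public static void main(String[] args) throws Exception {
            // Load a serialized OpenNLP chunking model (path is a placeholder).
            try (InputStream in = new FileInputStream("chunker-model.bin")) {
                ChunkerModel model = new ChunkerModel(in);
                ChunkerME chunker = new ChunkerME(model);
                // Chunking consumes tokens plus their POS tags.
                String[] tokens = {"Patient", "denies", "chest", "pain", "."};
                String[] tags   = {"NN", "VBZ", "NN", "NN", "."};
                String[] chunks = chunker.chunk(tokens, tags); // B-NP/I-NP/... labels
                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + " " + chunks[i]);
                }
            }
        }
    }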

Tokenization is rule-based within Apache cTAKES.
The default tokenizer is described here:
http://ctakes.apache.org/apidocs/3.1.1/ctakes-core/org/apache/ctakes/core/nlp/tokenizer/TokenizerPTB.html
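To give a flavor of what "rule-based" means here, below is a toy illustration 
of PTB-style tokenization. It is not the actual TokenizerPTB code, which 
handles many more cases (contractions, hyphenation, numbers, abbreviations, 
and so on):

    import java.util.Arrays;

    public class ToyPtbTokenizer {
        // Toy rule: pad punctuation with spaces, then split on whitespace.
        static String[] tokenize(String text) {
            String padded = text.replaceAll("([.,;:?!()])", " $1 ");
            return padded.trim().split("\\s+");
        }

        public static void main(String[] args) {
            System.out.println(Arrays.toString(
                tokenize("Patient denies chest pain, SOB.")));
            // -> [Patient, denies, chest, pain, ,, SOB, .]
        }
    }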


-- James

________________________________
From: Bala Krishnan [[email protected]]
Sent: Friday, October 31, 2014 2:25 AM
To: [email protected]
Subject: cTakes chunking problem.
Hi,

I just have a couple of clarifications. cTAKES uses various open-source NLP 
libraries for sentence tokenization, POS tagging, and chunking. Can anyone tell 
me which trained models are used for POS tagging and chunking? Are they based 
on the GENIA corpus? I tried using the GENIA tagger, but it gives me different 
results from cTAKES. Can anyone suggest some ideas on incorporating 
domain-specific corpora for tagging and chunking in cTAKES?

Regards,
Prasanna
