The POS tagger and constituency parser use the OpenNLP implementations. However, no stock OpenNLP models are used in cTAKES: the cTAKES models are trained on a combination of Penn Treebank, GENIA, and clinical data (the clinical data is about 500K words). Our experiments show maximized performance across all three corpora when the combined data is used.
Penn Treebank annotation guidelines were extended to the clinical domain to capture the specifics of clinical language. That work was done in collaboration with the LDC. The extended guidelines are available here:
http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf

Hope this helps!
--Guergana

From: Masanz, James J. [mailto:[email protected]]
Sent: Friday, October 31, 2014 10:31 AM
To: [email protected]
Subject: RE: cTakes chunking problem.

Some domain-specific data was already used in creating the POS and chunking models. For info on the chunker, see
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.0+-+Chunker

Tokenization is rule-based within Apache cTAKES. The default tokenizer is described here:
http://ctakes.apache.org/apidocs/3.1.1/ctakes-core/org/apache/ctakes/core/nlp/tokenizer/TokenizerPTB.html

-- James

________________________________
From: Bala Krishnan [[email protected]]
Sent: Friday, October 31, 2014 2:25 AM
To: [email protected]
Subject: cTakes chunking problem.

Hi,

I just have a couple of clarifications. cTakes uses various open-source NLP libraries for sentence tokenization, POS tagging, and chunking. Can anyone tell me which trained models are used for POS tagging and chunking? Are they based on the GENIA corpus? I tried using the GENIA tagger, but it gives me different results from cTakes. Can anyone suggest ideas for incorporating domain-specific corpora for tagging and chunking in cTakes?

Regards,
Prasanna
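[Editor's note: the rule-based PTB-style tokenization James describes can be illustrated with a minimal sketch. This is NOT the actual cTAKES TokenizerPTB implementation (see the apidocs link in the thread); the class name and the handful of regex rules below are hypothetical, chosen only to show the kind of deterministic rules such a tokenizer applies, e.g. splitting off punctuation and PTB-style contraction splitting.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of PTB-style rule-based tokenization.
// Not the real cTAKES TokenizerPTB; just an illustration of the approach.
public class PtbStyleTokenizerSketch {

    public static List<String> tokenize(String text) {
        String s = text
            // split off commas, semicolons, and similar punctuation
            .replaceAll("([,;:!?])", " $1 ")
            // split off a sentence-final period
            .replaceAll("\\.$", " .")
            // split contractions the PTB way: doesn't -> does n't
            .replaceAll("n't\\b", " n't")
            // it's -> it 's, they'll -> they 'll, etc.
            .replaceAll("'(s|re|ve|ll|d|m)\\b", " '$1");
        List<String> tokens = new ArrayList<>();
        for (String tok : s.trim().split("\\s+")) {
            tokens.add(tok);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(
            tokenize("The patient doesn't report chest pain, but it's recurrent."));
    }
}
```

Because the rules are deterministic rather than learned, output is fully reproducible, which is why cTAKES can use a rule-based tokenizer while the POS and chunking stages use trained models.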
