+1 really interested in the reply to this :)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, September 23, 2015 at 11:07 AM
To: "[email protected]" <[email protected]>
Subject: Paragraph Chunking in cTAKES

>Hi Folks,
>
>I am looking for some feedback on the accuracy of cTAKES annotations over
>input text when that text is not made up of properly formed paragraphs.
>
>Is this known to significantly affect annotation accuracy/performance?
>
>Does anyone have a 'golden' input example of where cTAKES works best for
>annotation accuracy and performance?
>
>My situation is as follows: right now I use Apache Tika to parse a
>multitude of documents, and I feed the parse results from those documents
>into cTAKES for annotation purposes. Sometimes Tika is not able to form
>paragraphs correctly because the paragraphs are split over a page.
>
>Another example is when footer information (such as page numbers, DOIs,
>journal names, etc.) exists between pages.
>
>Thanks for any feedback.
>
>Lewis
