What about using Apache Tika within cTAKES for this? Tika supports OCR through Tesseract:
http://wiki.apache.org/tika/TikaOCR

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Hari>, Sekhar <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, April 29, 2015 at 10:11 PM
To: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
Subject: Image to text conversion

>Hello All -
>
>I am looking for an OCR ability in cTAKES. The requirement is to convert
>scanned image documents (ex: scanned hand written prescriptions) into a
>text format. Then apply the usual NLP pipeline to convert the
>unstructured text to a structured data.
>
>Can cTAKES convert scanned image documents into a text? If so, please
>help me to understand this by sharing any documents or video.
>
>Many thanks,
>Sekhar H.
>
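As a rough illustration of the Tika suggestion above, here is a minimal Python sketch that shells out to the Tika command-line app to pull text from a scanned image before handing it to an NLP pipeline. It assumes Java and Tesseract are installed and on the PATH, and that a tika-app jar has been downloaded locally. The jar path, the image name, and the helper names `tika_text_command` and `extract_text` are hypothetical and not part of cTAKES or Tika. The only Tika feature relied on is the app's `--text` option, which prints the plain-text content of a document.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def tika_text_command(tika_jar: str, image_path: str) -> List[str]:
    """Build the command that asks the Tika app for plain-text output."""
    # Assumption: tika_jar points at a locally downloaded tika-app jar.
    return ["java", "-jar", tika_jar, "--text", image_path]


def extract_text(tika_jar: str, image_path: str) -> Optional[str]:
    """Return the text Tika extracts from the image, or None if the
    required tools are not available on this machine."""
    if shutil.which("java") is None or not Path(tika_jar).is_file():
        return None
    result = subprocess.run(
        tika_text_command(tika_jar, image_path),
        capture_output=True,
        text=True,
        check=True,  # raise if Tika exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    text = extract_text("tika-app.jar", "scanned_prescription.png")
    if text is None:
        print("Java or the Tika jar was not found; nothing extracted.")
    else:
        print(text)
```

The extracted string could then be written to a plain-text file and fed to a cTAKES pipeline like any other clinical note. How well this works depends entirely on the OCR quality for the input images.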
