Hi Ravi,

Let’s keep the discussion as public as possible. I won’t share the document that you sent to my personal email account, of course. In the email stream of my life, I missed your follow-up email. Thank you for the ping and the info. I’ll take a look shortly.

From: Ravi Gadapa [mailto:[email protected]]
Sent: Wednesday, June 21, 2017 1:58 PM
To: Allison, Timothy B. <[email protected]>
Subject: Re: RE: Tesseract - OCR and Tika

Just checking to see if you have any resolution for this. Thx

Attached is the code I am using, run with the English language package against the attached file.

//
Parser autoDetectParser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
ParseContext context = new ParseContext();

TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
ocrConfig.setTesseractPath(tesseractbin);
ocrConfig.setTessdataPath(tessdataFolder);

PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
pdfConfig.setExtractUniqueInlineImagesOnly(false);

context.set(Parser.class, autoDetectParser);
context.set(TesseractOCRConfig.class, ocrConfig);
context.set(PDFParserConfig.class, pdfConfig);

log.info("OCR PARSING {} - START");
log.info("Tesseract Data path: {} install path: {}", ocrConfig.getTessdataPath(), ocrConfig.getTesseractPath());
autoDetectParser.parse(stream, handler, new Metadata(), context);
text = handler.toString();
log.info("OCR DATA {}", text);
log.info("OCR PARSING {} - END");
//

Thanks
________________________________
On Tuesday, June 20, 2017, 11:04:33 AM EDT, Allison, Timothy B. <[email protected]> wrote:

Bouncing to user@

Are you able to share the document?

How are you running OCR exactly:
1) running OCR on extracted inline images
2) rendering page and then running OCR on the rendered image

What is the quality of the image?
Are you using the right language pack for the language?

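One way to tell a Tika configuration problem from a Tesseract or language-pack problem is to run Tesseract directly on a single extracted or rendered page image and compare the output. The following is only a minimal sketch under stated assumptions: the class name `TesseractCheck`, the method `buildCommand`, and all paths passed to it are hypothetical, and it assumes a Tesseract build that accepts `--tessdata-dir` and `-l` and treats `stdout` as the output base. It is not how Tika itself invokes Tesseract.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tika): runs the tesseract binary directly
// on one image so its raw output can be compared with what Tika returns.
class TesseractCheck {

    // Assembles the command line: <bin> <image> stdout --tessdata-dir <dir> -l <lang>
    // Assumption: the installed Tesseract supports --tessdata-dir and the
    // special "stdout" output base.
    static List<String> buildCommand(String tesseractBin, String tessdataDir,
                                     String imagePath, String language) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractBin);
        cmd.add(imagePath);
        cmd.add("stdout");          // print recognized text instead of writing a file
        cmd.add("--tessdata-dir");
        cmd.add(tessdataDir);
        cmd.add("-l");
        cmd.add(language);          // e.g. "eng" for the English language pack
        return cmd;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 4) {
            System.err.println("usage: TesseractCheck <tesseractBin> <tessdataDir> <image> <lang>");
            return;
        }
        // inheritIO() forwards Tesseract's stdout/stderr to this process's console.
        Process process = new ProcessBuilder(buildCommand(args[0], args[1], args[2], args[3]))
                .inheritIO()
                .start();
        int exitCode = process.waitFor();
        System.err.println("tesseract exited with code " + exitCode);
    }
}
```

If the text printed this way is just as garbled, the cause is more likely the image itself (quality, orientation, resolution) or the language data than the Tika setup.
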
-----Original Message-----
From: Mattmann, Chris A (3010) [mailto:[email protected]]
Sent: Tuesday, June 20, 2017 10:02 AM
To: [email protected]
Cc: Ravi Gadapa <[email protected]>
Subject: Re: Tesseract - OCR and Tika

FWD’ing to the Tika list (note TO: address change)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

From: Ravi Gadapa <[email protected]>
Date: Monday, June 19, 2017 at 8:56 PM
To: "[email protected]" <[email protected]>
Subject: Tesseract - OCR and Tika

I have been using it for our project, and I seem to have a problem extracting the data from PDF documents. Below is a sample of how it extracts.

'EldAJ. iNEIWEI‘IEI ‘IVHG El‘c'l TIVHS SEIHOJJMS TIV "8 'NOILVGNEIWINOOEIEI ElElElfliOVdflNVW iNEIWdIflOEI ElElcl SV 3|in EIWVN S.J_NE|V\ld|flOE| NO GEISVEI EIEI TIVHS HOJJMS iOEINNOOSIG iNEIWdIflOEI HO:| EIZIS ElSflzl TIV 'Z 'GEliON EISIMEIEIHLO SSEI‘INH ‘EldAJ. EltlflSO‘IONEI HS VINEIN NI EIEI TIVHS SEIHOJJMS iOEINNOOSIG HOOGiflO TIV 'L

Any suggestions?

Thanks
