Hi Ravi,
  I think the problem is that the images have different orientations.  For 
whatever reason, tesseract is not correctly rotating the second tif file, 
image2.tif, even with psm=1 (automatic page segmentation with orientation and 
script detection).  When I manually extract that image, rotate it, and resave 
it, the OCR is of decent quality.
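
  For anyone who wants to script the manual rotate-and-resave step, here is a 
minimal sketch using only the JDK.  The class name ImageRotator, the method 
rotateQuarterTurns, and the choice of two quarter turns (180 degrees) in main 
are illustrative assumptions, not something taken from Tika or from the 
document in question; pick whichever rotation makes the text upright.  
Reading and writing TIFF through ImageIO assumes Java 9 or later (or an 
ImageIO TIFF plugin on the classpath).

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

// Hypothetical helper: rotates an image by multiples of 90 degrees clockwise.
class ImageRotator {

    static BufferedImage rotateQuarterTurns(BufferedImage src, int turns) {
        int t = ((turns % 4) + 4) % 4;          // normalize to 0..3
        int w = src.getWidth();
        int h = src.getHeight();
        // Odd numbers of quarter turns swap width and height.
        int dw = (t % 2 == 0) ? w : h;
        int dh = (t % 2 == 0) ? h : w;
        BufferedImage dst = new BufferedImage(dw, dh, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                switch (t) {
                    case 0:  dst.setRGB(x, y, rgb); break;
                    case 1:  dst.setRGB(h - 1 - y, x, rgb); break;          // 90 clockwise
                    case 2:  dst.setRGB(w - 1 - x, h - 1 - y, rgb); break;  // 180
                    default: dst.setRGB(y, w - 1 - x, rgb); break;          // 270 clockwise
                }
            }
        }
        return dst;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ImageRotator <input image> <output.png>");
            return;
        }
        BufferedImage in = ImageIO.read(new File(args[0]));
        if (in == null) {
            System.err.println("No ImageIO reader available for " + args[0]);
            return;
        }
        // Assumption: the image is upside down, so rotate by 180 degrees.
        ImageIO.write(rotateQuarterTurns(in, 2), "png", new File(args[1]));
    }
}
```

  The rotated image can then be run through tesseract directly to confirm 
that orientation is what is hurting the OCR quality.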
  I also got decent quality when I used strategy 2 for OCR, which is to render 
the full page as a single image and then run OCR on that:

<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="ocrStrategy" type="string">ocr_and_text</param>
            </params>
        </parser>
    </parsers>
</properties>


From: Allison, Timothy B. [mailto:[email protected]]
Sent: Wednesday, June 21, 2017 3:03 PM
To: '[email protected]' <[email protected]>
Cc: Ravi Gadapa <[email protected]>
Subject: RE: RE: Tesseract - OCR and Tika

Hi Ravi,
  Let’s keep the discussion as public as possible.  I won’t share the document 
that you sent to my personal email account, of course.
  In the email stream of my life, I missed your follow-up email.  Thank you 
for the ping and the info.  I’ll take a look shortly.

From: Ravi Gadapa [mailto:[email protected]]
Sent: Wednesday, June 21, 2017 1:58 PM
To: Allison, Timothy B. <[email protected]<mailto:[email protected]>>
Subject: Re: RE: Tesseract - OCR and Tika

Just checking to see if you have any resolution for this.

Thx


Attached is the code I am using, run with the English language pack on the 
attached file.

//
            Parser autoDetectParser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
            ParseContext context = new ParseContext();

            TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
            ocrConfig.setTesseractPath(tesseractbin);
            ocrConfig.setTessdataPath(tessdataFolder);
            PDFParserConfig pdfConfig = new PDFParserConfig();
            pdfConfig.setExtractInlineImages(true);
            pdfConfig.setExtractUniqueInlineImagesOnly(false);

            context.set(Parser.class, autoDetectParser);
            context.set(TesseractOCRConfig.class, ocrConfig);
            context.set(PDFParserConfig.class, pdfConfig);

            // No argument is supplied, so the stray "{}" placeholder is dropped.
            log.info("OCR PARSING - START");
            log.info("Tesseract Data path: {} install path: {}",
                    ocrConfig.getTessdataPath(), ocrConfig.getTesseractPath());
            autoDetectParser.parse(stream, handler, new Metadata(), context);
            text = handler.toString();
            log.info("OCR DATA {}", text);
            log.info("OCR PARSING - END");
//


Thanks




________________________________
On Tuesday, June 20, 2017, 11:04:33 AM EDT, Allison, Timothy B. 
<[email protected]<mailto:[email protected]>> wrote:


Bouncing to user@

Are you able to share the document?

How are you running OCR exactly:
1) running OCR on extracted inline images
2) rendering page and then running OCR on the rendered image

What is the quality of the image?

Are you using the right language pack for the language?

-----Original Message-----
From: Mattmann, Chris A (3010) 
[mailto:[email protected]<mailto:[email protected]>]
Sent: Tuesday, June 20, 2017 10:02 AM
To: [email protected]<mailto:[email protected]>
Cc: Ravi Gadapa <[email protected]<mailto:[email protected]>>
Subject: Re: Tesseract - OCR and Tika

FWD’ing to the Tika list (note TO: address change)


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: [email protected]<mailto:[email protected]>
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


From: Ravi Gadapa <[email protected]<mailto:[email protected]>>
Date: Monday, June 19, 2017 at 8:56 PM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Tesseract - OCR and Tika

I have been using it for our project and I seem to have a problem extracting 
data from PDF documents. Below is a sample of what it extracts.

'EldAJ. iNEIWEI‘IEI ‘IVHG El‘c'l TIVHS SEIHOJJMS TIV "8 'NOILVGNEIWINOOEIEI 
ElElElfliOVdflNVW iNEIWdIflOEI ElElcl SV 3|in EIWVN S.J_NE|V\ld|flOE| NO GEISVEI 
EIEI TIVHS HOJJMS iOEINNOOSIG iNEIWdIflOEI HO:| EIZIS ElSflzl TIV 'Z 'GEliON 
EISIMEIEIHLO SSEI‘INH ‘EldAJ. EltlflSO‘IONEI HS VINEIN NI EIEI TIVHS SEIHOJJMS 
iOEINNOOSIG HOOGiflO TIV 'L


Any suggestions?

Thanks
