Re: [tesseract-ocr] Re: Searchable PDF output with oversized font

Chris Tue, 25 Nov 2014 08:31:55 -0800

Hi,
No, I have only tried the Ubuntu version.

Here are the samples: 
https://drive.google.com/file/d/0B2kkT1CBqTPCRE1veGtQT3NvSTg/view?usp=sharing


for page in "$1"_out_*.tif; do
    tesseract -l deu -psm 3 "$page" "$page" hocr
    hocr2pdf -i "$page" -s -o "$page.pdf.bak" < "$page.hocr"
#    rm -rf "$page"
done

pdftk "$1"_out_*.tif.pdf.bak cat output "$1.tmp.pdf"
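A minimal dry-run sketch of the loop above, for checking which files each step would read and write before any OCR is run. The function names plan_page and plan_all are hypothetical; the tesseract and hocr2pdf arguments are copied from the script above, and the sketch does not address the oversized-font problem itself.

```shell
# Hypothetical dry-run helpers: print the commands instead of running them.

# Print the two commands the loop would run for a single page.
plan_page() {
    page=$1
    printf 'tesseract -l deu -psm 3 "%s" "%s" hocr\n' "$page" "$page"
    printf 'hocr2pdf -i "%s" -s -o "%s" < "%s"\n' "$page" "$page.pdf.bak" "$page.hocr"
}

# Print the commands for every page matching <prefix>_out_*.tif.
plan_all() {
    prefix=$1
    # Iterate over a glob so file names containing spaces stay intact.
    for page in "$prefix"_out_*.tif; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$page" ] || continue
        plan_page "$page"
    done
}
```

Calling plan_all "$1" in place of the real loop prints two lines per page, which makes it easy to see that each hocr2pdf call reads the .hocr file produced for the same page.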

Thank you,

Chris 



On Sunday, November 23, 2014 5:12:12 PM UTC+1, shree wrote:
>
> Have you tried a version compiled from the latest source on git?
>
> If you post a couple of sample images I can give them a try and let you 
> know what results I get.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Sun, Nov 23, 2014 at 5:00 PM, Chris <[email protected]> wrote:
>
>> Hi Ryan,
>> I ran into the same problem. Have you solved it?
>>
>> Best regards,
>>
>> Chris
>>
>>
>> On Wednesday, September 17, 2014 7:26:02 PM UTC+2, Ryan Johnson wrote:
>>>
>>> Hi all,
>>>
>>> I'm having problems with tesseract-ocr since upgrading to Ubuntu 14.04 
>>> LTS. When I use either hocr or the internal tesseract output for searchable 
>>> pdfs I get an oversized font that fills the page too quickly and does not 
>>> follow the text in the image.
>>>
>>> I scan the images as TIFFs at 300 dpi, then clean them up using 
>>> ScanTailor, which also outputs TIFFs at 300 dpi with slightly altered 
>>> dimensions. After that I perform the OCR. The output is there, but the 
>>> text layer is not aligned with the image: as stated above, the font is 
>>> too large, so the text is cut off before the end and the missing text 
>>> does not come up in a search.
>>>
>>> I'm using the stock tesseract package for Ubuntu 14.04. I tried 
>>> following the instructions to build the training packages but it errored 
>>> out.
>>>
>>> Version info:
>>> tesseract --version
>>> tesseract 3.03
>>>  leptonica-1.70
>>>   libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 
>>> 1.2.8 : webp 0.4.0
>>>
>>> Here is a sample of my script for the ocr process using the output from 
>>> ScanTailor:
>>> #!/bin/bash
>>> # Run OCR on multiple PDF files and create a new pdf with the
>>> # extracted text in hidden layer. Requires tesseract, hocr2pdf, gs.
>>> # NOTE: hocr2pdf is no longer required as of tesseract-ocr 3.03
>>> # Usage: ./makeit output.pdf
>>>
>>> set -e
>>> output="$1"
>>> dir=`pwd`
>>>
>>> # OCR each page individually and convert into PDF
>>> for page in "$dir"/*page*.tif
>>> do
>>>     base="${page%.tif}"
>>> #    tesseract "$page" "$base" -l isl hocr
>>> #    I have also tried adding -psm 4 here
>>>     tesseract "$page" "$base.pdf" -l isl
>>> #    Tesseract now outputs searchable pdf on its own
>>> #    hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
>>> done
>>>
>>> # combine the pages into one PDF
>>> gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" \
>>>     "$dir"/*page*.pdf
>>>
>>> If anybody could please point out any error I have made or provide a 
>>> solution to this problem I would be very grateful. I am trying to get a 
>>> copy of a document to a professor of mine, where the original electronic 
>>> version of the document was lost. Searchable text is a desirable attribute 
>>> of the final result for her.
>>>
>>>
>>> Regards
>>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/3bd841a9-075c-4467-b37c-74024f7ecc5b%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

