Thank you, Nick!
You helped me so much.
The attached file is a script that automates the OCR processing and
compiles the PDF...
Warm regards
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
---
#!/bin/sh
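# resolve the input to an absolute path, because we cd into a work directory below
case "$1" in
/*) y="$1" ;;
*) y="$(pwd)/$1" ;;
esac
echo "Will create a searchable PDF for $y"
x=$(basename "$y")
name=${x%.*}
# work in a directory named after the input file
mkdir -p "$name"
cd "$name" || exit 1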
# splitting to individual pages
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o %04d.jpg -f "$y"
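# process each page
for f in *.jpg; do
    base=${f%.*}
    # extract text as hOCR (documented order: image, output base, options, config)
    tesseract "$f" "$base" -l ara -psm 3 hocr
    # older Tesseract 3.x releases write $base.html; rename it so hocr-pdf finds it
    if [ -f "$base.html" ]; then mv "$base.html" "$base.hocr"; fi
done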
cd ..
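# merge the page images and hOCR files into one PDF (hocr-pdf comes from hocr-tools)
python hocr-pdf "$name" > a.pdf
# remove the work directory
rm -rf "$name"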