Thank you, Nick!
You helped me so much.
The attached file is a script that automates the OCR processing and
compiles the resulting PDF.

Warm Regards

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

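#!/bin/sh
# Creates a searchable PDF (a.pdf) from the scanned PDF given as $1.

# resolve the input to an absolute path, because we cd into a work directory below
case "$1" in
  /*) y="$1" ;;
  *)  y="$(pwd)/$1" ;;
esac
echo "Will create a searchable PDF for $y"
x=$(basename "$y")
name=${x%.*}
mkdir "$name" || exit 1
cd "$name" || exit 1
# splitting to individual pages (one 300 dpi JPEG per page)
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o %04d.jpg -f "$y"
# process each page
for f in *.jpg; do
  # extract text as hOCR (Arabic language data, automatic page segmentation)
  tesseract -l ara -psm 3 "$f" "${f%.*}" hocr
  # older Tesseract releases write the hOCR output as .html; rename it if present
  [ -f "${f%.*}.html" ] && mv "${f%.*}.html" "${f%.*}.hocr"
done

cd ..
# combine the page images and hOCR files into one PDF
python hocr-pdf "$name" > a.pdf
# remove the temporary work directory
rm -rf "$name"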
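
The script above assumes that gs, tesseract and python are all installed. A possible pre-flight check, not part of the attached script, is sketched here; the helper name missing_tools is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not in the attached script): print the name of
# every listed command that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example of how it could be used at the top of the OCR script:
#   missing=$(missing_tools gs tesseract python)
#   [ -n "$missing" ] && { echo "Missing: $missing" >&2; exit 1; }
```

Checking up front would avoid leaving a half-filled work directory behind when one of the tools is not available.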
