Re: hocr2pdf and arabic language

Nick White Mon, 27 Jan 2014 02:46:00 -0800

I just tested hocr2pdf and, amazingly, you're right: it doesn't seem
to support UTF-8, which is pretty shocking.


> maybe you can try alternative solution ;-) [1]. It was created by google(I
> think ;-) ) and there is visible contributor e-mail if it does not work :-)
> 
> https://code.google.com/p/hocr-tools/source/browse/hocr-pdf

Zdenko's correct; this is much better. As was mentioned, it isn't
documented. I'll try to correct this soon, but in the meantime, some
pointers:

- It requires the 'reportlab' package for Python. On a Debian-based
  system the appropriate package is called 'python-reportlab'.
- I had to change line 46 from
  dpi = im.info['dpi']
  to
  dpi = im.info['dpi'][0]
- It expects one .jpg and one .hocr file per page, sharing the same
  base name and sitting in the same directory. It's then run like this:
  hocr-pdf my-directory
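
Before running it, it may help to check that the directory really does
hold matching pairs. The following is only a rough sketch, not part of
hocr-pdf: the helper name find_page_pairs is hypothetical, and it
assumes lowercase '.jpg' and '.hocr' extensions as described above.

```python
import os


def find_page_pairs(directory):
    """Return (pairs, unmatched) base names for .jpg/.hocr files.

    pairs     -- base names that have both a .jpg and a .hocr file
    unmatched -- base names that have only one of the two
    """
    names = os.listdir(directory)
    # Collect base names (file name without extension) for each type.
    jpg = {os.path.splitext(n)[0] for n in names if n.endswith(".jpg")}
    hocr = {os.path.splitext(n)[0] for n in names if n.endswith(".hocr")}
    pairs = sorted(jpg & hocr)      # present as both image and hOCR
    unmatched = sorted(jpg ^ hocr)  # present as only one of the two
    return pairs, unmatched
```

Calling find_page_pairs("my-directory") and looking at the second
returned list should show any page that is missing its image or its
hOCR file.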

Hopefully that's enough to be getting along with. As I say, I'll try
to write up a basic manpage for hocr-pdf, and make the fix on line
46 general enough to be applied.
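
One way the line 46 fix might be generalised is sketched below. This is
an assumption about what a general fix could look like, not the actual
patch: the helper name horizontal_dpi and the fallback value of 300 are
both hypothetical, and it assumes the 'dpi' entry may be either a single
number or an (x, y) pair, or may be missing altogether.

```python
def horizontal_dpi(info, default=300):
    """Return a single dpi value from an image info dictionary.

    info    -- a dict such as the one found in im.info
    default -- value used when no 'dpi' entry exists (an assumption)
    """
    dpi = info.get("dpi", default)
    # Some images report dpi as an (x, y) pair; use the horizontal value.
    if isinstance(dpi, (tuple, list)):
        return dpi[0]
    return dpi
```

With that in place, line 46 could read dpi = horizontal_dpi(im.info)
and would accept both forms.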

Nick

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
