[tesseract-ocr] Convert HOCR output to HTML with positioning

2015-08-30 Thread gonx
Hi, is there a way to turn the hOCR output that Tesseract generates into a good HTML5 page, complete with the text's positioning and font style? Or is it best to just read the bbox coordinates as they are and write them out to HTML5? <div class='ocr_carea' id='block_2_8' title="bbox 1165 1335 1644 1358"><p
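
One way to approach the second option (reading the bbox coordinates and writing HTML directly) is sketched below. It is not an official Tesseract tool, only a minimal illustration. It assumes the hOCR marks words with class `ocrx_word` and a `title` of the form `bbox x0 y0 x1 y1`. The names `WordCollector` and `to_positioned_html` are hypothetical, and the font size is only guessed from the box height, since hOCR does not reliably carry font style.

```python
import re
from html import escape
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collect (bbox, text) pairs for elements with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of ((x0, y0, x1, y1), text)
        self._bbox = None  # bbox of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        m = BBOX_RE.search(a.get("title") or "")
        if a.get("class") == "ocrx_word" and m:
            self._bbox = tuple(int(v) for v in m.groups())
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: word elements are <span>s with no nested <span>.
        if tag == "span" and self._bbox is not None:
            self.words.append((self._bbox, "".join(self._text).strip()))
            self._bbox = None


def to_positioned_html(hocr, scale=1.0):
    """Return an HTML5 page with each word absolutely positioned."""
    parser = WordCollector()
    parser.feed(hocr)
    spans = []
    for (x0, y0, x1, y1), text in parser.words:
        style = (
            f"position:absolute;left:{x0 * scale:.0f}px;top:{y0 * scale:.0f}px;"
            f"font-size:{(y1 - y0) * scale:.0f}px"  # rough guess from box height
        )
        spans.append(f'<span style="{style}">{escape(text)}</span>')
    body = "\n".join(spans)
    return (
        "<!DOCTYPE html>\n"
        '<html><body style="position:relative">\n'
        f"{body}\n</body></html>"
    )
```

Passing the contents of an hOCR file to `to_positioned_html` returns a page in which every word is an absolutely positioned `<span>`. The `scale` parameter is there because bbox values are in image pixels, which are usually much larger than a comfortable screen layout.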

Re: [tesseract-ocr] Successfully installed and run Tesseract on Ubuntu, but can't find baseapi.h file to include ...

2015-08-30 Thread Sriranga(82yrsold)
Which version of Ubuntu is Tesseract installed on? Please also indicate the version of tesseract-ocr, since I want to install it on Ubuntu 15.04. On Sun, Aug 30, 2015 at 10:20 AM, fsbo.cons...@gmail.com wrote: are there different types of installations of which I have chosen the wrong one? The

Re: [tesseract-ocr] Successfully installed and run Tesseract on Ubuntu, but can't find baseapi.h file to include ...

2015-08-30 Thread fsbo . consult
Tesseract: tesseract-ocr, Installed: 3.03.02-3. Ubuntu: Ubuntu 14.04.3 LTS. Also, just to make sure I'm not missing something, is there a distinction between tesseract-ocr and tesseract? On Sunday, August 30, 2015 at 1:59:00 AM UTC-7, sriranga(82yrsold) wrote: which version of ubuntu on

[tesseract-ocr] Suggestions on running PDFs through Tesseract without losing vector graphics?

2015-08-30 Thread hmmwhatsthi...@gmail.com
Hello everyone, I have a digital copy of a book I own that was delivered to me in what might be the most inconvenient of formats - one PDF per page, with all non-image data on the page - text included - converted to vector shapes. While I can re-combine the pages together, add bookmarks/page

Re: [tesseract-ocr] Yoruba OCR

2015-08-30 Thread Victor Williamson
The links you gave me are great. I created the tiff/box pair on a Mac as follows: training/text2image --text=yor.training_text --outputbase=yor.VerdanaMedium.exp0 --font='Verdana Medium' --fonts_dir=/Library/Fonts Then I ran training as follows: tesseract yor.VerdanaMedium.exp0.tif