Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Jeff Breidenbach
Tesseract produces searchable PDF directly. If you really want to use HOCR as an intermediate format, you can but you will need external software. There are a couple of "hocr2pdf" programs floating around and "OCRMyPDF" does an admirable job tying things together. That said, going direct should
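A minimal sketch of the direct route, assuming a Tesseract build with native PDF output. The helper name `make_searchable_pdf` and the file names are hypothetical; only the `tesseract IMAGE OUTPUTBASE pdf` invocation comes from Tesseract itself.

```shell
# Sketch only: produce a searchable PDF straight from an image.
# make_searchable_pdf is a hypothetical helper name; file names are placeholders.
make_searchable_pdf() {
    image=$1      # input image, e.g. scan.png
    outbase=$2    # output base name; Tesseract appends ".pdf"
    # The trailing "pdf" selects Tesseract's PDF output config.
    tesseract "$image" "$outbase" pdf
}

# Usage (writes scan.pdf in the current directory):
# make_searchable_pdf scan.png scan
```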

[tesseract-ocr] Re: [tesseract-ocr/tesseract] Tag a new version for LSTM 4.0 (#995)

2017-08-27 Thread Jeff Breidenbach
Alexander Pozdnyakov has done a really good job packing Tesseract in his Personal Package Archive (PPA). I think it is getting to be time for wider usage, so I'm working with him to promote these to official packages. First step is Debian Experimental. That's a good place to work out problems,

Re: [tesseract-ocr] pdf -> searchable PDF

2017-01-20 Thread Jeff Breidenbach
There is a lengthy side discussion that is appropriate to move back here. I've been asked to elaborate what I mean by image extraction. https://github.com/tesseract-ocr/tesseract/issues/660 There are two ways to turn a PDF file into images. One is to render it, for example using a tool like

Re: [tesseract-ocr] makebox not working with --tessdata-dir argument

2016-08-13 Thread Jeff Breidenbach
I know from a separate email that you are using Debian GNU/Linux. The default location on Debian is /usr/share/tesseract-ocr/tessdata. Therefore you need to either 1) do your work inside /usr/share/tesseract-ocr/tessdata, or 2) copy everything in /usr/share/tesseract-ocr/tessdata to

[tesseract-ocr] Re: Tesseract For PHP error - Error in pixReadMemPng: tmpfile stream not opened

2016-07-07 Thread Jeff Breidenbach
Go ahead and take this question to the tesseract-ocr-for-php developers. From your error messages, you are running on a platform that doesn't support fmemopen. If Windows, then there is trouble with Leptonica's fallback function fopenWriteWinTempfile(). If Linux, then somehow PHP is restricting

[tesseract-ocr] Re: PDF/A versions

2016-01-15 Thread Jeff Breidenbach
My understanding is PDF/A requires a bit more metadata, for example some color profile information (ICC) and a description about where the data came from (XMP). Tesseract doesn't supply that, sorry. I have no reason to believe implementation is hard, it's just not something I'm currently

[tesseract-ocr] Re: append output file?

2016-01-15 Thread Jeff Breidenbach
There's the normal Linux way for appending things:
    tesseract image-1.png - >> results.txt
    tesseract image-2.png - >> results.txt
    tesseract image-3.png - >> results.txt
    ...
Or perhaps you are thinking about support for streaming:
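Returning to the append approach: the repeated commands can be wrapped in a loop. This is a minimal sketch, assuming a Tesseract version that accepts `-` as the output base to write recognized text to standard output, as in the commands above. The helper name `ocr_append` is hypothetical.

```shell
# Sketch only: OCR several images and append all text to one file.
# ocr_append is a hypothetical helper; usage: ocr_append OUTFILE IMAGE...
ocr_append() {
    outfile=$1
    shift
    for image in "$@"; do
        # "-" as the output base sends the recognized text to stdout,
        # and ">>" appends it to the results file.
        tesseract "$image" - >> "$outfile"
    done
}

# Usage:
# ocr_append results.txt image-1.png image-2.png image-3.png
```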

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2016-01-15 Thread Jeff Breidenbach
Hi all, I just want to mention that the copy of tesstrain.sh that ships with Ubuntu is slightly modified to make life a little easier. The very terse documentation is in the standard location: /usr/share/doc/tesseract/README.debian. The modification saves some typing. This is an example of

[tesseract-ocr] Re: Suggestions on running PDFs through Tesseract without losing vector graphics?

2015-09-04 Thread Jeff Breidenbach
But I would like to see an example PDF - one of the simpler ones - just to see how the vector graphics were done. Please do not get your hopes up. -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop

[tesseract-ocr] Re: Suggestions on running PDFs through Tesseract without losing vector graphics?

2015-09-04 Thread Jeff Breidenbach
This would be ridiculously hard to implement.

Re: [tesseract-ocr] Successfully installed and run Tesseract on Ubuntu, but can't find baseapi.h file to include ...

2015-09-03 Thread Jeff Breidenbach
sudo apt-get install tesseract-dev

Re: [tesseract-ocr] building tesseract on windows using cygwin

2015-07-21 Thread Jeff Breidenbach
Not to mention the data corruption problem on stdout. Maybe wait another week or two for anything else to come up, and then declare 3.04.01? (Just to be clear, it doesn't matter from Debian's perspective; the stdout fix has already been patched there.)

Re: [tesseract-ocr] Re: Tesseract 3.04 Build Error

2015-07-18 Thread Jeff Breidenbach
Or bake some really delicious cookies for Tom Powers, who is in charge of Leptonica for Windows.

[tesseract-ocr] Re: Building tesseract without leptonica

2015-07-17 Thread Jeff Breidenbach
Forget it. Leptonica is a core requirement and provides the primary in-memory image data structure, Pix.

[tesseract-ocr] Re: persian in tesseract-ocr

2015-07-17 Thread Jeff Breidenbach
I think 'fas' is the language code for Persian.
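A minimal sketch of checking for that language code before using it, assuming the installed Tesseract supports `--list-langs` and the `-l` option. The helper name `has_tess_lang` and the file names are hypothetical.

```shell
# Sketch only: check whether a language's traineddata is installed.
# has_tess_lang is a hypothetical helper name.
has_tess_lang() {
    # --list-langs prints one language code per line; some versions print to
    # stderr, so both streams are merged before matching the whole line.
    tesseract --list-langs 2>&1 | grep -qxF "$1"
}

# Usage:
# has_tess_lang fas && tesseract page.png page -l fas
```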

[tesseract-ocr] Re: Why is Tesseract so much more popular than Ocropus?

2015-07-17 Thread Jeff Breidenbach
Tesseract is more complete in terms of 'throw me an arbitrary document image and produce something useful'

[tesseract-ocr] Re: jbig2 encoding in PDF output file

2015-07-17 Thread Jeff Breidenbach
JBIG2 is a multipage image format, but is different from - for example - multipage TIFF because the images are not independently compressed. They share compression data, specifically a symbol dictionary. There are three possible approaches here: 1. Have Tesseract accept JBIG2 images produced

[tesseract-ocr] Re: Text output vs. PDF

2015-06-29 Thread Jeff Breidenbach
Unfortunately, I think there is nothing we can do. I've done everything I can to maximize compatibility with various PDF rendering engines, but Preview uses particularly terrible text extraction heuristics. To be fair, the root problem is the design and complexity of the PDF specification

[tesseract-ocr] Re: Tesseract 3.04 Build Error

2015-06-29 Thread Jeff Breidenbach
You need version 1.71 or later. Current leptonica release is 1.72.
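A minimal sketch of checking an installed Leptonica against that minimum before building. It assumes both versions have the form MAJOR.MINOR (extra components are ignored) and that Leptonica's pkg-config module is named `lept`. The helper name `version_at_least` is hypothetical.

```shell
# Sketch only: compare MAJOR.MINOR version strings in plain POSIX shell.
# version_at_least is a hypothetical helper: succeeds if $1 >= $2.
version_at_least() {
    have_major=${1%%.*}          # text before the first dot
    have_rest=${1#*.}            # text after the first dot
    have_minor=${have_rest%%.*}  # second component only
    need_major=${2%%.*}
    need_rest=${2#*.}
    need_minor=${need_rest%%.*}
    [ "$have_major" -gt "$need_major" ] && return 0
    [ "$have_major" -lt "$need_major" ] && return 1
    [ "$have_minor" -ge "$need_minor" ]
}

# Usage, assuming the pkg-config module name "lept":
# version_at_least "$(pkg-config --modversion lept)" 1.71 || echo "Leptonica too old"
```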

[tesseract-ocr] Re: jbig2 encoding in PDF output file

2015-06-29 Thread Jeff Breidenbach
Not available currently, and pretty major effort required to make it happen, both in Leptonica and Tesseract's PDF output module. No plans to work on this. For other formats we try hard to not re-encode during PDF generation whenever practical.

[tesseract-ocr] Re: compile error under ubuntu 14.04

2014-09-09 Thread Jeff Breidenbach
This error comes from Leptonica 1.70. Tesseract now requires Leptonica 1.71. Leptonica 1.71 can be installed manually (but not so easily) and will ship with Ubuntu for their 14.10 release scheduled for October 23 of this year.

[tesseract-ocr] Re: [tesseract-dev] Re: Training tools linking failure, icu_48::*

2014-08-01 Thread Jeff Breidenbach
Done. Bonus points if someone can remember to remove the instructions when they become obsolete in October.

Re: hocr2pdf and arabic language

2014-02-06 Thread Jeff Breidenbach
I've merged Nick White's bugfix into hocr-tools. Thank you, Nick. I expect most people will instead use the native PDF support built into Tesseract henceforth, and I intend to focus most of my time and energy there. However, there is still some use for hocr-pdf, especially when working with

Re: hocr2pdf and arabic language

2014-02-06 Thread Jeff Breidenbach
As for Arabic and other right-to-left scripts, please try using the new native PDF capability in Tesseract instead. It is significantly more sophisticated and I think it should work correctly.

Re: hocr2pdf and arabic language

2014-02-06 Thread Jeff Breidenbach
I don't know, it is up to Ray. My guess is quite soon. In any case, I just ran on your example images, noticed a small problem, and fixed it. Thank you for providing them. I should also mention that there is no need to convert your binary images to JPEG when using Tesseract's native PDF

Re: hocr2pdf and arabic language

2014-01-27 Thread Jeff Breidenbach
I am the author of the hocr2pdf utility. Thank you for the patch, I'll merge it some time next week. This week my focus is fixing some problem reports with the new native PDF output capability for Tesseract. Jeff