[Bug 623438] Re: Font size not correct in merged sandvich PDF

2015-10-10 Thread Merlin
I can confirm that Rudolf (rk-com)'s and George Chriss (gschriss)'s fix works. Thanks! -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/623438 Title: Font size not correct in merged sandvich PDF To

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2015-07-22 Thread Rudolf
Many thanks to George Chriss! (see above) My workaround based on his description: Modify the created hocr by XSLT (see below). Then use hocr2pdf 0.8.9, and the text boxes are placed (almost) correctly. $ tesseract image.tif ocr_file hocr $ xsltproc -html -nonet -novalid -o ocr_fixed.hocr
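
The workaround in this message chains three tools: tesseract, xsltproc, and hocr2pdf. A minimal sketch of that chain follows; it only builds the command lines and runs nothing. The stylesheet name, the PDF name, and the `.hocr` suffix of the Tesseract output are assumptions (the snippet is cut off before the stylesheet argument), and the `hocr2pdf -i ... -o ...` form with hOCR on standard input is how the tool is commonly invoked, not something this message shows.

```python
import shlex


def build_pipeline(image, stylesheet, pdf_out, base="ocr_file"):
    """Return (argv, stdin_file) pairs for the three steps; nothing is executed."""
    raw_hocr = base + ".hocr"      # assumption: older Tesseract releases wrote .html
    fixed_hocr = "ocr_fixed.hocr"  # output name taken from the message
    return [
        # 1. OCR the scan and write hOCR
        (["tesseract", image, base, "hocr"], None),
        # 2. Rewrite the hOCR with an XSLT stylesheet (hypothetical file name)
        (["xsltproc", "-html", "-nonet", "-novalid", "-o", fixed_hocr,
          stylesheet, raw_hocr], None),
        # 3. Merge image and text layer; hocr2pdf takes the hOCR on stdin (assumed)
        (["hocr2pdf", "-i", image, "-o", pdf_out], fixed_hocr),
    ]


for argv, stdin_file in build_pipeline("image.tif", "fix_hocr.xsl", "out.pdf"):
    line = shlex.join(argv)
    if stdin_file:
        line += " < " + shlex.quote(stdin_file)
    print(line)
```

Each pair could be handed to `subprocess.run(argv, stdin=open(stdin_file))` once the stylesheet from the full message is available.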

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2013-07-07 Thread George Chriss
Treating Comment #1 as works as intended (with a character precision limitation) and Bug #632524 as broken (font size/placement has no correlation to underlying text + out-of-bounds/missing/dog-piled text), I'm happy to report the following: While developing a new Inkscape extension to export

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2013-07-07 Thread George Chriss
Link to Inkscape Extension 'Export Image Overlay Text as hOCR' mentioned in Comment #58: https://bugs.launchpad.net/inkscape/+bug/1069248

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2013-06-30 Thread jswinner
** Changed in: exactimage (Ubuntu) Status: New => Confirmed

Re: [Bug 623438] Re: Font size not correct in merged sandvich PDF

2011-08-10 Thread Martin Wildam
On Mon, Aug 8, 2011 at 09:40, Jussi Pakkanen jussi.pakka...@canonical.com wrote: "I'd like to remind everyone that Cuneiform is currently unmaintained. No-one is working on this or any other bug." Sad, but I had such an impression already. As far as I can see the one and only OCR option for Linux

Re: [Bug 623438] Re: Font size not correct in merged sandvich PDF

2011-08-10 Thread Igor Filippov
To be fair there are also OCRAD, GOCR, and Tesseract. Igor On Wed, 2011-08-10 at 08:53 +, Martin Wildam wrote: On Mon, Aug 8, 2011 at 09:40, Jussi Pakkanen jussi.pakka...@canonical.com wrote: I'd like to remind everyone that Cuneiform is currently unmaintained. No-one is working on

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2011-08-08 Thread Jussi Pakkanen
I'd like to remind everyone that Cuneiform is currently unmaintained. No-one is working on this or any other bug.

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2011-08-07 Thread Luis Mendes
I've installed exactimage 0.8.6 from source and verified that it still can't cope with the new cuneiform hocr file format. The latest cuneiform version that still outputs the old format is 0.8.0. I had to revert to that version to get usable results. Since both cuneiform and hocr2pdf are needed to get

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2011-06-22 Thread Emmanuel Pirsch
I'm having a similar issue. I can confirm that it is not related to Cuneiform. I'm using ocropus (ocroscript recognize) (which uses Tesseract) and I have checked the resulting .html (hocr), which seems valid and pixel perfect. However, hocr2pdf misaligns the text with its related bounding boxes.

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2011-03-11 Thread Stuart Whitman
I just recently discovered this issue and wonder what the final disposition is. I read all the comments, but I am still unsure what is going to happen. Has it been determined that cuneiform is producing hocr standard compliant output and the issue is with hocr2pdf? Based on what I have read in the

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-10-17 Thread julien
Let me first summarize the cuneiform specific issues / proposed changes from Martin Wildam's conversation with Rene Rebe. 1) rev 413 to 415 completely changed the way bounding box info is written, now bbox per line and additional array of x start position, missing y height for proper font size
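
Why the missing y height matters for font size can be sketched as follows. This is only an illustration, not hocr2pdf's actual method: it assumes the box height in pixels is used as the glyph height, and the scan resolution (`dpi`) is an assumed parameter. The only fixed fact is that one point is 1/72 inch.

```python
def font_size_pt(bbox, dpi=300):
    """Approximate font size in points from a (x0, y0, x1, y1) pixel box.

    Assumes the box height stands in for the font size; dpi is the scan
    resolution and must be supplied by the caller (300 is only a default).
    """
    x0, y0, x1, y1 = bbox
    height_px = y1 - y0              # needs both y values, hence the complaint
    return height_px * 72.0 / dpi    # pixels -> inches -> points


# A 50-pixel-high line box on a 300 dpi scan
print(font_size_pt((363, 1253, 581, 1303), dpi=300))
```

With only a per-line box and an array of x start positions, a per-word or per-character height is not available, so a converter would have to fall back on the line height or a guess.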

Re: [Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-10-17 Thread Jakub Wilk
I find the specification somewhat difficult to interpret at times but it is my understanding that character bbox info goes within the ocr_line tag element. Whether it goes before or after the textual elements is irrelevant. E.g. <span class='ocr_line' id='line_18' title="bbox 363 1253 581

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-10-17 Thread julien
Jakub Wilk, as you can see in any hocr output, the span is closed; I was sloppy when I copy-pasted to the post. I have run the produced hocr output from cuneiform through http://validator.w3.org/check and it validates just fine. As for the <span class='ocr_line'...>Some text<span

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-10-17 Thread julien
I will have to change the ocr_cinfo span anyway.. to fix the whitespace bbox and also, I have noted that cuneiform occasionally gives control codes as part of the text. Not sure when I will have time to make the changes, but in any case, we could agree on what the format should be and then

Re: [Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-10-17 Thread Jakub Wilk
Example: <span class='ocr_line' id='line_1' title="bbox 0 0 45 20"><span class='ocr_xword' id='xword_1' title="bbox 0 0 20 20"><span class='ocr_cinfo' title="x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 ...">hello</span></span><span> </span><span class='ocr_xword' id='xword_2' title="bbox 25 0 45 20"><span class='ocr_cinfo'
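
A nested layout like the one in this example can be read back with a small parser. The sketch below uses Python's standard `html.parser` and collects every `bbox` found in a `title` attribute. The sample string is a simplified, completed variant of the example: the `ocr_cinfo` spans are left out and the second word ("world") is invented, because the message is cut off at that point.

```python
from html.parser import HTMLParser


class BBoxCollector(HTMLParser):
    """Collect (class, id, (x0, y0, x1, y1)) for every element with a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        title = a.get("title") or ""
        # hOCR separates several properties in one title with semicolons
        for prop in title.split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                coords = tuple(int(p) for p in parts[1:])
                self.boxes.append((a.get("class"), a.get("id"), coords))


def collect_bboxes(hocr_text):
    parser = BBoxCollector()
    parser.feed(hocr_text)
    parser.close()
    return parser.boxes


# Simplified sample; "world" is a hypothetical completion of the truncated example
SAMPLE = (
    "<span class='ocr_line' id='line_1' title=\"bbox 0 0 45 20\">"
    "<span class='ocr_xword' id='xword_1' title=\"bbox 0 0 20 20\">hello</span>"
    "<span> </span>"
    "<span class='ocr_xword' id='xword_2' title=\"bbox 25 0 45 20\">world</span>"
    "</span>"
)

for cls, ident, box in collect_bboxes(SAMPLE):
    print(cls, ident, box)
```

Because the line box and the word boxes are found the same way, the placement of the character info before or after the text does not affect this kind of reader.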

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-18 Thread jswinner
Similar problems when using Ocropus -- Font size not correct in merged sandvich PDF https://bugs.launchpad.net/bugs/623438 You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-15 Thread Ahmad Jagot
This bug also affects me. Would it be possible to add a command-line switch which allows reverting to the older bounding box format? Have downgraded to 0.8 for the time being...

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-13 Thread Martin Wildam
I have got in touch with the developer. He has a lot to do, but I sent a donation and he looked at the issue (I exchanged a few emails with him). Here is his final response so far: On Mon, Sep 13, 2010 at 10:28, Rene Rebe r...@exactcode.de wrote: Dear Martin, the problem is that the latest

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-13 Thread Yury V. Zaytsev
** Changed in: cuneiform-linux Status: Invalid => Confirmed

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-13 Thread Yury V. Zaytsev
I am not entirely convinced by his arguments about UTF-8 and whitespace (sounds like just being too lazy to adapt the parser to the hOCR specs), but the loss of information about y-coordinates, which used to be present in the output of the previous versions, sounds very much like a bug (if it's indeed

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-13 Thread Martin Wildam
How will you proceed now regarding this issue?

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-13 Thread Yury V. Zaytsev
I reopened the bug and maybe Jussi or someone who cares will have a look.

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-13 Thread Martin Wildam
Here is another note from René: On Mon, Sep 13, 2010 at 11:53, Rene Rebe r...@exactcode.de wrote: Note that I wrote the initial hOCR annotation in cuneiform, ... :-) If they desperately want to keep this new format, one could add 2 different hOCR formats, like hocr and hocr-detailed or so to

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-11 Thread Yury V. Zaytsev
I am not aware of any open source OCR software that is doing multi-column document recognition. It's more of a segmentation task, rather than recognition itself, so it should rather be implemented in a front-end, such as OCRopus. If you have a linear text flow, sandwich PDFs can be read by a

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-10 Thread Yury V. Zaytsev
** Also affects: exactimage (Ubuntu) Importance: Undecided Status: New

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-10 Thread Yury V. Zaytsev
The bug against exactimage is not going to be processed, as this package is autosynced from Debian, so the way it will work is as follows: one day someone from Ubuntu will report it against Debian, and a few years later a Debian Developer will try to report it to upstream. It is possible to change

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-10 Thread Martin Wildam
Yes, I used the company number. And I already sent them an email. So far no response. I have now followed your advice to subscribe to the mailing list and will report the issue there. We will see if this works. Thank you for your assistance.

Re: [Cuneiform] [Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-10 Thread Igor Filippov
Martin, Have you tried other OCR engines which can generate hOCR output? I'm not sure all of them can but here are a few free and open source OCR engines I've run on Linux: GOCR, OCRAD, Tesseract. Does this issue affect them as well? Best, Igor On Fri, 2010-09-10 at 11:45 +, Martin Wildam

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-10 Thread Yury V. Zaytsev
I don't understand your question. Can you formulate it using no more than 75 words?

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-10 Thread Martin Wildam
@Igor: I searched quite a while. I don't remember ocrad explicitly now, but I am quite sure I came across it. I also found at other places (blog posts) that cuneiform seems to be the only one producing hocr output. I would be glad if there were more choices. I have written a common file

Re: [Cuneiform] [Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-10 Thread Igor Filippov
Martin, I'm not using this functionality myself, so you most likely know best, but OCRAD is producing ORF output with -x command-line option. According to the README ORF file will contain bounding boxes for OCRed characters and lines. Igor On Fri, 2010-09-10 at 17:52 +, Martin Wildam wrote:

hOCR (was: [Bug 623438] Re: Font size not correct in merged sandvich PDF)

2010-09-10 Thread jsbien
On Fri, 10 Sep 2010 Martin Wildam 623...@bugs.launchpad.net wrote: "@Igor: I searched quite a while - don't remember ocrad explicitly now but I am quite sure I came across it. I also found at other places (blog posts) that cuneiform seems to be the only one producing hocr output." This was

[Bug 623438] Re: Font size not correct in merged sandvich PDF

2010-09-10 Thread Martin Wildam
I could not find any documentation about how to get the hocr output back when I tested those OCR engines, and after looking back now I can't find any documentation for ocropus or tesseract on how to produce the hocr html files.