Bug#1023273: Old version is not working

2022-11-14 Thread James R Barlow
The current maintainer of ocrmypdf and pikepdf is looking for a new
maintainer, if someone else is able.


On Sun, Nov 13, 2022 at 6:06 AM Anton Gladky  wrote:

> The newer version 14 of ocrmypdf is needed to support
> ghostscript 10.
>
> I have checked and can confirm, that 14.0.1 is working well.
>
> Regards
>
> Anton
>
>


Bug#980426: old pikepdf is blocking qpdf transition

2021-01-19 Thread James R Barlow
Jay, the patch looks good to me. I did not try to run it, but it looks like
it covers everything important.

Sean, there is no license change to pikepdf. ocrmypdf did have the license
change. If this is the holdup, ocrmypdf 10.x should work with pikepdf 2.x with a
trivial patch - the main change in pikepdf 2.x was dropping support for
Python 3.5. If you are interested in the ocrmypdf 10.x + pikepdf 2.x
combination I can test it and do any patches. pikepdf is now used by other
applications in Debian (e.g. pdfarranger).

ocrmypdf 11's d/copyright was updated to reflect the current copyright
status.

It's been half a year since the ocrmypdf license change and I have not
heard anything from Debian about it. Is there a "queue" somewhere or
someone we can prod to complete this review?

Thanks for your help,
James


Bug#976092: ocrmypd fails autopkg tests in testing

2020-11-29 Thread James R Barlow
This is almost certainly a problem with how Debian is compiling or linking
ghostscript with libjbig2dec. This error would be reproducible with:

gs -sDEVICE=pngmono -o out.png any_pdf_that_contains_a_jbig2_image.pdf

Debian's test suite for ghostscript is just a simple smoke test, so
ocrmypdf frequently uncovers problems with ghostscript.
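
A minimal sketch of scripting that check, assuming `gs` is on the PATH. The helper names (`build_gs_command`, `has_jbig2_mismatch`, `check_pdf`) are made up for illustration, and the matched string is taken from the error text in the quoted test log.

```python
import subprocess


def build_gs_command(pdf_path, out_png="out.png"):
    """Return the Ghostscript invocation given in the message above."""
    return ["gs", "-sDEVICE=pngmono", "-o", out_png, pdf_path]


def has_jbig2_mismatch(gs_output):
    """True if the output contains the jbig2dec header/library mismatch error."""
    return "incompatible jbig2dec header" in gs_output


def check_pdf(pdf_path):
    """Run gs on a PDF and report whether the mismatch error appears.

    Requires Ghostscript to be installed; not exercised here.
    """
    proc = subprocess.run(
        build_gs_command(pdf_path), capture_output=True, text=True
    )
    return has_jbig2_mismatch(proc.stdout + proc.stderr)
```

Calling `check_pdf()` on any PDF that contains a JBIG2 image should return True on an affected system.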

James

On Sun, Nov 29, 2020 at 7:42 AM Matthias Klose  wrote:

> Package: src:ocrmypdf
> Version: 10.3.1+dfsg-1
> Severity: serious
> Tags: sid bullseye
>
> ocrmypd fails autopkg tests in testing, but not in unstable. Looks like a
> missing break on some dependency?
>
> see https://ci.debian.net/packages/o/ocrmypdf
>
> [...]
> resources =
>
> PosixPath('/tmp/autopkgtest-lxc.fy1hic18/downtmp/build.Qgv/src/tests/resources')
> outdir =
> PosixPath('/tmp/pytest-of-debci/pytest-0/test_rotate_deskew_timeout0')
>
> def test_rotate_deskew_timeout(resources, outdir):
>     check_ocrmypdf(
>         resources / 'rotated_skew.pdf',
>         outdir / 'deskewed.pdf',
>         '--rotate-pages',
>         '--rotate-pages-threshold',
>         '0',
>         '--deskew',
>         '--tesseract-timeout',
>         '0',
>         '--pdf-renderer',
>         'sandwich',
>     )
>
>     correlation = check_monochrome_correlation(
>         outdir,
>         reference_pdf=resources / 'ccitt.pdf',
>         reference_pageno=1,
>         test_pdf=outdir / 'deskewed.pdf',
>         test_pageno=1,
>     )
>
>     # Confirm that the page still got deskewed
> >   assert correlation > 0.50
> E   assert 0.0 > 0.5
>
> tests/test_rotation.py:214: AssertionError
> - Captured stderr call
> -
>
> Scanning contents:   0%|  | 0/1 [00:00 Scanning contents: 100%|██| 1/1 [00:00<00:00, 256.25page/s]
>
> OCR:   0%|  | 0.0/1.0 [00:00 OCR:  50%|█ | 0.5/1.0 [00:00<00:00,  1.62page/s]
> OCR: 100%|██| 1.0/1.0 [00:00<00:00,  3.19page/s]
>
> JPEGs: 0image [00:00, ?image/s]
> JPEGs: 0image [00:00, ?image/s]
>
> JBIG2: 0item [00:00, ?item/s]
> JBIG2: 0item [00:00, ?item/s]
> -- Captured log call
> ---
> INFO ocrmypdf.builtin_plugins.tesseract_ocr:tesseract_ocr.py:136 Using Tesseract OpenMP thread limit 2
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:134 jbig2dec FATAL ERROR decoding image: incompatible jbig2dec header (0.18) and library (0.19) versions
>    Error reading a content stream. The page may be incomplete.
>    Output may be incorrect.
>    Error: File did not complete the page properly and may be damaged.
>    Output may be incorrect.
> INFO ocrmypdf._pipeline:_pipeline.py:401 with existing rotation ⇨, page is facing ⇧, confidence 0.00 - rotation appears correct
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:134 jbig2dec FATAL ERROR decoding image: incompatible jbig2dec header (0.18) and library (0.19) versions
>    Error reading a content stream. The page may be incomplete.
>    Output may be incorrect.
>    Error: File did not complete the page properly and may be damaged.
>    Output may be incorrect.
> WARNING ocrmypdf._pipeline:_pipeline.py:738 Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
> INFO ocrmypdf.optimize:optimize.py:589 Optimize ratio: 1.00 savings: 0.0%
> INFO ocrmypdf._sync:_sync.py:381 Output file is a PDF/A-2B (as expected)
>  4 failed, 240 passed, 37 skipped, 1 xfailed in 359.28 seconds
> =
> autopkgtest [08:20:25]: test test-suite: ---]
> autopkgtest [08:20:25]: test test-suite:  - - - - - - - - - - results - -
> - - - - -
>
>


Bug#962872: ocrmypdf: new major upstream version available

2020-07-19 Thread James R Barlow
Sean and Rogério,

It's easiest for everyone if the difference between upstream and packages
is as small as possible, so I've been working on removing files that are
problematic for Debian.

In recent releases I have been removing all files that were previously in
Files-Excluded, except for:
pikepdf:docs/images/save-pike.jpg - public domain image of a sign likely
produced by a government agency in Ireland
ocrmypdf:docs/logo/logo - as we previously discussed, the .svg is now the
master version of the logo, and can be edited by open source tools.

In ocrmypdf, there are no new test resources since 9.8. I believe that the
patch that drops a test in tests/test_metadata.py can also be removed -
this previously used a resource with problematic copyright status, which is
probably why it was added.

In pikepdf, there are a few synthetic files I generated, and
pikepdf:tests/resources/jbig2global.pdf is a PDF'd copy of
ocrmypdf:tests/resources/typewriter.png. The patch
disable test_icc_extract.patch can also be dropped, since the resource it
used has been replaced with an image I generated.

I updated debian/copyright in both projects at the HEAD revision (not a
tagged release). These files should reflect the current status.

I believe this means the updates shouldn't be too difficult, and also that
the -dfsg version tag could be dropped from both packages. (pikepdf is now
powerful enough that I can usually synthesize problematic constructs
instead of adding another test resource.)

James

On Sat, Jul 18, 2020 at 12:06 PM Sean Whitton 
wrote:
>
> Hello Rogério,
>
> On Mon 15 Jun 2020 at 09:13AM -03, Rogério Brito wrote:
>
> > A new major upstream version (10.0.1) of ocrmypdf was released a few days
> > ago and it is *so much faster* than the previous versions 8.x, 9.x,
> > especially during the (painful) initial step of "Scanning".
> >
> > I installed it via pip in a virtual environment and it works very well and
> > many hours of users will be saved if this new version is made available for
> > users of Debian in general.
>
> Thank you for letting me know about the speed improvements.
>
> The main thing blocking updating both pikepdf and ocrmypdf -- which I
> try to do together since upstream is the same -- is updating d/copyright
> for all the new test resources which are included.
>
> This often requires looking up licenses on commons.wikimedia.org, and
> adding new files to Files-Excluded:.
>
> Perhaps you would be interested in helping out?
>
> What you would need to do is something like `git diff --name-status
> --diff-filter=ADR v1.13.0..v1.17.2` (versions are for pikepdf) and then
> work on a patch to d/copyright.
>
> All the other parts of the packaging, including actually applying
> Files-Excluded:, I can deal with easily myself.
>
> --
> Sean Whitton
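
The `git diff --name-status` step quoted above lends itself to a small script. The following is a rough sketch only: the function names are hypothetical, the tag names are the pikepdf ones from the quoted message, and it assumes git's usual tab-separated `--name-status` output, where renames carry a score such as `R100` and list the old path before the new one.

```python
import subprocess
from collections import defaultdict


def git_diff_command(old_tag, new_tag):
    """Build the command quoted above for a given pair of tags."""
    return [
        "git", "diff", "--name-status", "--diff-filter=ADR",
        f"{old_tag}..{new_tag}",
    ]


def parse_name_status(output):
    """Group paths by status letter (A, D or R).

    For renames the last field (the new path) is recorded.
    """
    changes = defaultdict(list)
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status = fields[0][0]  # "R100" -> "R"
        changes[status].append(fields[-1])
    return dict(changes)


def changed_files(old_tag="v1.13.0", new_tag="v1.17.2"):
    """Run git in the current checkout; not exercised here."""
    proc = subprocess.run(
        git_diff_command(old_tag, new_tag),
        capture_output=True, text=True, check=True,
    )
    return parse_name_status(proc.stdout)
```

The "A" and "R" lists are the candidates for new d/copyright entries; the "D" list shows entries that may no longer be needed.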


Bug#939044: ocrmypdf: autopkgtest not compatible with new pikepdf, ghostscript and/or pytest

2019-09-09 Thread James R Barlow
Sean Whitton and I confirmed the issue still occurs with Ghostscript 9.28rc2.

I reported the issue with Ghostscript here:
https://bugs.ghostscript.com/show_bug.cgi?id=701552

On Fri, Sep 6, 2019 at 1:58 AM Jonas Smedegaard  wrote:
>
> Quoting James R Barlow (2019-09-06 10:15:59)
> > On Thu, Sep 5, 2019 at 11:57 PM Jonas Smedegaard  wrote:
> > >
> > > Quoting Sean Whitton (2019-09-06 06:20:47)
> > > > On Sat 31 Aug 2019 at 03:58PM +02, Jonas Smedegaard wrote:
> > > >
> > > > > Possibly some of the other tools use undocumented insecure
> > > > > ghostscript calls which were recently removed.
> > > > >
> > > > > To investigate that further, someone needs to extract the actual
> > > > > input (probably Postscript or PDF) and the exact command used to
> > > > > call ghostscript.
> > > >
> > > > This was indeed a problem and ocrmypdf upstream has fixed it in
> > > > the latest release.
> > >
> > > Ah, great that the cause has been located!
> > >
> > > ...and happy that my guess was correct :-)
> >
> > Not quite? ocrmypdf did not use any undocumented ghostscript calls. It
> > followed an example from Ghostscript's documentation almost verbatim
> > to generate a .ps from a template that tells Ghostscript to insert an
> > ICC profile, referenced by filename. Ghostscript 9.28 is disabling
> > access to all files from a .ps file unless safety is explicitly
> > disabled. So nothing undocumented or exploitable was happening. (But
> > it does make sense for Ghostscript to make the change.)
> >
> > It does mean any other software that uses Ghostscript to generate
> > PDF/X, PDF/E, or PDF/A is likely going to break as well with this
> > release.
>
> Thanks for the clarification - helps me not spread any further false
> information!
>
>  - Jonas
>
> --
>  * Jonas Smedegaard - idealist & Internet-arkitekt
>  * Tlf.: +45 40843136  Website: http://dr.jones.dk/
>
>  [x] quote me freely  [ ] ask before reusing  [ ] keep private



Bug#939044: ocrmypdf: autopkgtest not compatible with new pikepdf, ghostscript and/or pytest

2019-09-06 Thread James R Barlow
On Thu, Sep 5, 2019 at 11:57 PM Jonas Smedegaard  wrote:
>
> Quoting Sean Whitton (2019-09-06 06:20:47)
> > On Sat 31 Aug 2019 at 03:58PM +02, Jonas Smedegaard wrote:
> >
> > > Possibly some of the other tools use undocumented insecure
> > > ghostscript calls which were recently removed.
> > >
> > > To investigate that further, someone needs to extract the actual
> > > input (probably Postscript or PDF) and the exact command used to
> > > call ghostscript.
> >
> > This was indeed a problem and ocrmypdf upstream has fixed it in the
> > latest release.
>
> Ah, great that the cause has been located!
>
> ...and happy that my guess was correct :-)

Not quite? ocrmypdf did not use any undocumented ghostscript calls. It
followed an example from Ghostscript's documentation almost verbatim
to generate a .ps from a template that tells Ghostscript to insert an ICC
profile, referenced by filename. Ghostscript 9.28 is disabling access to
all files from a .ps file unless safety is explicitly disabled. So nothing
undocumented or exploitable was happening. (But it does make sense
for Ghostscript to make the change.)

It does mean any other software that uses Ghostscript to generate
PDF/X, PDF/E, or PDF/A is likely going to break as well with this
release.
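
To illustrate the kind of invocation affected, here is a hedged sketch of a PDF/A conversion command built in Python. It follows the general shape of the example in Ghostscript's documentation; it is not necessarily what ocrmypdf runs. The function name and the `disable_safer` parameter are invented for this sketch, and `-dNOSAFER` is shown only as one way of "explicitly disabling safety", not as a recommendation.

```python
def build_pdfa_command(input_pdf, output_pdf, pdfa_def_ps, disable_safer=False):
    """Build a Ghostscript command line for PDF/A-2 output.

    pdfa_def_ps is the PostScript template that names the ICC profile
    by filename; reading that profile is what the new restrictions block.
    """
    cmd = [
        "gs",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-dPDFA=2",
        "-sColorConversionStrategy=RGB",
        f"-sOutputFile={output_pdf}",
    ]
    if disable_safer:
        # Assumption: turning off SAFER restores file access from the .ps file.
        cmd.insert(1, "-dNOSAFER")
    cmd += [pdfa_def_ps, input_pdf]
    return cmd
```

Any tool that builds a command of this shape and relies on the template opening a file by name would be exposed to the same change.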


> They've issued another pre-release yesterday - I hope to package that
> soon, maybe today.
>
>
>  - Jonas
>
> --
>  * Jonas Smedegaard - idealist & Internet-arkitekt
>  * Tlf.: +45 40843136  Website: http://dr.jones.dk/
>
>  [x] quote me freely  [ ] ask before reusing  [ ] keep private



Bug#939530: ghostscript: 9.28rc1 regression interpreting valid PDFs

2019-09-05 Thread James R Barlow
Package: ghostscript
Version: 9.28~~rc1~dfsg-1
Severity: important
Tags: upstream

Dear Maintainer,


1. Ghostscript 9.28rc1 reports "Error reading content streams" when
interpreting PDFs considered valid by Acrobat, qpdf, verapdf and
previous versions of Ghostscript.

2. Ghostscript 9.28rc1 reports "Recursive XObjects" on
multiple-referenced images when interpreting PDFs considered valid by
Acrobat, qpdf, verapdf and previous versions of Ghostscript.

This test file demonstrates both issues.

https://github.com/jbarlow83/OCRmyPDF/raw/master/tests/resources/cardinal.pdf

A command such as the following will display the error messages:

```
$ gs -q -sDEVICE=pngmono -o _.png cardinal.pdf

 Error reading a content stream. The page may be incomplete.
Output may be incorrect.
 Error: File did not complete the page properly and may be damaged.
Output may be incorrect.
 Error: Recursive XObject detected, ignoring "Im0", object number 14
Output may be incorrect.
 Error: Recursive XObject detected, ignoring "Im0", object number 14
Output may be incorrect.
 Error: Recursive XObject detected, ignoring "Im0", object number 14
Output may be incorrect.
```

The error messages appear for other values of sDEVICE. The first
problem just displays a spurious error message. The second problem
will cause images or other objects to be suppressed from Ghostscript's
output.
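
For triage across many files, the two message types can be tallied from Ghostscript's output. This is a small sketch with a hypothetical helper name; the matched prefixes are copied from the messages shown above.

```python
from collections import Counter


def tally_gs_errors(gs_output):
    """Count the two kinds of error message described in this report."""
    counts = Counter()
    for line in gs_output.splitlines():
        line = line.strip()
        if line.startswith("Error: Recursive XObject detected"):
            counts["recursive_xobject"] += 1
        elif line.startswith("Error reading a content stream"):
            counts["content_stream"] += 1
    return counts
```

A non-zero "recursive_xobject" count is the more serious signal, since that is the case where objects are dropped from the output.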


-- System Information:
Debian Release: bullseye/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.14.116-boot2docker (SMP w/4 CPU cores)
Kernel taint flags: TAINT_OOT_MODULE
Locale: LANG=C, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE=C (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: unable to detect

Versions of packages ghostscript depends on:
ii  libc6   2.28-10
ii  libgs9  9.28~~rc1~dfsg-1

ghostscript recommends no packages.

Versions of packages ghostscript suggests:
pn  ghostscript-x  

-- no debconf information



Bug#934035: ocrmypdf: FTBFS in stretch (failing tests)

2019-08-06 Thread James R Barlow
The issue here is that we have an old version of ocrmypdf (4.3.5) with a
backported version of Ghostscript (9.26) and the latter's behavior has
changed in a way that breaks the test.

I recommend disabling the test and documenting a caveat that certain
metadata may not be preserved in output files. This is arguably a fairly
minor loss of functionality.

On Tue, Aug 6, 2019 at 3:48 AM Santiago Vila  wrote:

> Package: src:ocrmypdf
> Version: 4.3.5-3
> Severity: serious
> Tags: ftbfs
>
> Dear maintainer:
>
> I tried to build this package in stretch but it failed:
>
>
> 
> [...]
>  debian/rules build-indep
> dh build-indep --with python3,sphinxdoc --buildsystem=pybuild
>dh_testdir -i -O--buildsystem=pybuild
>dh_update_autotools_config -i -O--buildsystem=pybuild
>dh_autoreconf -i -O--buildsystem=pybuild
>dh_auto_configure -i -O--buildsystem=pybuild
> I: pybuild base:184: python3.5 setup.py config
> Skipping external program tests because of --force
> running config
>debian/rules override_dh_auto_build
> make[1]: Entering directory '/<>'
> mkdir -p debian/.debhelper
> cp -R ocrmypdf debian/.debhelper
> sed -i debian/.debhelper/ocrmypdf/__init__.py -e \
> "s|^VERSION =.*|VERSION = \"4.3.5\"|"
> PYTHONPATH=debian/.debhelper sphinx-build docs html
> Running Sphinx v1.4.9
> making output directory...
> loading pickled environment... not yet created
> building [mo]: targets for 0 po files that are out of date
> building [html]: targets for 7 source files that are out of date
> updating environment: 7 added, 0 changed, 0 removed
> reading sources... [ 14%] cookbook
> reading sources... [ 28%] errors
> reading sources... [ 42%] index
> reading sources... [ 57%] installation
> reading sources... [ 71%] introduction
> reading sources... [ 85%] languages
> reading sources... [100%] security
>
> /<>/docs/installation.rst:2: WARNING: Duplicate explicit
> target name: "docker".
> /<>/docs/installation.rst:2: WARNING: Duplicate explicit
> target name: "docker".
> looking for now-outdated files... none found
> pickling environment... done
> checking consistency... /<>/docs/installation.rst:: WARNING:
> document isn't included in any toctree
> done
> preparing documents... done
> writing output... [ 14%] cookbook
> writing output... [ 28%] errors
> writing output... [ 42%] index
> writing output... [ 57%] installation
> writing output... [ 71%] introduction
> writing output... [ 85%] languages
> writing output... [100%] security
>
> generating indices... genindex
> writing additional pages... search
> copying images... [100%] bitmap_vs_svg.svg
>
> copying static files... WARNING: html_static_path entry
> '/<>/docs/_static' does not exist
> done
> copying extra files... done
> dumping search index in English (code: en) ... done
> dumping object inventory... done
> build succeeded, 4 warnings.
> dh_auto_build -O--buildsystem=pybuild
> I: pybuild base:184: /usr/bin/python3 setup.py build
> Skipping external program tests because of --force
> running build
> running build_py
> creating /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/unpaper.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/hocrtransform.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/pdfa.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/ghostscript.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/leptonica.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/tesseract.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/main.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/__init__.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/qpdf.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/__main__.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/pageinfo.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> creating /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf/data
> copying ocrmypdf/data/sRGB.icc ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf/data
> generating cffi module
> '/<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf/lib/_leptonica.py'
> creating /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf/lib
> make[1]: Leaving directory '/<>'
>debian/rules override_dh_auto_test
> make[1]: Entering directory '/<>'
> python3 setup.py test
> Checking for tesseract >= 3.03...
> Found tesseract 3.04.01
> Checking for gs >= 9.15...
> Found gs 9.26
> Checking for unpaper >= 6.1...
> Found unpaper 6.1
> Checking for qpdf >= 5.0.0...
> Found qpdf 6.0.0
> running pytest
> running egg_info
> creating ocrmypdf.egg-info
> writing requirements to ocrmypdf.egg-info/requires.txt
> writing ocrmypdf.egg-info/PKG-INFO
> writing top-level names to ocrmypdf.egg-info/top_level.txt
> writing entry points to ocrmypdf.egg-info/entry_points.txt
> writing dependency_links to 

Bug#930761: Bug#894068: ocrmypdf: New dependency on PyMuPDF for v6.0.0

2019-06-20 Thread James R Barlow
I should mention that ocrmypdf no longer has an immediate use for PyMuPDF -
I went the route of creating pikepdf, Python bindings for qpdf, and this
library is now in Debian thanks to Sean's work on packaging it. I don't
think I would use PyMuPDF if it were available. pikepdf overlaps the core
of PyMuPDF but is not a competing library, offering low level PDF
manipulation and much less in the way of content generation.

As such the original basis for this ticket is gone. I just want to make
sure people aren't putting effort into this ticket for ocrmypdf alone. If
other people want to see PyMuPDF added to Debian for their own reasons,
that's quite fine.

-James

On Wed, Jun 19, 2019 at 10:28 PM Sean Whitton 
wrote:

> [updating CC to point to the newly-filed RFP]
>
> Hello Johannes,
>
> Thank you again for looking into this.
>
> On Sat 08 Jun 2019 at 09:52pm +0200, Johannes Schauer wrote:
>
> > after my struggles in #930212 I now figured out how to compile stuff against
> > the static library in libmupdf-dev. As a result I managed to package PyMuPDF.
> > You can see the result here:
> >
> > https://salsa.debian.org/python-team/modules/pymupdf
> >
> > It's mostly Lintian-clean and just waiting for somebody who feels like
> > maintaining it in the future. :)
>
> I've had a look at your repo.  I've got a few questions and comments
> (relevant to whomever wants to take ownership of the ITP and upload this
> to NEW).
>
> A tool called SWIG, with which I'm unfamiliar, was used to generate
> fitz/fitz.py from the files fitz/*.i, but this tool does not get run
> during the package build.  There could be updates to SWIG, including
> security updates, which would change fitz.py.  It seems to me that we
> want to run SWIG during the package build to ensure that fitz.py
> reflects all improvements made to SWIG since pymupdf upstream ran that
> tool when releasing pymupdf.
>
> Secondly, how do you foresee us triggering binNMUs to rebuild this
> package when the static library in libmupdf-dev changes?  We would need
> to be sure that this package gets rebuilt if a security update was made
> to libmupdf-dev, for example, or the vulnerable version of mupdf would
> still be present in this package.  PDF libraries tend to get CVEs, to
> put it mildly.  I'm worried we've the same sort of problem as discussed
> in #928227.
>
> Finally, I noticed that the project looks to be GPL-3 not GPL-3+, though
> I haven't looked through every file in the source.  I also haven't
> thought carefully about the implications of statically linking a project
> that's under the AGPL.  I think that we can do it, but I'm not sure
> quite what license the binary package will end up with, and quite how to
> document that in d/copyright.
>
> --
> Sean Whitton
>


Bug#908937: Info received (Bug#908937: ghostscript breaks ocrmypdf autopkgtest)

2018-09-16 Thread James R Barlow
v6.2.4 is now available.

On Sun, 16 Sep 2018 at 14:06 Debian Bug Tracking System <
ow...@bugs.debian.org> wrote:

> Thank you for the additional information you have supplied regarding
> this Bug report.
>
> This is an automatically generated reply to let you know your message
> has been received.
>
> Your message is being forwarded to the package maintainers and other
> interested parties for their attention; they will reply in due course.
>
> Your message has been sent to the package maintainer(s):
>  Sean Whitton 
>
> If you wish to submit further information on this problem, please
> send it to 908...@bugs.debian.org.
>
> Please do not send mail to ow...@bugs.debian.org unless you wish
> to report a problem with the Bug-tracking system.
>
> --
> 908937: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908937
> Debian Bug Tracking System
> Contact ow...@bugs.debian.org with problems
>


Bug#908937: ghostscript breaks ocrmypdf autopkgtest

2018-09-16 Thread James R Barlow
Yes, I can backport the 7.x fixes to the 6.x release branch upstream. The
result is a loss of the ability to set or retain PDF metadata fields other
than pure ASCII. All characters that can't be represented in ASCII are
getting replaced with '?'.
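
The effect is comparable to a lossy ASCII round trip, which can be reproduced in Python as below. This only mimics the observed outcome; it is not Ghostscript's code, and the function name is made up.

```python
def to_ascii_lossy(value):
    """Replace every character that ASCII cannot represent with '?'."""
    return value.encode("ascii", errors="replace").decode("ascii")


# "Caf\u00e9" is "Café" with a precomposed e-acute.
print(to_ascii_lossy("Caf\u00e9"))  # prints Caf?
```

Pure-ASCII metadata passes through unchanged, which is why only some documents are affected.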

(As far as I can tell, the Ghostscript commit was intended to remove Unicode
metadata in DOCINFO for PDF/A-1 only, but it removed it for PDF/A-2 and A-3
as well, which was likely not what they intended. I will report that
separately in their bug tracker. It doesn't need to be dealt with in
Debian.)


On Sun, 16 Sep 2018 at 12:16 Sean Whitton  wrote:

> Hello,
>
> On Sun 16 Sep 2018 at 08:40PM +0200, Paul Gevers wrote:
>
> > Dear Sean,
> >
> > On 16-09-18 20:30, Sean Whitton wrote:
> >> Paul: does preventing regressions in testing take precedence?
> >
> > Normally, yes temporarily (we are not blocking yet), but ghostscript was
> > uploaded with urgency high. That means that regressions are ignored and
> > without an RC bug, ghostscript will migrate to testing tomorrow (if my
> > counting is correct).
>
> Thanks.  I should have been clearer that I was asking about /all/
> regressions in testing, rather than just autopkgtests.
>
> >> If so,
> >> this bug should be reassigned to ghostscript and raised to RC severity.
> >
> > I don't follow what you mean by this sentence.
>
> I meant that if preventing a test suite failure in testing took
> priority, we would want to stop ghostscript from migrating.  Anyway,
> it's moot.
>
> > ocrmypdf can stay in testing as long as it doesn't have an RC bug itself.
> >
> > So just to make it clear: if this change is making ocrmypdf totally
> > unusable and you still want ghostscript to migrate to testing to fix
> > multiple CVEs, then assigning this bug to ocrmypdf and raising it to RC
> > level will start the autoremoval process. If you think it is worth
> > searching for a solution in ghostscript to avoid it breaking ocrmypdf,
> > then reassign this bug to ghostscript and raise the severity to RC level
> > to avoid migration.
>
> Right.
>
> I don't know if OCRmyPDF in testing counts as RC-buggy right now,
> because I don't know the ramifications of the GhostScript changes; I'm
> going to wait on upstream.
>
> --
> Sean Whitton
>


Bug#904185: img2pdf/0.3.0-1 appears to break ocrmypdf/6.2.2-2 autopkgtest: unexpected exit code

2018-08-01 Thread James R Barlow
Fixed in upstream 6.2.3. ocrmypdf will reject images that contain an alpha
channel if performing image to PDF conversion.
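
As an illustration of what such a check involves, here is a standard-library sketch that detects an alpha channel in a PNG by reading its IHDR header. It is not ocrmypdf's actual implementation; the function name is hypothetical, it handles PNG only, and it ignores palette transparency stored in a tRNS chunk.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_has_alpha_channel(data):
    """Return True if the PNG colour type includes an alpha channel."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # IHDR is always the first chunk: 4-byte length, 4-byte type, 13 data bytes.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("malformed PNG: IHDR chunk not found")
    _width, _height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    # Colour type 4 is greyscale+alpha, 6 is RGB+alpha.
    return color_type in (4, 6)
```

A caller would read the first 26 bytes of the input file and refuse image-to-PDF conversion when this returns True.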


On Sat, 21 Jul 2018 at 03:30 Johannes Schauer  wrote:

> Control: reassign -1 src:ocrmypdf
>
> Quoting Paul Gevers (2018-07-21 10:42:51)
> > With a recent upload of img2pdf the autopkgtest of ocrmypdf started to fail
> > in unstable and testing. I copied the error below.
> >
> > Currently this regression is delaying the migration of img2pdf to
> > testing by 13 days. Could you please investigate the situation and
> > determine which package needs to fix something, and reassign appropriately?
>
> img2pdf now refuses to work on input that it cannot process without loss of
> information. PDF is not able to retain transparency information. That's why
> img2pdf refuses to work on input that contains an alpha channel. If you really
> want to process an image that has an alpha channel, then you have to remove it
> yourself before passing the image to img2pdf.
>
> This is a change that ocrmypdf has to implement.
>
> Thanks!
>
> cheers, josch
>


Bug#903627: ocrmypdf: contains workaround for old version of python3-ruffus which should not be used with current python3-ruffus

2018-07-12 Thread James R Barlow
I backported the fixes related to python3-ruffus 2.7, python 3.7 support,
and a few other minor changes from 7.0.0. I released it just now as 6.2.2,
so that should take care of it. Let me know if there are any further issues.


On Thu, 12 Jul 2018 at 01:03 Sean Whitton  wrote:

> Package: ocrmypdf
> Version: 6.2.0-1
> Severity: serious
> Tags: ftbfs
> X-debbugs-cc: j...@purplerock.ca
>
> OCRmyPDF contains a workaround for a bug in python3-ruffus <=2.6.3 that
> upstream reports should not be used with python3-ruffus >=2.7 (see
> changelog entry for 4.1.2-1 upload).
>
> python3-ruffus 2.7 was just uploaded to Debian, so ocrmypdf is now
> buggy, and indeed unbuildable.
>
> The current upstream release of OCRmyPDF, 7.0.0, will not be reaching
> Debian unstable for some time: a new dependency, pikepdf, will target
> experimental.  So ideally we would patch the workaround out of OCRmyPDF
> 6.2.0.  CCing upstream to request advice on how to do this.
>
> --
> Sean Whitton
>


Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2018-04-25 Thread James R Barlow
Great to see the most recent test run passed, even if it is with liberal
application of "expect failure". The Canonical powers that be should be
appeased for the moment. I appreciate the last minute effort to get this,
and Tesseract, into the next Ubuntu.

Thinking about the failures, I suspect that the endian issues are now
within Tesseract not Leptonica. test_deskew passes, and this test skips
Tesseract entirely. It uses Leptonica to deskew a monochrome image and
confirm it was deskewed. I think it's extremely unlikely all that bit
twiddling would work if Leptonica were in the wrong endian. (Although there
could be individual Leptonica APIs that might not work on big endian.)

The failure that surprised me is "test_tesseract_config_notfound". It
passes Tesseract a configuration file that doesn't exist, but it turns out
Tesseract proceeds with OCR rather than aborts in this case, so this isn't
informative.

Based on the failures I suspect the following command line will exit with
SIGSEGV:
  tesseract -l eng -c textonly_pdf=1 --user-words wordlist.txt tests/resources/crom.png out pdf txt

where wordlist.txt is a file containing some words separated by newlines
and tests/resources/crom.png is distributed with OCRmyPDF. (If it does work
the PDF will be a blank page containing text with no image.)
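
A sketch of running the command above from Python and checking for a crash follows. The helper names are invented; the only assumption about behaviour is the POSIX convention that `subprocess` reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess


def build_tesseract_command(image, wordlist, out_base="out"):
    """Reproduce the command line given above."""
    return [
        "tesseract", "-l", "eng", "-c", "textonly_pdf=1",
        "--user-words", wordlist, image, out_base, "pdf", "txt",
    ]


def died_from_segfault(returncode):
    """True if a subprocess return code means the child received SIGSEGV."""
    return returncode == -signal.SIGSEGV


def run_check(image="tests/resources/crom.png", wordlist="wordlist.txt"):
    """Requires tesseract to be installed; not exercised here."""
    proc = subprocess.run(build_tesseract_command(image, wordlist))
    return died_from_segfault(proc.returncode)
```

If `run_check()` returns True on s390x but False on amd64, that would support the suspicion that the endian problem is in Tesseract.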

Most of the failing tests have something to do with setting non-default
configuration variables for Tesseract.


Bug#894068: ocrmypdf: New dependency on PyMuPDF for v6.0.0

2018-03-30 Thread James R Barlow
Hello Sean,

As promised ocrmypdf v6.1.2 makes pymupdf optional but recommended. My
continuous integration tests check with and without pymupdf.

The only major regression without pymupdf is that with all of:
1) an input file containing a mix of scanned and born digital files
2) --skip-text (not default)
3) --output-type pdf (not default)
the output file can grow extremely large compared to the input. Past
versions of ocrmypdf have had this issue for a long time, and now it will
produce a warning.

So it should be ready for Debian.

Thanks.


On Mon, 26 Mar 2018 at 14:30 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Mon, Mar 26 2018, James R Barlow wrote:
>
> > Thanks for the information. That's a worryingly high wall to climb and
> > I'm concerned about implications for other platforms as well.
> >
> > I would appreciate if you can see about getting an exception, but I
> > think I will change PyMuPDF to an optional but recommended dependency
> > fairly soon.
>
> That would be great in the meantime.
>
> > I haven't made a major investment in it as yet with new code, but it
> > does provide some powerful features that would be a major engineering
> > effort to replicate and are likely not going to materialize in another
> > open source library anytime soon. (Specifically: incremental updates,
> > safe editing of PDF/A, PDF object garbage collection, fast
> > rasterizing, robust text extraction.) The most commonly used Python
> > PDF library, PyPDF2, is essentially unmaintained and in poor shape.
>
> Having thought some more, I think our best bet will be to try to get
> pymupdf to support linking against the static version of mupdf.  We have
> techniques in Debian to deal with security updates in that case (called
> binNMUs if you want to look them up).
>
> --
> Sean Whitton
>


Bug#894068: ocrmypdf: New dependency on PyMuPDF for v6.0.0

2018-03-26 Thread James R Barlow
Thanks for the information. That's a worryingly high wall to climb and I'm
concerned about implications for other platforms as well.

I would appreciate if you can see about getting an exception, but I think I
will change PyMuPDF to an optional but recommended dependency fairly soon.
I haven't made a major investment in it as yet with new code, but it does
provide some powerful features that would be a major engineering effort to
replicate and are likely not going to materialize in another open source
library anytime soon. (Specifically: incremental updates, safe editing of
PDF/A, PDF object garbage collection, fast rasterizing, robust text
extraction.) The most commonly used Python PDF library, PyPDF2, is
essentially unmaintained and in poor shape.


On Mon, 26 Mar 2018 at 10:32 Sean Whitton <spwhit...@spwhitton.name> wrote:

> control: reassign -1 wnpp
> control: forcemerge 841404 -1
>
> Dear James,
>
> On Mon, Mar 26 2018, James R. Barlow wrote:
>
> > In v6.0.0, which addresses and hopefully fixes #888917, I have introduced a
> > new dependency on PyMuPDF (Python bindings for MuPDF).  Unfortunately PyMuPDF
> > isn't available in Debian as yet (I have checked there is no python3-pymupdf).
>
> Thank you for working on that bug!
>
> > The build procedure should go like this:
> >
> >   - download/unpack MuPDF to mupdf/
> >   - download/unpack PyMuPDF to pymupdf/
> >   - cp pymupdf/fitz/_mupdf_config.h mupdf/include/mupdf/fitz/config.h
> >   - export CFLAGS=-fPIC
> >   - make HAVE_X11=no HAVE_GLFW=no HAVE_GLUT=no
> >   - patch pymupdf/setup.py to point library_dirs and include_dirs to the
> > output of mupdf/ build
> >
> > The reason for this circumlocution is that the vendor of MuPDF, Artifex,
> > does not provide or support dynamic libraries or a stable ABI, and
> > compiling the Python bindings requires a dynamic library.  Perhaps as a
> way
> > to warn people about their stance, they don't enable -fPIC by default and
> > link their application statically.
> >
> > This means that unfortunately, one cannot link to libmupdf-dev (and
> > actually, I'm not sure if libmupdf-dev serves any purpose at all, unless
> > it were rebuilt with -fPIC).  Certainly if the maintainers of this
> > package could be persuaded to build it with -fPIC that would make this
> > much easier.
> >
> > I did try to build it on Debian sid against the libmupdf-dev
> > library. The error, as with Ubuntu, is:
> >   relocation R_X86_64_PC32 against symbol `fz_empty_irect' can not be
> > used when making a shared object; recompile with -fPIC
>
> If we followed this build procedure then there would be two copies of
> mupdf in Debian: in the mupdf package and in the pymupdf package.  This
> is disallowed by Debian Policy 4.13 because it makes it harder to apply
> security updates.  And for any libraries that read PDFs, security is a
> real concern (see the mupdf bug list: there have been several CVEs).
>
> Further, Policy 10.2 explicitly forbids building static libraries with
> -fPIC, unless an exception is thought warranted in discussion on our main
> development mailing list.
>
> I think this might be a case where we can argue for that exception, but
> I'm going to need to do some reading before I can be sure that that's
> what we should do.  So unfortunately OCRmyPDF is probably going to be
> out-of-date in Debian for a while.
>
> > I'm happy to help with the packaging of this dependency, and I got the
> > process working for Python binary wheels already.  However, I don't
> really
> > know much about Debian processes and policy.
>
> Thanks.  The best place to start would probably be our libraries policy,
> section 10.2 of Debian Policy.[1]  And the following relevant bugs:
> #617253, #719351, #841403.
>
> [1]  https://www.debian.org/doc/debian-policy/
>
> --
> Sean Whitton
>


Bug#888917: ocrmypdf fails to run it's testsuite

2018-03-26 Thread James R Barlow
v6.0.0 should fix this issue, as it includes a cache that allows most OCR
to be skipped.


Bug#894068: ocrmypdf: New dependency on PyMuPDF for v6.0.0

2018-03-25 Thread James R. Barlow
Package: ocrmypdf
Version: v6.0.0
Severity: serious
Tags: newcomer
Justification: fails to build from source (but built successfully in the past)

Dear Sean,

In v6.0.0, which addresses and hopefully fixes #888917, I have introduced a
new dependency on PyMuPDF (Python bindings for MuPDF).  Unfortunately PyMuPDF
isn't available in Debian as yet (I have checked there is no python3-pymupdf).

The build procedure should go like this:

  - download/unpack MuPDF to mupdf/
  - download/unpack PyMuPDF to pymupdf/
  - cp pymupdf/fitz/_mupdf_config.h mupdf/include/mupdf/fitz/config.h
  - export CFLAGS=-fPIC 
  - make HAVE_X11=no HAVE_GLFW=no HAVE_GLUT=no
  - patch pymupdf/setup.py to point library_dirs and include_dirs to the
output of mupdf/ build

The reason for this circumlocution is that the vendor of MuPDF, Artifex, 
does not provide or support dynamic libraries or a stable ABI, and 
compiling the Python bindings requires a dynamic library.  Perhaps as a way
to warn people about their stance, they don't enable -fPIC by default and
link their application statically.

This means that unfortunately, one cannot link to libmupdf-dev (and 
actually, I'm not sure if libmupdf-dev serves any purpose at all, unless 
it were rebuilt with -fPIC).  Certainly if the maintainers of this 
package could be persuaded to build it with -fPIC that would make this 
much easier.

I did try to build it on Debian sid against the libmupdf-dev
library. The error, as with Ubuntu, is:
  relocation R_X86_64_PC32 against symbol `fz_empty_irect' can not be 
used when making a shared object; recompile with -fPIC

The make options and replacement of the header file in mupdf are all 
disabling features unnecessary for PyMuPDF's purposes. It shrinks the 
binary from 30 MB to 3 MB.
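
The steps above can be sketched as a small Python helper. This is only an
illustration under stated assumptions: the function names mupdf_build_plan
and run_build are hypothetical, the mupdf/ and pymupdf/ directories are
assumed to be already unpacked side by side, and the final step of patching
pymupdf/setup.py is not covered.

```python
import os
import shutil
import subprocess
from pathlib import Path

# Make options taken from the build procedure described above.
MAKE_FLAGS = ["HAVE_X11=no", "HAVE_GLFW=no", "HAVE_GLUT=no"]


def mupdf_build_plan(mupdf_dir, pymupdf_dir):
    """Return (config_src, config_dst, make_argv, env_overrides).

    Hypothetical helper: it only describes the steps, it runs nothing.
    """
    config_src = Path(pymupdf_dir) / "fitz" / "_mupdf_config.h"
    config_dst = Path(mupdf_dir) / "include" / "mupdf" / "fitz" / "config.h"
    make_argv = ["make"] + MAKE_FLAGS
    env_overrides = {"CFLAGS": "-fPIC"}
    return config_src, config_dst, make_argv, env_overrides


def run_build(mupdf_dir="mupdf", pymupdf_dir="pymupdf"):
    """Replace MuPDF's config header, then build MuPDF with -fPIC."""
    src, dst, argv, env_overrides = mupdf_build_plan(mupdf_dir, pymupdf_dir)
    shutil.copyfile(src, dst)
    env = dict(os.environ)
    env.update(env_overrides)
    subprocess.run(argv, cwd=mupdf_dir, env=env, check=True)
```

Calling run_build() performs the header copy and the make invocation;
mupdf_build_plan() alone can be used to inspect what would be run.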

The PyMuPDF developers describe their build process here:
https://github.com/rk700/PyMuPDF/wiki/Ubuntu-Installation-Experience

I'm happy to help with the packaging of this dependency, and I got the
process working for Python binary wheels already.  However, I don't really
know much about Debian processes and policy.

Regards,
James

-- System Information:
Debian Release: buster/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.4.119-boot2docker (SMP w/1 CPU core)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968), LANGUAGE=C 
(charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /usr/bin/dash
Init: unable to detect

Versions of packages ocrmypdf depends on:
pn  ghostscript   
pn  icc-profiles-free 
pn  liblept5  
ii  python3   3.6.5~rc1-1
pn  python3-cffi-backend-api-max  
pn  python3-cffi-backend-api-min  
pn  python3-img2pdf   
pn  python3-pil   
ii  python3-pkg-resources 39.0.1-1
pn  python3-pypdf2
pn  python3-reportlab 
pn  python3-ruffus
pn  qpdf  
pn  tesseract-ocr 
ii  zlib1g    1:1.2.8.dfsg-5

Versions of packages ocrmypdf recommends:
pn  unpaper  

Versions of packages ocrmypdf suggests:
pn  img2pdf  
pn  ocrmypdf-doc 
pn  python-watchdog  



Bug#888917: ocrmypdf fails to run it's testsuite

2018-02-15 Thread James R Barlow
Tesseract 4 is now in Debian unstable. When running on a processor that
lacks the AVX2 extensions (added at the Intel Haswell microarch, around
2013), it falls back on a slower version in SSE or something, which is so
much slower that it regularly hits the timeout. (Some of these failures are
less than graceful and I will fix that.)

The reason I think this is the case is that ci-worker[01,02].debian.net and
Sean's laptop consistently fail, and they fail on different tests at
different times all by hitting timeouts. My 2013 desktop which has a
Haswell processor and Matthias' new laptop are fine.

For the CI workers I looked over every pass and failure back to Jan 30, and
every test log that fails had worker 01 or 02. It wouldn't be surprising
for the lowest numbered boxes to be the oldest ones.
https://ci.debian.net/packages/o/ocrmypdf/unstable/amd64/
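
One way to check whether a given x86 Linux box falls into the slow group is
to look for the avx2 flag in /proc/cpuinfo. The sketch below is an
illustration only; the function names are hypothetical, and it assumes the
usual "flags : ..." line format of /proc/cpuinfo on x86 Linux.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Collect CPU feature flags from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_avx2(cpuinfo_path="/proc/cpuinfo"):
    """Return True if the CPU advertises AVX2, False if not or unknown."""
    try:
        text = Path(cpuinfo_path).read_text()
    except OSError:
        # Not Linux, or the file is unreadable: report AVX2 as absent.
        return False
    return "avx2" in cpu_flags(text)
```

A test suite could use has_avx2() to choose a longer timeout on older
processors rather than failing outright.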

To confirm I compiled a version of Tesseract 4 with AVX2 disabled and using
the "best quality" training set. Results were as follows (ratios being
relevant).

Tesseract 4, AVX2, best quality training data: 5s
Tesseract 4, AVX2 disabled, best quality training data: 32s
Tesseract 4, AVX2 disabled, fast training data: 10s
Tesseract 3.05: 4s
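
Since the ratios are what matter, here they are worked out from the four
timings above, using the AVX2 run of Tesseract 4 as the baseline. The
dictionary labels are shorthand chosen for this sketch.

```python
# Timings in seconds, copied from the list above.
timings = {
    "tess4 avx2 best": 5,
    "tess4 no-avx2 best": 32,
    "tess4 no-avx2 fast": 10,
    "tess3.05": 4,
}

baseline = timings["tess4 avx2 best"]

# Slowdown factor of each configuration relative to the baseline.
ratios = {name: seconds / baseline for name, seconds in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

Without AVX2, the best quality training data is 6.4 times slower than the
baseline, and the fast training data is 2.0 times slower.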

So I will need to fix this because the test suite should be consistent even
if Tesseract isn't. I'll revise how the existing test cache works so that I
can ship precalculated OCR files with it.



On Mon, 12 Feb 2018 at 18:24 Sean Whitton <spwhit...@spwhitton.name> wrote:

> control: retitle -1 Test suite failures
>
> Hello James,
>
> On Fri, Feb 02 2018, James R. Barlow wrote:
>
> > Do you think you could take a few minutes to identify which test is
> > taking this long and report it? This may be an upstream bug, if some
> > input triggers an infinite loop.
>
> I ran the test suite on one of Debian's machines, in an up-to-date
> Debian unstable chroot.  It took 100 minutes and there were many
> failures.  Some of the tests failed due to timeouts, and some of them
> failed for other reasons.  I'm attaching the full log.
>
> I see you have released 5.6.0, but from the release notes it seems
> likely there would be the same failures.
>
> Please let me know if you still need me to run individual tests and see
> how long they take.
>
> > I have my suspicions. My guess is that:
> >
> > pytest tests/test_qpdf.py # will never finish
> >
> > and
> >
> > pytest -n0 tests/test_qpdf.py # will fail in 15 seconds
> >
> > If so, you might have qpdf < 7.0.0 and upgrading to qpdf >= 7.0.0 will
> > fix it.
>
> We have qpdf 7.1.1 in Debian unstable right now, so this can't be it.
>
> --
> Sean Whitton
>


Bug#888917: ocrmypdf fails to run it's testsuite

2018-02-02 Thread James R Barlow
Hello Sean,

On Wed, 31 Jan 2018 22:06:42 -0700 Sean Whitton 
wrote:
> I further suspect that the test suite took 30 seconds only because so
> many tests failed early.  In recent upstream versions, the test suite
> has never finished running on my laptop after leaving it for multiple
> hours.  When you run the test suite on a totally ordinary file system,
> please report how long it takes, and whether your laptop is very
> new/high spec.

Do you think you could take a few minutes to identify which test is taking
this long and report it? This may be an upstream bug, if some input
triggers an infinite loop.

I have my suspicions. My guess is that:

pytest tests/test_qpdf.py # will never finish

and

pytest -n0 tests/test_qpdf.py # will fail in 15 seconds

If so, you might have qpdf < 7.0.0 and upgrading to qpdf >= 7.0.0 will fix
it.
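
The version comparison can be sketched as follows. The helper names are
hypothetical; the version text is assumed to come from something like the
output of "qpdf --version", and the regular expression simply takes the
first dotted triple it finds, so the exact wording of that output does not
matter.

```python
import re


def parse_version(text):
    """Extract the first X.Y.Z version found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no X.Y.Z version found in %r" % (text,))
    return tuple(int(part) for part in match.groups())


def qpdf_is_new_enough(version_text, minimum=(7, 0, 0)):
    """Return True if the reported qpdf version is at least minimum."""
    return parse_version(version_text) >= minimum
```

Comparing tuples of integers avoids the string-comparison trap where
"10.0.0" would sort before "7.0.0".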

But I'd appreciate if you can confirm.

Thanks.


Bug#888917: ocrmypdf fails to run it's testsuite

2018-01-31 Thread James R Barlow
Upstream here.

The reason the suite fails like that is that mandatory-for-testing
dependencies were also removed.

The test suite runs on Travis CI in 10-12 minutes. On Debian CI, 15
minutes. For comparison ffmpeg, another compute intensive CLI program,
takes 10 minutes.

This is an OCR program and OCR takes a long time. There are opportunities
to speed up testing on my end but no low hanging fruit without removing
tests. I've done the obvious: use all cores, use caches and dummies where
possible. Some OCR on the fly is essential because Tesseract is complex
enough that output is not identical across platforms.

Preserving the dynamically created tests/cache/ folder between test runs,
if possible in Debian CI, would speed it up a lot.

I could mark a subset of essential tests for packagers so that Debian CI
can specify it only wants those. There are a number of tests that are very
unlikely to pass upstream testing (macOS and Ubuntu) then somehow fail
downstream in Debian.


Bug#849094: liblept5: Broken on s390x (+ other big endian archs)

2016-12-28 Thread James R Barlow
I'm the ocrmypdf upstream author.

First, be aware that the output of OCR and autorotate is cached in the test
suite and the results are persisted between test cases and runs of the test
suite in the tests/cache folder. The cache hit/miss check is not smart
enough to pick up changes that aren't reflected in leptonica's version
number, that is, Debian changes. However, it looks to me like the test
suite is being run to target a temporary folder and that should remove
cache effects. Nuke the test/cache folder between test suite runs to be
sure.
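
To illustrate the limitation, here is a sketch of a cache key that would
notice distribution-level changes. It is not how the test suite's cache
works; the function name, the "liblept5-package" entry and the version
strings in the usage note are all hypothetical.

```python
import hashlib


def cache_key(input_bytes, tool_versions):
    """Hash the input together with every tool version string.

    tool_versions maps a tool name to whatever version text is available,
    which may include a distribution package revision.
    """
    digest = hashlib.sha256()
    digest.update(input_bytes)
    for name in sorted(tool_versions):
        entry = "\0%s=%s" % (name, tool_versions[name])
        digest.update(entry.encode("utf-8"))
    return digest.hexdigest()
```

With a key like this, cache_key(data, {"leptonica": "1.73",
"liblept5-package": "1.73-4"}) and the same call with a patched package
revision produce different keys, so a rebuilt library forces a cache miss
even though leptonica still reports 1.73.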

All the failing tests relate to "check_monochrome_correlation", a function
that checks for close but not identical visual output compared to a
reference. Because of a now-fixed leptonica bug in one of the underlying
functions, I actually have a separate test that validates this helper
function, and that test passes on big endian.

The log shows that tesseract failed to properly detect page orientation and
came back with a low confidence answer. I interpret that to mean there are
endian issues in either tesseract or leptonica; the test isn't able to
distinguish.

It seems that the problem may be either a big endian issue in tesseract
alone (perhaps affecting multiple versions, since tesseract does not have
much of a test suite) or it's some leptonica API that tesseract invokes while
doing a page orientation check. Tesseract's test suite is very limited and
probably doesn't check for consistency here.

It looks like the patch is safe to apply and would be a net improvement even
though it doesn't fix all of the issues my test suite finds.


You can check orientation (skipping full OCR) in tesseract 3.04.01 with:

$ tesseract -l eng -psm 0 test_image.png stdout

The output for LinnSequencer.jpg on my macOS-x64 machine is:

$ tesseract -l eng -psm 0 tests/resources/LinnSequencer.jpg stdout
Warning in pixReadMemJpeg: work-around: writing to a temp file
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 31.48
Script: Latin
Script confidence: 100.95

From the logs, tesseract reports (orientation, confidence) = (0, 1.32) for
the same page on big endian, which means whatever data it is examining is
much noisier, i.e. probably corrupted by endian swizzling. Quite likely the
OCR output is garbage as well.

It might be interesting to see what the behavior differences are for
leptonica 1.73-patched, 1.74 and tesseract 3.04.01 and 4.00alpha all on big
endian. The results matrix from those combinations would probably indicate
whether to blame tesseract or leptonica.


Bug#813562: Test suite failure

2016-02-21 Thread James R Barlow
Great news. 4.0.2 is ready now.

I did find while updating my Docker image that Debian stretch's version of
Ghostscript (gs 9.16~dfsg-2.1) produces error messages and blank pages on
JPEG 2000 images. It's fixed in Sid, but the fix hasn't moved downstream
yet.

Thanks again for doing this.

On Sat, 20 Feb 2016 at 17:05 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Sat, Feb 20, 2016 at 03:27:00AM +, James R Barlow wrote:
> > Thanks for your help. Output order is due to multiprocessing.
>
> No problem.
>
> > That nailed it. tesseract 3.04.01 changed its output when asked to
> > determine page orientation. It's an improvement, but it breaks parsing.
> >
> > I will throw together a patch to make the appropriate distinctions.
>
> I thought you might appreciate knowing that version 4.0.2rc1 builds fine
> in a clean Debian Sid chroot, and the test suite passes as part of the
> package build.
>
> I'm looking forward to 4.0.2!  (Release candidates are not generally
> uploaded to Debian.)
>
> --
> Sean Whitton
>


Bug#815391: Ghostscript 9.16 (debian stretch) not linked against libopenjpeg

2016-02-20 Thread James R Barlow
Package: libgs9


Version: 9.16~dfsg-2.1

When asked to perform any operation on a PDF containing JPEG2000
(/JPXDecode) images, Ghostscript displays the following:

$ gs -dBATCH -dQUIET -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=_.pdf
anyjpeg2000.pdf

    ERROR: Unable to process JPXDecode data. Page will be missing data.

Ghostscript will leave any JPX encoded images blank but still produce the
rest of PDF output. Any PDF containing a JPX encoded image triggers this
behavior.
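
A scripted version of this check might look like the sketch below. The
function names are hypothetical. Because it is not stated here whether
Ghostscript prints the message on stdout or stderr, the sketch searches
both.

```python
import subprocess

# Fragment of the error message quoted above.
JPX_ERROR = "Unable to process JPXDecode data"


def gs_pdfwrite_command(input_pdf, output_pdf):
    """Build the same Ghostscript command line as shown above."""
    return [
        "gs", "-dBATCH", "-dQUIET", "-dNOPAUSE",
        "-sDEVICE=pdfwrite", "-sOutputFile=" + output_pdf,
        input_pdf,
    ]


def output_reports_jpx_failure(text):
    """Return True if Ghostscript's output contains the JPX error."""
    return JPX_ERROR in text


def gs_handles_jpx(input_pdf, output_pdf="_.pdf"):
    """Run Ghostscript on a JPEG 2000 PDF and report whether it coped."""
    result = subprocess.run(
        gs_pdfwrite_command(input_pdf, output_pdf),
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return not output_reports_jpx_failure(result.stdout + result.stderr)
```

gs_handles_jpx("anyjpeg2000.pdf") would return False on an affected build.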

I believe the issue is that this particular version of libgs9 and/or
ghostscript was not built --with-jp2. The packages also do not list
libopenjpeg as a dependency. Ghostscript has supported JPEG 2000 for a long
time.

System:
Linux 576f25bbd6dc 4.1.17-boot2docker #1 SMP Thu Feb 11 08:12:31 UTC 2016
x86_64 GNU/Linux, stretch/sid
built "FROM debian:stretch" on hub.docker.com.


Bug#813562: Test suite failure

2016-02-19 Thread James R Barlow
Thanks for your help. Output order is due to multiprocessing.

That nailed it. tesseract 3.04.01 changed its output when asked to
determine page orientation. It's an improvement, but it breaks parsing.

I will throw together a patch to make the appropriate distinctions.


$ tess-3.04.01 -psm 0 tests/resources/linn-west.jpg stdout
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 29.34
Script: Latin
Script confidence: 45.33

$ tess-3.04.00 -psm 0 tests/resources/linn-west.jpg stdout
Orientation: 3
Orientation in degrees: 90
Orientation confidence: 29.34
Script: 1
Script confidence: 45.33
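
A parser that copes with both formats could be sketched as follows. This is
not the actual patch. It assumes, based only on the one pair of outputs
above, that the new "Rotate" field carries the same value the old
"Orientation in degrees" field did; the function names are hypothetical.

```python
def parse_osd(output):
    """Split 'Key: value' lines of tesseract -psm 0 output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def correction_angle(output):
    """Return (angle in degrees, orientation confidence).

    Assumption: when "Rotate" is present (3.04.01 format) it is the angle
    to use; otherwise fall back to "Orientation in degrees" (3.04.00).
    """
    fields = parse_osd(output)
    if "Rotate" in fields:
        angle = int(fields["Rotate"])
    else:
        angle = int(fields["Orientation in degrees"])
    return angle, float(fields["Orientation confidence"])
```

Fed either of the two outputs above, correction_angle returns the same
angle and confidence, which is the distinction the parser needs to make.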



On Fri, Feb 19, 2016 at 16:28 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Fri, Feb 19, 2016 at 10:45:51PM +, James R Barlow wrote:
> > In any case, could you try running this:
> > ocrmypdf --rotate-pages tests/resources/cardinal.pdf out.pdf
> >
> > In cardinal.pdf the same page is rotated in each cardinal direction.
> out.pdf
> > should have all pages facing up. Is this the case? The output will also
> give
> > information on rotation status:
> > INFO - 1: page is facing ⇧, confidence 18.69
> > INFO - 3: page is facing ⇩, confidence 21.86 - correcting rotation
> > INFO - 4: page is facing ⇦, confidence 20.71 - correcting rotation
> > INFO - 2: page is facing ⇨, confidence 21.63 - correcting rotation
> > INFO - 3: rotating image layer 180 degrees
> > INFO - 2: rotating image layer 90 degrees
> > INFO - 4: rotating image layer 270 degrees
>
> No, it gets it wrong.  Result attached, and the output:
>
> ,
> | root@artemis:/build/ocrmypdf-4.0.1# ocrmypdf --rotate-pages
> tests/resources/cardinal.pdf out.pdf
> | INFO -1: page is facing ⇧, confidence 18.69
> | INFO -2: page is facing ⇦, confidence 21.63 - correcting rotation
> | INFO -3: page is facing ⇩, confidence 21.86 - correcting rotation
> | INFO -4: page is facing ⇨, confidence 20.71 - correcting rotation
> | INFO -2: rotating image layer 270 degrees
> | INFO -3: rotating image layer 180 degrees
> | INFO -4: rotating image layer 90 degrees
> `
>
> (note that the order it processes the pages in is different to your
> example)
>
> > It would also help to try in python3:
> >
> > >>> import ocrmypdf.leptonica as lp
> > >>> lp.getLeptonicaVersion()
> >
> > ...to see if there's anything unusual about how debian sid is reporting
> the
> > leptonica version.
>
> ,
> | root@artemis:/build/ocrmypdf-4.0.1# cd /usr/lib/python3/dist-packages
> | root@artemis:/usr/lib/python3/dist-packages# python3
> | Python 3.5.1+ (default, Jan 13 2016, 15:09:18)
> | [GCC 5.3.1 20160101] on linux
> | Type "help", "copyright", "credits" or "license" for more information.
> | >>> import ocrmypdf.leptonica as lp
> | >>> lp.getLeptonicaVersion()
> | 'leptonica-1.73'
> `
>
> --
> Sean Whitton
>


Bug#813562: Test suite failure

2016-02-19 Thread James R Barlow
I ran into a similar failure because leptonica 1.71 has an integer overflow
bug in the function pixCorrelationBinary which I use only in the test suite
to check if some output PDFs visually resemble an expected reference PDF. I
rewrote that function in Python for the older versions. The relevant code
is ocrmypdf.leptonica.Pix.correlation_binary. I added a test that only
exercises pixCorrelationBinary (test_monochrome_correlation), and this one
passed.
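
For reference, the correlation in question can be sketched in pure Python.
This is not the code in ocrmypdf.leptonica.Pix.correlation_binary; it is a
minimal illustration that takes two equal-length sequences of 0/1 pixels
and applies the formula from Leptonica's description of
pixCorrelationBinary: the squared count of pixels set in both images,
divided by the product of the counts of pixels set in each image.

```python
def correlation_binary(image_a, image_b):
    """Correlation of two binary images given as flat 0/1 sequences.

    Returns (|A AND B| ** 2) / (|A| * |B|), or 0.0 if either is empty.
    """
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    count_a = sum(1 for pixel in image_a if pixel)
    count_b = sum(1 for pixel in image_b if pixel)
    if count_a == 0 or count_b == 0:
        return 0.0
    count_and = sum(1 for a, b in zip(image_a, image_b) if a and b)
    # Python integers do not overflow, so the products are exact.
    return (count_and * count_and) / (count_a * count_b)
```

Identical images score 1.0 and images with no foreground pixels in common
score 0.0, so a test can require the score to exceed some threshold.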

I checked that the tests can pass in the Docker version (they are slightly
broken for an unrelated reason), which is debian stretch which has
leptonica 1.73 (good version) and the same set of libraries as yours. The
one difference is tesseract 3.04.01 vs .00, but I compiled the tesseract
3.04.01 and found that made no difference.

In any case, could you try running this:
ocrmypdf --rotate-pages tests/resources/cardinal.pdf out.pdf

In cardinal.pdf the same page is rotated in each cardinal direction.
out.pdf should have all pages facing up. Is this the case? The output will
also give information on rotation status:
INFO - 1: page is facing ⇧, confidence 18.69
INFO - 3: page is facing ⇩, confidence 21.86 - correcting rotation
INFO - 4: page is facing ⇦, confidence 20.71 - correcting rotation
INFO - 2: page is facing ⇨, confidence 21.63 - correcting rotation
INFO - 3: rotating image layer 180 degrees
INFO - 2: rotating image layer 90 degrees
INFO - 4: rotating image layer 270 degrees

That would help establish whether something is actually wrong or the test
case is somehow at fault.
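
Because pages are processed in parallel, the log lines can arrive in any
order, so two runs are best compared by page number. A sketch of such a
comparison, with hypothetical function names and a regular expression
written against the log format shown above:

```python
import re

# Matches e.g. "INFO - 3: page is facing X, confidence 21.86".
FACING_LINE = re.compile(
    r"INFO\s*-\s*(\d+): page is facing ([^\s,]+), confidence ([\d.]+)"
)


def facing_by_page(log_text):
    """Map page number -> (facing symbol, confidence) from a log."""
    pages = {}
    for match in FACING_LINE.finditer(log_text):
        page, facing, confidence = match.groups()
        pages[int(page)] = (facing, float(confidence))
    return pages


def differing_pages(expected_log, observed_log):
    """Return the sorted page numbers whose detected facing differs."""
    expected = facing_by_page(expected_log)
    observed = facing_by_page(observed_log)
    pages = sorted(set(expected) | set(observed))
    missing = (None, None)
    return [
        page for page in pages
        if expected.get(page, missing)[0] != observed.get(page, missing)[0]
    ]
```

An empty result from differing_pages means both runs agree on every page,
regardless of the order in which the lines were printed.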

It would also help to try in python3:

>>> import ocrmypdf.leptonica as lp
>>> lp.getLeptonicaVersion()

...to see if there's anything unusual about how debian sid is reporting the
leptonica version.


On Fri, 19 Feb 2016 at 12:04 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Fri, Feb 19, 2016 at 07:11:32AM +, James R Barlow wrote:
> > What version of leptonica is installed?
> > tesseract --version will report this.
>
> From within my Sid chroot:
>
> root@artemis:/build/ocrmypdf-4.0.1# tesseract --version
> tesseract 3.04.01
>  leptonica-1.73
>   libgif 5.1.2 : libjpeg 6b (libjpeg-turbo 1.4.2) : libpng 1.2.54 :
> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>
> > Also what's the file name for liblept?
>
> The Debian liblept package provides:
>
> /usr/lib/liblept.so.5
> /usr/lib/liblept.so.5.0.0
>
> --
> Sean Whitton
>


Bug#813562: Test suite failure

2016-02-18 Thread James R Barlow
I have seen a similar problem.

What version of leptonica is installed?
tesseract --version will report this.
Also what's the file name for liblept?

On Thu, Feb 18, 2016 at 21:29 Sean Whitton  wrote:

> Dear James,
>
> OCRmyPDF's test suite is currently failing under a freshly-installed
> Debian Sid chroot.  I've attached the output to this e-mail.
>
> Since the test suite worked on yesterday's version of Debian Sid, I
> think that this must be due to a bug introduced in a new version of one
> of the dependencies.  That means it's my job to figure out what the problem
> is, and it is unlikely to be a bug in OCRmyPDF for you to fix.  I'm
> e-mailing you just in case the problem is obvious to you from reading
> the output.
>
> Thanks.
>
> --
> Sean Whitton
>


Bug#813562: Copyright clarification of some test resources files

2016-02-16 Thread James R Barlow
Yes, I'll tag a release when it's ready to go.

On Tue, 16 Feb 2016 at 17:26 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Tue, Feb 16, 2016 at 01:55:47PM +, James R Barlow wrote:
> > I updated my tree, eliminating one file whose status I couldn't
> determine,
> > replacing another with an equivalent free file and documenting the rest.
> The
> > file has also been renamed to README so that Github shows it
> automatically.
> > https://github.com/jbarlow83/OCRmyPDF/tree/develop/tests/resources
>
> Thank you for taking the time to clarify the status of those files.
>
> > Please note this is in my develop branch at the moment, not yet merged
> > into master until I fix a few broken test cases (unrelated reasons).
>
> Okay.  Are you planning to tag a release after you merge into master?
> If a new release is imminent I'll hold off uploading my package.  No
> pressure; just let me know your plans.
>
> > Does a Deb package usually run tests? From the point of view of saving
> > some disk space, maybe this is not wanted.
>
> Yes, if there is an upstream test suite then a Debian package should run
> it at package build time.  We also have a CI farm running tests and I
> intend to run OCRmyPDF's tests there too.
>
> It wouldn't help with disc space as Debian packages are meant to include
> an untouched copy of upstream releases.
>
> --
> Sean Whitton
>


Bug#813562: Copyright clarification of some test resources files

2016-02-16 Thread James R Barlow
I updated my tree, eliminating one file whose status I couldn't determine,
replacing another with an equivalent free file and documenting the rest.
The file has also been renamed to README so that Github shows it
automatically.
https://github.com/jbarlow83/OCRmyPDF/tree/develop/tests/resources

Please note this is in my develop branch at the moment, not yet merged into
master until I fix a few broken test cases (unrelated reasons).

Does a Deb package usually run tests? From the point of view of saving some
disk space, maybe this is not wanted.



On Mon, 15 Feb 2016 at 18:25 Sean Whitton  wrote:

> Dear James,
>
> Thank you for providing the file tests/resources/NOTE.rst in the
> OCRmyPDF repository.  Unfortunately, I can't account for the following
> files in the same directory:
>
> - francais.pdf
> - graph_ocred.pdf
> - graph.pdf
> - missing_docinfo.pdf
> - Test_Issue_28.pdf
>
> Do you know the copyright status of these files?  If not, I can exclude
> them from the Debian package and disable the relevant tests, but if you
> have reason to believe they are public domain or created by you or under
> a free license then it would be great if you could share that
> information.
>
> --
> Sean Whitton
>


Bug#813562: Project maintainer here

2016-02-13 Thread James R Barlow
I'd suggest putting ocrmypdf in a submodule with the Debian things outside.
Then setuptools should pick up tags correctly rather than generating
whatever "git describe" decides to call the current revision in your merged
repo.

On Sat, Feb 13, 2016 at 16:46 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> Thank you for your e-mail.
>
> On Sat, Feb 13, 2016 at 01:35:10AM +, James R Barlow wrote:
> > On Fri, 12 Feb 2016 at 17:05 Sean Whitton <spwhit...@spwhitton.name>
> wrote:
> > I have a non-packaging question that I'd like to take this
> opportunity
> > to ask you: in your changelog entry for 3.2, it's explained that the
> new
> > "lossless reconstruction" feature is disabled by --deskew and
> > --clean-final but otherwise PDF contents are now added to but not
> > modified by OCRmyPDF.  I had observed that OCRmyPDF makes my PDFs
> much
> > smaller without making them any harder to read, presumably by
> changing
> > the content, and I rather liked this feature.  Can I turn it back on?
> > Or was --clean-final doing this and turning that on would be enough?
> > Oh, interesting. By smaller I take it you mean the file size was reduced, not
> > resampling of images. Any chance you can send me an example input PDF?
> (Dropbox
> > is best.)
>
> Sure, I'll do that once I can make my 3.2 package build.
>
> > If you build the package around a wheel or tarball obtained from PyPI,
> > setuptools_scm should be able to get the version out. It will fail to
> determine
> > the version from a Github tarball.
>
> I'm trying to build out of git: I have a branch with the Debian control
> files and I merged your 3.2 tag into that.  Do you know how I can make
> setuptools_scm successfully determine the version from that?  How do you
> do your builds during development?
>
> --
> Sean Whitton
>


Bug#813562: Project maintainer here

2016-02-12 Thread James R Barlow
Let me know if you'd like to see any changes to help with packaging.

If you are packaging around 3.1.1, versions older than 3.2.1 are
incompatible with the recently released img2pdf 0.2.0; they require 0.1.5,
and they do not enforce this dependency on their own. If you try to install
any of them now, they will pull in img2pdf 0.1.5 and break. You would need
to patch in img2pdf==0.1.5.

My current development branch adds a new dependency on cffi (libffi) to
access leptonica (also a tesseract dependency), and automatic fixing of
page rotation.

-JRB


Bug#813562: Project maintainer here

2016-02-12 Thread James R Barlow
On Fri, 12 Feb 2016 at 17:05 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Fri, Feb 12, 2016 at 03:23:39AM -0800, James R Barlow wrote:
> > Let me know if you'd like to see any changes to help with packaging.
>
> Thank you for your input, and for OCRmyPDF.
>
> I have a non-packaging question that I'd like to take this opportunity
> to ask you: in your changelog entry for 3.2, it's explained that the new
> "lossless reconstruction" feature is disabled by --deskew and
> --clean-final but otherwise PDF contents are now added to but not
> modified by OCRmyPDF.  I had observed that OCRmyPDF makes my PDFs much
> smaller without making them any harder to read, presumably by changing
> the content, and I rather liked this feature.  Can I turn it back on?
> Or was --clean-final doing this and turning that on would be enough?
>
>
Oh, interesting. By smaller I take it you mean the file size was reduced, not
resampling of images. Any chance you can send me an example input PDF?
(Dropbox is best.)

I did increase the JPEG quality that Ghostscript uses when transcoding
JPEGs, mostly as an added safety margin, but I can make that optional.
Maybe that affects file size more than I thought.


> > If you are packaging around 3.1.1, versions older than 3.2.1 are
> > incompatible with the recently released img2pdf 0.2.0; they require
> > 0.1.5, and they do not enforce this dependency on their own.
>
> I've got a working package for 3.1 but I'm now trying to update my
> packaging for the 3.2 series before I try to find a sponsor DD to upload
> to Debian.  I'm figuring out how your change to use setuptools-scm can
> be made to work with the Debian toolchain.
>
>
If you build the package around a wheel or tarball obtained from PyPI,
setuptools_scm should be able to get the version out. It will fail to
determine the version from a Github tarball.
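
The difference is that a PyPI sdist ships a PKG-INFO file recording the
version, which setuptools_scm can fall back on when there is no git
metadata, while a Github tarball has neither. As an illustration of what
that file provides (the function name is hypothetical, and the version
number in the usage note is only an example):

```python
from email.parser import Parser


def version_from_pkg_info(pkg_info_text):
    """Read the Version header from PKG-INFO text, or None if absent.

    PKG-INFO uses RFC 822 style headers, so the standard library's email
    parser can read it.
    """
    headers = Parser().parsestr(pkg_info_text, headersonly=True)
    return headers["Version"]
```

Given PKG-INFO text containing the line "Version: 3.2.1", the function
returns "3.2.1"; given text without a Version header it returns None.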

> > My current development branch adds a new dependency on cffi (libffi) to
> > access leptonica (also a tesseract dependency), and automatic fixing of
> > page rotation.
>
> Cool!


> --
> Sean Whitton
>