Re: PDFBox 2.0.33 release

2025-01-12 Thread Andreas Lehmkühler
Just a friendly reminder: I'm planning to cut the release today, about 10 hours from now. Andreas On 30.12.24 at 10:43, Andreas Lehmkühler wrote: Hi, IMHO it is time to cut another 2.0.x release. I'm planning to do so two weeks from now. Any objections? Andreas P.S.: I'd like to cut

Re: PDFBox 2.0.33 release

2025-01-12 Thread Andreas Lehmkühler
On 12.01.25 at 13:58, sahy...@fileaffairs.de wrote: On Sunday, 12.01.2025 at 13:24 +0100, Andreas Lehmkühler wrote: On 08.01.25 at 04:56, Tilman Hausherr wrote: On 07.01.2025 15:00, Tilman Hausherr wrote: - mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 | creatinga: 3 |

Re: PDFBox 2.0.33 release

2025-01-12 Thread sahy...@fileaffairs.de
On Sunday, 12.01.2025 at 13:24 +0100, Andreas Lehmkühler wrote: On 08.01.25 at 04:56, Tilman Hausherr wrote: On 07.01.2025 15:00, Tilman Hausherr wrote: - mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 | creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message:

Re: PDFBox 2.0.33 release

2025-01-12 Thread Andreas Lehmkühler
On 08.01.25 at 04:56, Tilman Hausherr wrote: On 07.01.2025 15:00, Tilman Hausherr wrote: - mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 | creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 | ons: 2 | 0or: 1", "creatinga" and "anand" DO NOT APPEAR in ordinary text e

Re: PDFBox 2.0.33 release

2025-01-07 Thread Tilman Hausherr
On 07.01.2025 15:00, Tilman Hausherr wrote: - mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 | creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 | ons: 2 | 0or: 1", "creatinga" and "anand" DO NOT APPEAR in ordinary text extractions, not even with Tika from the command li

Re: PDFBox 2.0.33 release

2025-01-07 Thread Tilman Hausherr
On 07.01.2025 14:10, Tilman Hausherr wrote: latest: https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.32_vs_2.0.33-6.tar.xz So this is pretty good now. Here's what I found: - superscript degradation ("1 coupled" becomes "1coupled"): annoying, but should be solved separately some day with

Re: PDFBox 2.0.33 release

2025-01-07 Thread Tilman Hausherr
latest: https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.32_vs_2.0.33-6.tar.xz

Re: PDFBox 2.0.33 release

2025-01-06 Thread Tilman Hausherr
On 06.01.2025 10:19, Tilman Hausherr wrote: https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.32_vs_2.0.33-4.tar.xz This is starting to look much better. I'm doing another "B" run that includes the last change from PDFBOX-5920. It does not yet include the revert from PDFBOX-5384. ht

Re: PDFBox 2.0.33 release

2025-01-06 Thread Tilman Hausherr
On 05.01.2025 18:17, Tilman Hausherr wrote: The last commit also fixes the changes in 2 Italian files, e.g. P5ZY75DGZMAP3VYLGTWRWTIOZN7IPXJO. However, it turns out that their text extraction was terrible before. The "La storia" part below the blue bar in fine print is also difficult to read f

Re: PDFBox 2.0.33 release

2025-01-05 Thread Tilman Hausherr
On 05.01.2025 18:17, Tilman Hausherr wrote: The last commit also fixes the changes in 2 Italian files, e.g. P5ZY75DGZMAP3VYLGTWRWTIOZN7IPXJO. However, it turns out that their text extraction was terrible before. The "La storia" part below the blue bar in fine print is also difficult to read for

Re: PDFBox 2.0.33 release

2025-01-05 Thread Tilman Hausherr
The last commit also fixes the changes in 2 Italian files, e.g. P5ZY75DGZMAP3VYLGTWRWTIOZN7IPXJO. However, it turns out that their text extraction was terrible before. The "La storia" part below the blue bar in fine print is also difficult to read for a human. I'm gonna start another "B" run so

Re: PDFBox 2.0.33 release

2025-01-05 Thread Andreas Lehmkühler
Hi, thanks for running the tests. I'm going to have a look as well. Andreas On 05.01.25 at 13:47, Tilman Hausherr wrote: On 04.01.2025 21:26, Tilman Hausherr wrote: After that, I'll do another "B" run but with ActualText disabled, because this is responsible for most of the differences. S

Re: PDFBox 2.0.33 release

2025-01-05 Thread Tilman Hausherr
On 04.01.2025 21:26, Tilman Hausherr wrote: After that, I'll do another "B" run but with ActualText disabled, because this is responsible for most of the differences. Some are improvements, some are not. https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.32_vs_2.0.33-3.tar.xz Currently

Re: PDFBox 2.0.33 release

2025-01-04 Thread Tilman Hausherr
After that, I'll do another "B" run but with ActualText disabled, because this is responsible for most of the differences. Some are improvements, some are not. https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.32_vs_2.0.33-3.tar.xz

Re: PDFBox 2.0.33 release

2025-01-04 Thread Tilman Hausherr
On 04.01.2025 14:27, Tilman Hausherr wrote: On 04.01.2025 10:46, Tilman Hausherr wrote: https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.32_vs_2.0.33.tar.xz definitely some work to do; one exception in CMap and some content differences / losses. I'm currently doing another "B" run to

Re: PDFBox 2.0.33 release

2025-01-04 Thread Tilman Hausherr
On 04.01.2025 10:46, Tilman Hausherr wrote: https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.32_vs_2.0.33.tar.xz definitely some work to do; one exception in CMap and some content differences / losses. I'm currently doing another "B" run to be sure that there are no further exceptions

Re: PDFBox 2.0.33 release

2025-01-04 Thread Tilman Hausherr
https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.32_vs_2.0.33.tar.xz definitely some work to do; one exception in CMap and some content differences / losses. The 2.0.33 snapshot version is from 31.12, I forgot to build 2.0. However, there have been no changes related to parsing or text extra

Re: PDFBox 2.0.33 release

2024-12-30 Thread Tilman Hausherr
+1 I'm setting myself a reminder to start regression tests for 2.0 on 4.1 Tilman On 30.12.2024 10:43, Andreas Lehmkühler wrote: Hi, IMHO it is time to cut another 2.0.x release. I'm planning to do so two weeks from now. Any objections? Andreas P.S.: I'd like to cut the next 3.0.x release