Re: 2.0.30 vs 2.0.31 reloaded

2024-03-28 Thread Tim Allison
W00t! Thank you, Tilman! And thank you @Maruan Sahyoun for hosting the regression server!!! On Wed, Mar 27, 2024 at 3:48 PM Tilman Hausherr wrote: > During the tika regression tests it turned out that there is a longer > PDF list, so I ran the tests again with that longer list. The good news

Re: [VOTE] Release Apache PDFBox 2.0.31

2024-03-22 Thread Tim Allison
Not a showstopper for me. +1 Thank you! On Fri, Mar 22, 2024 at 1:59 AM Maruan Sahyoun wrote: > IMHO this is not a show stopper > > > > On 22.03.2024 at 06:54, Andreas Lehmkühler > wrote: > > > > > > > >> On 21.03.24 at 20:07, Tim Alliso

Re: [VOTE] Release Apache PDFBox 2.0.31

2024-03-21 Thread Tim Allison
In the parent pom.xml in the zip file, there's a "release" submodule specified. However, there's no release directory in the src zip that would match: https://svn.apache.org/repos/asf/pdfbox/tags/2.0.31/release/ Is that expected? On Thu, Mar 21, 2024 at 1:53 PM Andreas Lehmkühler wrote: > Hi,

Re: [VOTE] Release Apache PDFBox 3.0.1

2023-11-28 Thread Tim Allison
+1 Thank you! On Tue, Nov 28, 2023 at 5:49 AM Timo Boehme wrote: > +1, > > Thanks, > Timo > > > On 27.11.23 at 17:46, Andreas Lehmkühler wrote: > > Hi, > > > > a candidate for the PDFBox 3.0.1 release is available at: > > > > https://dist.apache.org/repos/dist/dev/pdfbox/3.0.1/ > > > >

Re: PDFBox 3.0.1 release?

2023-11-27 Thread Tim Allison
Doh. I didn't. Thank you, @Tilman! On Wed, Nov 22, 2023 at 2:08 AM Andreas Lehmkühler wrote: > Hi, > > after fixing the latest regressions I'd like to cut the 3.0.1 release > next Monday/Tuesday. > > WDYT? > > @Tim, @Tilman do you have the time to run the extraction tests 3.0.0 vs > 3.0.1 ? > >

[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-18 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766567#comment-17766567 ] Tim Allison commented on PDFBOX-5682: - Wow. Thank you! > Long/permanent hang in PDFBox

[jira] [Comment Edited] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764228#comment-17764228 ] Tim Allison edited comment on PDFBOX-5682 at 9/12/23 2:41 PM

[jira] [Comment Edited] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764225#comment-17764225 ] Tim Allison edited comment on PDFBOX-5682 at 9/12/23 2:41 PM: -- Thank you

[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764228#comment-17764228 ] Tim Allison commented on PDFBOX-5682: - This is the part from that document that is, erm, eye-opening

[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764225#comment-17764225 ] Tim Allison commented on PDFBOX-5682: - Thank you, [~lehmi]. In Tika, we initially copied PDFBox's

[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-11 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763903#comment-17763903 ] Tim Allison commented on PDFBOX-5682: - Both files spend quite a bit of time

[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-11 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763904#comment-17763904 ] Tim Allison commented on PDFBOX-5682: - It looks like that causes a full parse of the file? > L

[jira] [Updated] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-11 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5682: Summary: Long/permanent hang in PDFBox 3.x (was: Long/permanent hang i n PDFBox 3.x) > L

[jira] [Created] (PDFBOX-5682) Long/permanent hang i n PDFBox 3.x

2023-09-11 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5682: --- Summary: Long/permanent hang i n PDFBox 3.x Key: PDFBOX-5682 URL: https://issues.apache.org/jira/browse/PDFBOX-5682 Project: PDFBox Issue Type: Bug

[jira] [Commented] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763759#comment-17763759 ] Tim Allison commented on PDFBOX-5681: - When I run the demo code in PDFBox trunk with logging on, I

[jira] [Commented] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763754#comment-17763754 ] Tim Allison commented on PDFBOX-5681: - I initially thought this was a threading issue, but it isn't

[jira] [Created] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5681: --- Summary: ConcurrentModificationException in getObjectsByType() in 3.x Key: PDFBOX-5681 URL: https://issues.apache.org/jira/browse/PDFBOX-5681 Project: PDFBox

[jira] [Updated] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5681: Affects Version/s: 3.0.0 PDFBox > ConcurrentModificationException in getObjectsByType() in

[jira] [Updated] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5681: Description: [~tilman]'s regression testing turned up this exception when we integrate PDFBox

[jira] [Updated] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5681: Issue Type: Bug (was: Task) > ConcurrentModificationException in getObjectsByType() in

PDFBox in AWS blog

2023-07-13 Thread Tim Allison
https://aws.amazon.com/blogs/machine-learning/retain-original-pdf-formatting-to-view-translated-documents-with-amazon-textract-amazon-translate-and-pdfbox/

Re: PDFBox 2.0.29 release?

2023-06-02 Thread Tim Allison
.writeText(PDFTextStripper.java:238) at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108) On Wed, May 31, 2023 at 1:41 PM Tilman Hausherr wrote: > Yes please > > Thanks > > Tilman > > On 31.05.2023 17:15, Tim Allison wrote: > > +1 > > > > Let me know when/if

Re: PDFBox 2.0.29 release?

2023-05-31 Thread Tim Allison
+1 Let me know when/if I should run the text extraction regression tests. On Thu, May 25, 2023 at 12:32 PM sahy...@fileaffairs.de < sahy...@fileaffairs.de> wrote: > +1 > > Maruan > > On Wednesday, 24.05.2023 at 07:48 +0200, Andreas Lehmkuehler wrote: > > Hi, > > > > I tend to release 2.0.29

New CommonCrawl-based PDF-focused corpus

2023-05-16 Thread Tim Allison
8 million PDFs/8TB from a month of Common Crawl. We refetched ~2 million truncated files. Zips of PDFs are available here: https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/ Peter Wyatt (PDF Association)'s writeup is here:

[jira] [Updated] (PDFBOX-5595) Slight regression on corrupt bug tracker file

2023-05-05 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5595: Description: I'm not sure this is a regression, and apologies if you already dealt

[jira] [Created] (PDFBOX-5595) Slight regression on corrupt bug tracker file

2023-05-05 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5595: --- Summary: Slight regression on corrupt bug tracker file Key: PDFBOX-5595 URL: https://issues.apache.org/jira/browse/PDFBOX-5595 Project: PDFBox Issue Type

Re: [VOTE] Release Apache PDFBox 2.0.28

2023-04-11 Thread Tim Allison
+1 Thank you! On Mon, Apr 10, 2023 at 12:08 PM sahy...@fileaffairs.de wrote: > > +1 > > Maruan > > On Monday, 10.04.2023 at 12:15 +0200, Andreas Lehmkuehler wrote: > > Hi, > > > > a candidate for the PDFBox 2.0.28 release is available at: > > > >

Re: Fwd: 2.0.28 release?

2023-04-10 Thread Tim Allison
on. At least the exception > is > gone, maybe there is some more content or just an empty page. > > However, IMHO that isn't a regression, but a (small) improvement. > > @Tim Thanks for running the tests > > Andreas > > > On 10.04.23 at 12:54, Tim Allison wrote: >

Re: Fwd: 2.0.28 release?

2023-04-10 Thread Tim Allison
044) On Mon, Apr 10, 2023 at 6:41 AM Tim Allison wrote: > > Y. Will start process now. Thank you! > > On Mon, Apr 10, 2023 at 6:20 AM Andreas Lehmkuehler wrote: > > > > Hi, > > > > I've finished the release process and provided a release candidate for > >

Re: Fwd: 2.0.28 release?

2023-04-10 Thread Tim Allison
> >>>>> pdf well. > >>>>> > >>>>> IMHO we should leave it alone, as it is malformed and doesn't contain > >>>>> any > >>>>> useful content. More important, it is one pdf out of hundreds of >

Re: Re: Fwd: 2.0.28 release?

2023-04-03 Thread Tim Allison
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.27-v-2.0.28-20230403-reports.tgz Haven't had a chance to take a look yet. :( On Mon, Apr 3, 2023 at 6:53 AM Tilman Hausherr wrote: > > Don't wait please > Thanks > Tilman > > > > --- Original message --- > From

Re: Fwd: 2.0.28 release?

2023-04-03 Thread Tim Allison
Lehmkuehler: > >> > >> I've accidentally sent this to Tim only :-| > >> > >> Forwarded message > >> Subject: Re: 2.0.28 release? > >> Date: Fri, 31 Mar 2023 07:50:10 +0200 > >> From: Andreas Lehmkuehler >

Re: 2.0.28 release?

2023-03-30 Thread Tim Allison
Reports are here: https://corpora.tika.apache.org/base/reports/pdfbox-2.0.27-v-2.0.28-SNAPSHOT.tgz On Tue, Mar 28, 2023 at 10:42 PM Tilman Hausherr wrote: > > Yes please! > > Thanks > > Tilman > > On 28.03.2023 19:22, Tim Allison wrote: > > +1 > > >

Re: 2.0.28 release?

2023-03-28 Thread Tim Allison
+1 Should I run the regression tests now or is there anything else text related that is still being worked on? On Tue, Mar 28, 2023 at 1:05 PM Tilman Hausherr wrote: > > +1 > > Tilman > > On 28.03.2023 08:46, Andreas Lehmkuehler wrote: > > Hi, > > > > how about cutting a 2.0.28 release next

[jira] [Updated] (PDFBOX-5550) reduce number of open files

2022-12-05 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5550: Summary: reduce number of open files (was: redcuce number of open files) > reduce number of o

[jira] [Commented] (PDFBOX-5540) export:text creates jibberish / malformed output

2022-11-17 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635337#comment-17635337 ] Tim Allison commented on PDFBOX-5540: - Should I kick that off now? > export:text creates jibber

Re: [VOTE] Release Apache PDFBox 2.0.27

2022-09-26 Thread Tim Allison
+1 clean build with jdk 8 on macOS Monterey. shasum checks out, and this release candidate works with Tika's unit tests. Thank you! Cheers, Tim P.S. Should I open an issue to remove the printlns in AbstractSchemaTester? On Mon, Sep 26, 2022 at 11:29 AM Andreas Lehmkuehler wrote: > > a

Re: Release 2.0.27

2022-09-20 Thread Tim Allison
>PS thanks for doing the test! Thank _you_ for the confirmation! Onwards!

Re: Release 2.0.27

2022-09-20 Thread Tim Allison
As confirmation, did we expect more diffs/ did I botch the runs and compare the same version? There are a few _diffs_ at least. On Tue, Sep 20, 2022 at 8:41 AM Tim Allison wrote: > > Reports are here: > https://corpora.tika.apache.org/base/reports/pdfbox-2.0.27-pre-rc.tgz > > L

Re: Release 2.0.27

2022-09-20 Thread Tim Allison
> Andreas > > On 19.09.22 at 21:09, Tim Allison wrote: > > I should have time tomorrow/Wednesday. Thank you! > > > > On Mon, Sep 19, 2022 at 2:30 PM Tilman Hausherr > > wrote: > >> > >> On 19.09.2022 08:22, Andreas Lehmkuehler wrote: >

Re: Release 2.0.27

2022-09-19 Thread Tim Allison
I should have time tomorrow/Wednesday. Thank you! On Mon, Sep 19, 2022 at 2:30 PM Tilman Hausherr wrote: > > On 19.09.2022 08:22, Andreas Lehmkuehler wrote: > > 1.8.17 is out of the door and I guess it is time for 2.0.27 release. > > > > @Tim or @Tilman > > Is there any chance to run the

[jira] [Commented] (PDFBOX-5501) Jempbox is slow on xmp with large event histories

2022-09-10 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602789#comment-17602789 ] Tim Allison commented on PDFBOX-5501: - Thank you! > Jempbox is slow on xmp with large ev

[jira] [Resolved] (PDFBOX-5501) Jempbox is slow on xmp with large event histories

2022-09-08 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved PDFBOX-5501. - Resolution: Not A Problem Y. I just also confirmed that this is fixed in 1.8.17-SNAPSHOT

[jira] [Created] (PDFBOX-5501) Jempbox is slow on xmp with large event histories

2022-09-08 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5501: --- Summary: Jempbox is slow on xmp with large event histories Key: PDFBOX-5501 URL: https://issues.apache.org/jira/browse/PDFBOX-5501 Project: PDFBox Issue Type

timeout on Jempbox xmp media management schema's getHistory

2022-09-07 Thread Tim Allison
All, This issue is ringing a bell. I'm sorry if there's an open issue or you/we've decided long ago that this is not an issue. One of the timeouts in the most recent run was caused by Jempbox's handling of the history in the media management schema. There are 32000 elements in the history.

[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578904#comment-17578904 ] Tim Allison commented on PDFBOX-5490: - Y. Completely understand. I don't want to impede 3.0.0

[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-11 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578510#comment-17578510 ] Tim Allison commented on PDFBOX-5490: - My initial request would be for whether or not the xref table

[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-10 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578129#comment-17578129 ] Tim Allison commented on PDFBOX-5490: - Oh, that looks great. > Add reconstruction informat

[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-10 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578055#comment-17578055 ] Tim Allison commented on PDFBOX-5490: - A Listener would be great. Any mechanism that would allow

[jira] [Updated] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-10 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5490: Component/s: Parsing > Add reconstruction information to the PDDocum

[jira] [Created] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-10 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5490: --- Summary: Add reconstruction information to the PDDocument Key: PDFBOX-5490 URL: https://issues.apache.org/jira/browse/PDFBOX-5490 Project: PDFBox Issue Type

Re: text extraction regression tests for 3.x?

2022-06-17 Thread Tim Allison
I wouldn't. :D On Thu, Jun 16, 2022 at 12:16 PM Tilman Hausherr wrote: > On 15.06.2022 at 12:19, Tim Allison wrote: > > Reports are here: > > https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz > > govdocs1/372/372582.pdf > commoncrawl3/KH/KHDACXIP

Re: text extraction regression tests for 3.x?

2022-06-15 Thread Tim Allison
fetched/OL/OLZ5TAS53B4BDC673OFMWZE5DDZ7ZGIN On Wed, Jun 15, 2022 at 6:49 AM Tim Allison wrote: > I had a chance to look at new_catastrophic_exceptions_in_b, and the three > files in there take roughly the same amount of time and resources. I think > they failed on trunk only b

Re: text extraction regression tests for 3.x?

2022-06-15 Thread Tim Allison
AM Tim Allison wrote: > Reports are here: > https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz > > On Mon, Jun 13, 2022 at 4:54 PM Tim Allison wrote: > >> Just seeing this now. Y. I'll kick off the tests tomorrow morning (ET). >> >> On Sat

Re: text extraction regression tests for 3.x?

2022-06-15 Thread Tim Allison
Reports are here: https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz On Mon, Jun 13, 2022 at 4:54 PM Tim Allison wrote: > Just seeing this now. Y. I'll kick off the tests tomorrow morning (ET). > > On Sat, Jun 11, 2022 at 8:09 AM Andreas Lehmkuehler > wrote: >

Re: text extraction regression tests for 3.x?

2022-06-13 Thread Tim Allison
like there are some regressions, see PDFBOX-5444 and PDFBOX-5447. > >> > >> Maybe there are more to come > >> > >> Andreas > >> > >> > >> On 26.05.22 at 15:04, Tim Allison wrote: > >>> Apologies for my delay. I ran trunk/3.x

Re: text extraction regression tests for 3.x?

2022-05-31 Thread Tim Allison
... > > Andreas > > > On 26.05.22 at 15:04, Tim Allison wrote: > > Apologies for my delay. I ran trunk/3.x on May 12 against 2.0.26. The > > reports are here: > > > https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz > > > > Hap

Re: text extraction regression tests for 3.x?

2022-05-26 Thread Tim Allison
: > On 06.05.22 at 14:30, Tim Allison wrote: > > All, > >Let me know when it makes sense to run the text extraction regression > Yes, it'd be useful to have some updated results. > > How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs. > 3.0.0-alpha3

[jira] [Updated] (PDFBOX-5431) New NPE in xmpbox parser in trunk

2022-05-10 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5431: Description: I noticed a new NPE in one of our test files on Tika when I recently built PDFBox's

[jira] [Updated] (PDFBOX-5431) New NPE in xmpbox parser in trunk

2022-05-10 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5431: Component/s: XmpBox > New NPE in xmpbox parser in tr

[jira] [Updated] (PDFBOX-5431) New NPE in xmpbox parser in trunk

2022-05-10 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5431: Affects Version/s: 3.0.0 PDFBox > New NPE in xmpbox parser in tr

[jira] [Created] (PDFBOX-5431) New NPE in xmpbox parser in trunk

2022-05-10 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5431: --- Summary: New NPE in xmpbox parser in trunk Key: PDFBOX-5431 URL: https://issues.apache.org/jira/browse/PDFBOX-5431 Project: PDFBox Issue Type: Task

text extraction regression tests for 3.x?

2022-05-06 Thread Tim Allison
All, Let me know when it makes sense to run the text extraction regression tests for 3.x. I regret I haven't been following our mailing list as closely as I should be. Cheers, Tim

Re: [VOTE] Release Apache PDFBox 2.0.26

2022-04-18 Thread Tim Allison
+1 Thank you! On Mon, Apr 18, 2022 at 10:50 AM sahy...@fileaffairs.de wrote: > > +1 > Maruan > > On Monday, 18.04.2022 at 13:14 +0200, Andreas Lehmkuehler wrote: > > Hi, > > > > a candidate for the PDFBox 2.0.26 release is available at: > > > >

[jira] [Commented] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf

2022-04-14 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522531#comment-17522531 ] Tim Allison commented on PDFBOX-5415: - An answer on the Tika side. Yes, parsing is dangerous

[jira] [Commented] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf

2022-04-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521382#comment-17521382 ] Tim Allison commented on PDFBOX-5415: - Michael Demey's diagnosis: https://twitter.com/MyMilkedEek

[jira] [Updated] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf

2022-04-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5415: Affects Version/s: 2.0.26 > Infinite loop in ExtractText in 2.x branch on a specific

[jira] [Updated] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf

2022-04-12 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5415: Component/s: Parsing > Infinite loop in ExtractText in 2.x branch on a specific

[jira] [Created] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf

2022-04-12 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5415: --- Summary: Infinite loop in ExtractText in 2.x branch on a specific pdf Key: PDFBOX-5415 URL: https://issues.apache.org/jira/browse/PDFBOX-5415 Project: PDFBox

Re: 2.0.26 release

2022-04-11 Thread Tim Allison
Yes. Sorry. That's my fault. I did something stupid in 2.3.0 and then fixed it in 2.4.0-SNAPSHOT (TIKA-3711). On Mon, Apr 11, 2022 at 1:35 PM Tilman Hausherr wrote: > On 11.04.2022 at 12:40, Tim Allison wrote: > > https://corpora.tika.apache.org/base/reports/tika-2.4-20220410.tgz >

Re: 2.0.26 release

2022-04-11 Thread Tim Allison
https://corpora.tika.apache.org/base/reports/tika-2.4-20220410.tgz Haven't had a chance to review. Hot off the vm. On Sun, Apr 10, 2022 at 9:58 AM Tim Allison wrote: > > Will try to kick off today…first thing Monday morning (EDT) at the latest. > > On Sun, Apr 10, 2022 at 9:0

Re: 2.0.26 release

2022-04-10 Thread Tim Allison
Will try to kick off today…first thing Monday morning (EDT) at the latest. On Sun, Apr 10, 2022 at 9:05 AM Andreas Lehmkuehler wrote: > On 09.04.22 at 19:00, Tilman Hausherr wrote: > > testFlattenPDFBOX2469Filled also fails in 2.0 (it is disabled by > default). > I've fixed all new tickets.

Re: 2.0.26 release

2022-04-07 Thread Tim Allison
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.26-snapshot-reports.tgz I haven't had a chance to look at them yet. On Thu, Apr 7, 2022 at 9:07 AM Andreas Lehmkühler wrote: > > Yes, please > > Thanks in advance > Andreas > > 07.04.2022 11:44:38 Tim Allison : > >

[jira] [Resolved] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set

2022-04-07 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved PDFBOX-5396. - Fix Version/s: 2.0.26 Resolution: Fixed > Add maven enforcer rule to ens

Re: 2.0.26 release

2022-04-07 Thread Tim Allison
Sounds great! Should I rerun the regression tests today? On Thu, Apr 7, 2022 at 1:41 AM Andreas Lehmkuehler wrote: > Hi, > > sorry for the delay. I'm planning to cut the 2.0.26 release next > Saturday, the > day after tomorrow, if nobody objects. > > Andreas > > P.S.: I'm targeting a new 3.0.0

[jira] [Commented] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing

2022-03-25 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512474#comment-17512474 ] Tim Allison commented on PDFBOX-5401: - bq. Hi, I didn't test these samples on PDFBOX 2.0 Sorry, my

[jira] [Comment Edited] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing

2022-03-25 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512397#comment-17512397 ] Tim Allison edited comment on PDFBOX-5401 at 3/25/22, 4:38 PM: --- I

[jira] [Comment Edited] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing

2022-03-25 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512397#comment-17512397 ] Tim Allison edited comment on PDFBOX-5401 at 3/25/22, 2:07 PM: --- Can

[jira] [Commented] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing

2022-03-25 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512397#comment-17512397 ] Tim Allison commented on PDFBOX-5401: - Can confirm behavior with the last 2.0.26-SNAPSHOT I used

Re: 2.0.26 release? WAS: JBIG2 3.0.4 release?

2022-03-22 Thread Tim Allison
that on the Tika side. On Mon, Mar 21, 2022 at 12:53 PM Andreas Lehmkuehler wrote: > > > On 21.03.22 at 12:21, Tim Allison wrote: > > I'm happy to run the tests today if that would be of any interest. > Yes, please. > > TIA > Andreas > > > > > > On Sun

[jira] [Commented] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set

2022-03-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509892#comment-17509892 ] Tim Allison commented on PDFBOX-5396: - This is not a problem in trunk. > Add maven enforcer r

[jira] [Updated] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set

2022-03-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5396: Description: I recently stubbed my toe on this one again. At least in the 2.x branch

[jira] [Created] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set

2022-03-21 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5396: --- Summary: Add maven enforcer rule to ensure that JAVA_HOME is set Key: PDFBOX-5396 URL: https://issues.apache.org/jira/browse/PDFBOX-5396 Project: PDFBox Issue

Re: 2.0.26 release? WAS: JBIG2 3.0.4 release?

2022-03-21 Thread Tim Allison
I'm happy to run the tests today if that would be of any interest. On Sun, Mar 20, 2022 at 5:01 PM Andreas Lehmkuehler wrote: > > On 13.03.22 at 14:20, Tim Allison wrote: > > From Tika's perspective, there's no rush. We're waiting for a bug fix > > in POI (TIKA-3699). >

Re: 2.0.26 release? WAS: JBIG2 3.0.4 release?

2022-03-13 Thread Tim Allison
troduced in 2018 with > 2.0.12. ;-) > > However, I'll have a look at the proposed solution. > > Andreas > > > > Tilman > > > > > >> > >> WDYT? > >> > >> Andreas > >> > >>> > >>> Tilman > >

2.0.26 release? WAS: JBIG2 3.0.4 release?

2022-03-09 Thread Tim Allison
All, I've been out of the office for a bit and haven't caught up yet. Apologies if I've missed the discussion. Are there plans for a 2.0.26 release? We're probably a few weeks out from starting our next 1.x and 2.x releases on Tika, and it would be great to incorporate 2.0.26. No problem at

[jira] [Created] (PDFBOX-5358) Add support for UTF-8 in strings

2022-01-06 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5358: --- Summary: Add support for UTF-8 in strings Key: PDFBOX-5358 URL: https://issues.apache.org/jira/browse/PDFBOX-5358 Project: PDFBox Issue Type: Improvement

Re: Tika, POI and PDFBOX used in Pandora Papers

2021-10-12 Thread Tim Allison
Autocorrect!!! Tika On Tue, Oct 12, 2021 at 4:42 PM Tim Allison wrote: > > https://www.wired.co.uk/article/pandora-papers-leak > > Repo: > https://github.com/ICIJ/datashare/

Tim’s, POI and PDFBOX used in Pandora Papers

2021-10-12 Thread Tim Allison
https://www.wired.co.uk/article/pandora-papers-leak Repo: https://github.com/ICIJ/datashare/

Interesting PDF on stackoverflow

2021-07-21 Thread Tim Allison
https://stackoverflow.com/questions/68402058/tika-isnt-reading-pdf-properly Not sure there's much we should do on the Tika side. How hard would it be to add an "extract only text that is on the page" feature? Best, Tim

Re: [VOTE] Release Apache PDFBox 2.0.24

2021-06-08 Thread Tim Allison
+1 Thank you! On Mon, Jun 7, 2021 at 12:52 PM Andreas Lehmkuehler wrote: > > Hi, > > a candidate for the PDFBox 2.0.24 release is available at: > > https://dist.apache.org/repos/dist/dev/pdfbox/2.0.24/ > > The release candidate is a zip archive of the sources in: > >

Re: 2.0.24 Release?

2021-06-03 Thread Tim Allison
, Tim On Mon, May 31, 2021 at 2:20 AM Andreas Lehmkuehler wrote: > > On 30.05.21 at 20:13, Tim Allison wrote: > > Will kick off tests on Tuesday, June 1 unless there are other text > > extraction changes planned. > Cool, I'm currently working on some 3.0 tickets so no

Re: 2.0.24 Release?

2021-05-30 Thread Tim Allison
Will kick off tests on Tuesday, June 1 unless there are other text extraction changes planned. On Sun, May 30, 2021 at 12:07 PM Andreas Lehmkuehler wrote: > I'm targeting the 7th or 8th of May. > > @Tim, @Tilman, is there any chance to run a 2.0.23 vs. 2.0.24 comparison > first? > > Andreas > >

Re: 2.0.24 Release?

2021-05-25 Thread Tim Allison
+1 :) On Tue, May 25, 2021 at 2:20 AM Andreas Lehmkuehler wrote: > Hi, > > how about cutting a 2.0.24 release in about 2 weeks from now? > > There is already an amount of solved tickets and our friends from Tika are > interested in a new version as well to cut a new release too including our >

[jira] [Commented] (PDFBOX-5164) Create portable collection PDF

2021-04-20 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17326042#comment-17326042 ] Tim Allison commented on PDFBOX-5164: - Thank you, [~tilman]! > Create portable collection

[jira] [Commented] (PDFBOX-5164) Create portable collection PDF

2021-04-20 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325972#comment-17325972 ] Tim Allison commented on PDFBOX-5164: - Sorry to hijack this, but I wanted to confirm with [~zxltmj

[jira] [Updated] (PDFBOX-5164) Create portable collection PDF

2021-04-20 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5164: Attachment: tika-output.json > Create portable collection

Re: PDFBox 3.0.0-SNAPSHOT reports

2021-04-17 Thread Tim Allison
Trust me... the doubt was on me, not you! :D On Sat, Apr 17, 2021 at 5:15 AM Andreas Lehmkuehler wrote: > On 16.04.21 at 01:15, Tim Allison wrote: > > Diffs look suspiciously small...I may have to rerun the analyses. > We simply did a good job! ;-) > > Andreas > > >

[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324082#comment-17324082 ] Tim Allison commented on PDFBOX-5166: - Ha @bitsgalore has an example of subtype=Screen. Yay

Re: PDFBox 3.0.0-SNAPSHOT reports

2021-04-16 Thread Tim Allison
There are a handful of files that "lose" attachments going into 3.0.0-SNAPSHOT because I haven't added the richmedia handling in our 3.0.0 branch. Best, Tim On Thu, Apr 15, 2021 at 7:15 PM Tim Allison wrote: > > Diffs look suspiciously small...I may have to rerun the anal
