Re: text extraction regression tests for 3.x?

2022-06-17 Thread Tim Allison
I wouldn't. :D
On Thu, Jun 16, 2022 at 12:16 PM Tilman Hausherr wrote:
> On 15.06.2022 at 12:19, Tim Allison wrote:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
>
> govdocs1/372/372582.pdf
> commoncrawl3/KH/KHDACXIPFMWP632LZ3S4TRRSZPDGHGM5
>

Re: text extraction regression tests for 3.x?

2022-06-16 Thread Tilman Hausherr
On 15.06.2022 at 12:19, Tim Allison wrote:
> Reports are here: https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
govdocs1/372/372582.pdf, commoncrawl3/KH/KHDACXIPFMWP632LZ3S4TRRSZPDGHGM5 and commoncrawl3/VN/VNCWMY6Y4C3XYWA65CQPPSNZSY6OQEEA have lost text. But the first one is a

Re: text extraction regression tests for 3.x?

2022-06-16 Thread Andreas Lehmkuehler
On 15.06.22 at 13:07, Tim Allison wrote:
> In "parse_time_millis_details.xlsx", there are some that took much longer in 3.x during the multithreaded run but do not show much of a difference singlethreaded...likely accidents of resources available at parse time. Overall, the sum of processing

Re: text extraction regression tests for 3.x?

2022-06-16 Thread Andreas Lehmkuehler
On 15.06.22 at 12:19, Tim Allison wrote:
> Reports are here: https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
@Tim thanks again. Looks like there aren't any new exceptions in 3.0.0 at all, ergo we are good to target a new release :-)
Andreas
> On Mon, Jun 13, 2022 at 4:54

Re: text extraction regression tests for 3.x?

2022-06-15 Thread Tim Allison
In "parse_time_millis_details.xlsx", there are some that took much longer in 3.x during the multithreaded run but do not show much of a difference singlethreaded...likely accidents of resources available at parse time. Overall, the sum of processing times across all files is very similar.

Re: text extraction regression tests for 3.x?

2022-06-15 Thread Tim Allison
I had a chance to look at new_catastrophic_exceptions_in_b, and the three files in there take roughly the same amount of time and resources. I think they failed on trunk only because of the whims of multithreading and available resources at the time. This file is admittedly quite large, but it

Re: text extraction regression tests for 3.x?

2022-06-15 Thread Tim Allison
Reports are here: https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
On Mon, Jun 13, 2022 at 4:54 PM Tim Allison wrote:
> Just seeing this now. Y. I'll kick off the tests tomorrow morning (ET).
>
> On Sat, Jun 11, 2022 at 8:09 AM Andreas Lehmkuehler wrote:
>> I've fixed

Re: text extraction regression tests for 3.x?

2022-06-13 Thread Tim Allison
Just seeing this now. Y. I'll kick off the tests tomorrow morning (ET).
On Sat, Jun 11, 2022 at 8:09 AM Andreas Lehmkuehler wrote:
> I've fixed PDFBOX-5452 and found/fixed another one, see PDFBOX-5456
>
> @Tim is there any chance to rerun the regression tests?
>
> Thanks in advance
> Andreas

Re: text extraction regression tests for 3.x?

2022-06-11 Thread Andreas Lehmkuehler
I've fixed PDFBOX-5452 and found/fixed another one, see PDFBOX-5456.
@Tim is there any chance to rerun the regression tests?
Thanks in advance
Andreas
On 07.06.22 at 08:06, Andreas Lehmkuehler wrote:
> I've found another regression, see PDFBOX-5452
> Andreas
> On 29.05.22 at 18:37, Andreas

Re: text extraction regression tests for 3.x?

2022-06-07 Thread Andreas Lehmkuehler
I've found another regression, see PDFBOX-5452
Andreas
On 29.05.22 at 18:37, Andreas Lehmkuehler wrote:
> Thanks Tim, looks like there are some regressions, see PDFBOX-5444 and PDFBOX-5447. Maybe there are more to come
> Andreas
> On 26.05.22 at 15:04, Tim Allison wrote:
>> Apologies for

Re: text extraction regression tests for 3.x?

2022-05-31 Thread Tim Allison
Good to find them now! Let me know when I should rerun and thank you!
Best, Tim
On Sun, May 29, 2022 at 12:37 PM Andreas Lehmkuehler wrote:
> Thanks Tim,
>
> looks like there are some regressions, see PDFBOX-5444 and PDFBOX-5447.
>
> Maybe there are more to come
>
> Andreas
>

Re: text extraction regression tests for 3.x?

2022-05-29 Thread Andreas Lehmkuehler
Thanks Tim, looks like there are some regressions, see PDFBOX-5444 and PDFBOX-5447. Maybe there are more to come
Andreas
On 26.05.22 at 15:04, Tim Allison wrote:
> Apologies for my delay. I ran trunk/3.x on May 12 against 2.0.26. The reports are here:

Re: text extraction regression tests for 3.x?

2022-05-26 Thread Tim Allison
Apologies for my delay. I ran trunk/3.x on May 12 against 2.0.26. The reports are here: https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz
Happy to rerun with a more recent version of trunk.
Cheers, Tim
On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler wrote:

Re: text extraction regression tests for 3.x?

2022-05-08 Thread Andreas Lehmkuehler
On 06.05.22 at 14:30, Tim Allison wrote:
> All, Let me know when it makes sense to run the text extraction regression
Yes, it'd be useful to have some updated results. How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs. 3.0.0-alpha3?
> tests for 3.x. I regret I haven't been

text extraction regression tests for 3.x?

2022-05-06 Thread Tim Allison
All, Let me know when it makes sense to run the text extraction regression tests for 3.x. I regret I haven't been following our mailing list as closely as I should be. Cheers, Tim