Re: Release schedule for 2.x and 3.x?
I'm going to rerun the eval on 2.9.3-rc1 after I cherry picked the csv fixes today.

On Mon, Jan 27, 2025 at 1:25 PM Nicholas DiPiazza wrote:
> that's wonderful. thanks for that.
> i'm concentrating on finishing up tika-pipes so i can get the removal PR
> started.
> getting very close - maybe set up a zoom sometime to chat
Re: Release schedule for 2.x and 3.x?
that's wonderful. thanks for that.
i'm concentrating on finishing up tika-pipes so i can get the removal PR started.
getting very close - maybe set up a zoom sometime to chat

On Mon, Jan 27, 2025 at 9:50 AM Tim Allison wrote:
> I'm kicking off the regression tests for 3.x.
>
> Nicholas, I merged TIKA-4303 and cherry-picked it back to 3.x. I hope
> that's ok.
Re: Release schedule for 2.x and 3.x?
I'm kicking off the regression tests for 3.x.

Nicholas, I merged TIKA-4303 and cherry-picked it back to 3.x. I hope that's ok.
Re: Release schedule for 2.x and 3.x?
This is very helpful. Thank you, Tilman!

On Fri, Jan 24, 2025 at 2:25 AM Tilman Hausherr wrote:
> Hi,
>
> No opinion re release schedule but a comment on the PDFBox update:
>
> tl;dr: ignore the PDF differences this time.
>
> The new version includes the /ActualText support:
> https://issues.apache.org/jira/browse/PDFBOX-5868
>
> It is always enabled. In most cases the extraction is better. But
> sometimes content is lost because the feature is used for obfuscation
> (see example in the issue above).
>
> Another major change is the detection of the space width:
> https://issues.apache.org/jira/browse/PDFBOX-5920
> It has been improved, however this will result in many differences with
> angled texts if angle detection isn't enabled. Some scientific texts
> with superscript prefix will also look different, "1 Coupled" will
> extract as "1Coupled". This is because these fonts don't have a space
> and the fallback we are using sucks.
>
> Tilman
Re: Release schedule for 2.x and 3.x?
Hi,

No opinion re release schedule but a comment on the PDFBox update:

tl;dr: ignore the PDF differences this time.

The new version includes the /ActualText support:
https://issues.apache.org/jira/browse/PDFBOX-5868

It is always enabled. In most cases the extraction is better. But sometimes content is lost because the feature is used for obfuscation (see example in the issue above).

Another major change is the detection of the space width:
https://issues.apache.org/jira/browse/PDFBOX-5920
It has been improved, however this will result in many differences with angled texts if angle detection isn't enabled. Some scientific texts with superscript prefix will also look different, "1 Coupled" will extract as "1Coupled". This is because these fonts don't have a space and the fallback we are using sucks.

Tilman

On 16.01.2025 14:20, Tim Allison wrote:
> Sorry, on second thought, a small tweak:
>
> I propose that we release 3.1.0 after PDFBox 3.x is released. I further
> propose that we make a 2.9.3 release at some point after the 3.1.0 release
> IF we get requests for a 2.x release...otherwise we'll do a final 2.x EOL
> release in April, 2025.
Re: Release schedule for 2.x and 3.x?
Sorry, on second thought, a small tweak:

I propose that we release 3.1.0 after PDFBox 3.x is released. I further propose that we make a 2.9.3 release at some point after the 3.1.0 release IF we get requests for a 2.x release...otherwise we'll do a final 2.x EOL release in April, 2025.

On Thu, Jan 16, 2025 at 8:15 AM Tim Allison wrote:
> All,
> It has been a while since we last released 2.x (April 2024) and 3.x
> (October 2024). We've had a number of dependency updates. PDFBox is on the
> cusp of a new 3.x release.
> I propose that we release 3.1.0 after PDFBox 3.x is released and that we
> make a 2.9.3 release the following week.
> WDYT?
>
> Best,
>
> Tim