Re: Release schedule for 2.x and 3.x?

2025-02-03 Thread Tim Allison
I'm going to rerun the eval on 2.9.3-rc1 after I cherry-picked the CSV
fixes today.



2025-01-27 Thread Nicholas DiPiazza
That's wonderful. Thanks for that.
I'm concentrating on finishing up tika-pipes so I can get the removal PR
started.
Getting very close. Maybe we can set up a Zoom sometime to chat?



2025-01-27 Thread Tim Allison
I'm kicking off the regression tests for 3.x.

Nicholas, I merged TIKA-4303 and cherry-picked it back to 3.x. I hope
that's ok.



2025-01-24 Thread Tim Allison
This is very helpful. Thank you, Tilman!



2025-01-23 Thread Tilman Hausherr

Hi,

No opinion re release schedule but a comment on the PDFBox update:

tl;dr: ignore the PDF differences this time.

The new version includes /ActualText support:
https://issues.apache.org/jira/browse/PDFBOX-5868

It is always enabled. In most cases the extraction is better. But 
sometimes content is lost because the feature is used for obfuscation 
(see example in the issue above).


Another major change is the detection of the space width:
https://issues.apache.org/jira/browse/PDFBOX-5920
It has been improved; however, this will result in many differences with
angled texts if angle detection isn't enabled. Some scientific texts
with a superscript prefix will also look different: "1 Coupled" will
extract as "1Coupled". This is because these fonts don't have a space
and the fallback we are using sucks.


Tilman
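
A minimal sketch, not part of PDFBox, Tika, or tika-eval: when reviewing
regression output after the PDFBox upgrade, one way to separate spacing-only
changes (such as "1 Coupled" becoming "1Coupled") from real content changes
(such as text lost via /ActualText) is to compare the old and new extractions
with all whitespace removed. The function name classify_diff and its three
labels are hypothetical, chosen only for illustration.

```python
import re


def classify_diff(old: str, new: str) -> str:
    """Classify how two text extractions of the same PDF differ.

    Hypothetical helper: returns "identical", "whitespace-only", or "content".
    """
    if old == new:
        return "identical"

    # Drop every whitespace character so spacing changes compare as equal.
    def strip_ws(text: str) -> str:
        return re.sub(r"\s+", "", text)

    if strip_ws(old) == strip_ws(new):
        return "whitespace-only"
    # Anything else means characters were added, removed, or replaced.
    return "content"


if __name__ == "__main__":
    print(classify_diff("1 Coupled", "1Coupled"))
```

Differences labeled "whitespace-only" are likely candidates for the space
width change, while "content" differences still need a manual look.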



2025-01-16 Thread Tim Allison
Sorry, on second thought, a small tweak:

I propose that we release 3.1.0 after PDFBox 3.x is released. I further
propose that we make a 2.9.3 release at some point after the 3.1.0 release
IF we get requests for a 2.x release...otherwise we'll do a final 2.x EOL
release in April 2025.

On Thu, Jan 16, 2025 at 8:15 AM Tim Allison wrote:

> All,
>   It has been a while since we last released 2.x (April 2024) and 3.x
> (October 2024). We've had a number of dependency updates. PDFBox is on the
> cusp of a new 3.x release.
>   I propose that we release 3.1.0 after PDFBox 3.x is released and that we
> make a 2.9.3 release the following week.
>   WDYT?
>
> Best,
>
>  Tim
>