W00t! Thank you, Tilman!
And thank you @Maruan Sahyoun for hosting the
regression server!!!
On Wed, Mar 27, 2024 at 3:48 PM Tilman Hausherr
wrote:
> During the tika regression tests it turned out that there is a longer
> PDF list, so I ran the tests again with that longer list. The good news
Not a showstopper for me.
+1
Thank you!
On Fri, Mar 22, 2024 at 1:59 AM Maruan Sahyoun
wrote:
> IMHO this is not a show stopper
>
>
> > On 22.03.2024 at 06:54, Andreas Lehmkühler wrote:
> >
> >
> >
> >> On 21.03.24 at 20:07, Tim Alliso
In the parent pom.xml in the zip file, there's a "release" submodule
specified. However, there's no release directory in the src zip that would
match: https://svn.apache.org/repos/asf/pdfbox/tags/2.0.31/release/
Is that expected?
On Thu, Mar 21, 2024 at 1:53 PM Andreas Lehmkühler
wrote:
> Hi,
+1
Thank you!
On Tue, Nov 28, 2023 at 5:49 AM Timo Boehme
wrote:
> +1,
>
> Thanks,
> Timo
>
>
> On 27.11.23 at 17:46, Andreas Lehmkühler wrote:
> > Hi,
> >
> > a candidate for the PDFBox 3.0.1 release is available at:
> >
> > https://dist.apache.org/repos/dist/dev/pdfbox/3.0.1/
> >
> >
Doh. I didn't. Thank you, @Tilman!
On Wed, Nov 22, 2023 at 2:08 AM Andreas Lehmkühler
wrote:
> Hi,
>
> after fixing the latest regressions I'd like to cut the 3.0.1 release
> next Monday/Tuesday.
>
> WDYT?
>
> @Tim, @Tilman do you have the time to run the extraction tests 3.0.0 vs
> 3.0.1 ?
>
>
[
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766567#comment-17766567
]
Tim Allison commented on PDFBOX-5682:
-
Wow. Thank you!
> Long/permanent hang in PDFBox
[
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764228#comment-17764228
]
Tim Allison edited comment on PDFBOX-5682 at 9/12/23 2:41 PM
[
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764225#comment-17764225
]
Tim Allison edited comment on PDFBOX-5682 at 9/12/23 2:41 PM:
--
Thank you
[
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764228#comment-17764228
]
Tim Allison commented on PDFBOX-5682:
-
This is the part from that document that is, erm, eye-opening
[
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764225#comment-17764225
]
Tim Allison commented on PDFBOX-5682:
-
Thank you, [~lehmi]. In Tika, we initially copied PDFBox's
[
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763903#comment-17763903
]
Tim Allison commented on PDFBOX-5682:
-
Both files spend quite a bit of time
[
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763904#comment-17763904
]
Tim Allison commented on PDFBOX-5682:
-
It looks like that causes a full parse of the file?
> L
[
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5682:
Summary: Long/permanent hang in PDFBox 3.x (was: Long/permanent hang i n
PDFBox 3.x)
> L
Tim Allison created PDFBOX-5682:
---
Summary: Long/permanent hang i n PDFBox 3.x
Key: PDFBOX-5682
URL: https://issues.apache.org/jira/browse/PDFBOX-5682
Project: PDFBox
Issue Type: Bug
[
https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763759#comment-17763759
]
Tim Allison commented on PDFBOX-5681:
-
When I run the demo code in PDFBox trunk with logging on, I
[
https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763754#comment-17763754
]
Tim Allison commented on PDFBOX-5681:
-
I initially thought this was a threading issue, but it isn't
Tim Allison created PDFBOX-5681:
---
Summary: ConcurrentModificationException in getObjectsByType() in
3.x
Key: PDFBOX-5681
URL: https://issues.apache.org/jira/browse/PDFBOX-5681
Project: PDFBox
[
https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5681:
Affects Version/s: 3.0.0 PDFBox
> ConcurrentModificationException in getObjectsByType() in
[
https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5681:
Description:
[~tilman]'s regression testing turned up this exception when we integrate
PDFBox
[
https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5681:
Issue Type: Bug (was: Task)
> ConcurrentModificationException in getObjectsByType() in
https://aws.amazon.com/blogs/machine-learning/retain-original-pdf-formatting-to-view-translated-documents-with-amazon-textract-amazon-translate-and-pdfbox/
.writeText(PDFTextStripper.java:238)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
On Wed, May 31, 2023 at 1:41 PM Tilman Hausherr
wrote:
> Yes please
>
> Thanks
>
> Tilman
>
> On 31.05.2023 17:15, Tim Allison wrote:
> > +1
> >
> > Let me know when/if
+1
Let me know when/if I should run the text extraction regression tests.
On Thu, May 25, 2023 at 12:32 PM sahy...@fileaffairs.de <
sahy...@fileaffairs.de> wrote:
> +1
>
> Maruan
>
> On Wednesday, 24.05.2023 at 07:48 +0200, Andreas Lehmkuehler wrote:
> > Hi,
> >
> > I tend to release 2.0.29
8 million PDFs/8TB from a month of Common Crawl. We refetched ~2
million truncated files.
Zips of PDFs are available here:
https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/
Peter Wyatt (PDF Association)'s writeup is here:
[
https://issues.apache.org/jira/browse/PDFBOX-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5595:
Description:
I'm not sure this is a regression, and apologies if you already dealt
Tim Allison created PDFBOX-5595:
---
Summary: Slight regression on corrupt bug tracker file
Key: PDFBOX-5595
URL: https://issues.apache.org/jira/browse/PDFBOX-5595
Project: PDFBox
Issue Type
+1
Thank you!
On Mon, Apr 10, 2023 at 12:08 PM sahy...@fileaffairs.de
wrote:
>
> +1
>
> Maruan
>
> On Monday, 10.04.2023 at 12:15 +0200, Andreas Lehmkuehler wrote:
> > Hi,
> >
> > a candidate for the PDFBox 2.0.28 release is available at:
> >
> >
on. At least the exception
> is
> gone, maybe there is some more content or just an empty page.
>
> However, IMHO that isn't a regression, but an (small) improvement.
>
> @Tim Thanks for running the tests
>
> Andreas
>
>
> On 10.04.23 at 12:54, Tim Allison wrote:
>
044)
On Mon, Apr 10, 2023 at 6:41 AM Tim Allison wrote:
>
> Y. Will start process now. Thank you!
>
> On Mon, Apr 10, 2023 at 6:20 AM Andreas Lehmkuehler wrote:
> >
> > Hi,
> >
> > I've finished the release process and provided a release candidate for
> >
> >>>>> pdf well.
> >>>>>
> >>>>> IMHO we should leave it alone, as it is malformed and doesn't contain
> >>>>> any
> >>>>> useful content. More important, it is one pdf out of hundreds of
> >
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.27-v-2.0.28-20230403-reports.tgz
Haven't had a chance to take a look yet. :(
On Mon, Apr 3, 2023 at 6:53 AM Tilman Hausherr wrote:
>
> Don't wait please
> Thanks
> Tilman
>
>
>
> --- Original Message ---
> From
Lehmkuehler:
> >>
> >> I've accidentally sent this to Tim only :-|
> >>
> >> Forwarded Message
> >> Subject: Re: 2.0.28 release?
> >> Date: Fri, 31 Mar 2023 07:50:10 +0200
> >> From: Andreas Lehmkuehler
> >
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.27-v-2.0.28-SNAPSHOT.tgz
On Tue, Mar 28, 2023 at 10:42 PM Tilman Hausherr wrote:
>
> Yes please!
>
> Thanks
>
> Tilman
>
> On 28.03.2023 19:22, Tim Allison wrote:
> > +1
> >
>
+1
Should I run the regression tests now, or is there anything else
text-related that is still being worked on?
On Tue, Mar 28, 2023 at 1:05 PM Tilman Hausherr wrote:
>
> +1
>
> Tilman
>
> On 28.03.2023 08:46, Andreas Lehmkuehler wrote:
> > Hi,
> >
> > how about cutting a 2.0.28 release next
[
https://issues.apache.org/jira/browse/PDFBOX-5550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5550:
Summary: reduce number of open files (was: redcuce number of open files)
> reduce number of o
[
https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635337#comment-17635337
]
Tim Allison commented on PDFBOX-5540:
-
Should I kick that off now?
> export:text creates jibber
+1 clean build with jdk 8 on macOS Monterey. shasum checks out, and
this release candidate works with Tika's unit tests.
Thank you!
Cheers,
Tim
P.S. Should I open an issue to remove the printlns in AbstractSchemaTester?
On Mon, Sep 26, 2022 at 11:29 AM Andreas Lehmkuehler wrote:
>
> a
>PS thanks for doing the test!
Thank _you_ for the confirmation! Onwards!
-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
To confirm: did we expect more diffs, or did I botch the runs and
compare the same version? There are a few _diffs_ at least.
On Tue, Sep 20, 2022 at 8:41 AM Tim Allison wrote:
>
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.27-pre-rc.tgz
>
> L
> Andreas
>
> On 19.09.22 at 21:09, Tim Allison wrote:
> > I should have time tomorrow/Wednesday. Thank you!
> >
> > On Mon, Sep 19, 2022 at 2:30 PM Tilman Hausherr
> > wrote:
> >>
> >> On 19.09.2022 08:22, Andreas Lehmkuehler wrote:
> >
I should have time tomorrow/Wednesday. Thank you!
On Mon, Sep 19, 2022 at 2:30 PM Tilman Hausherr wrote:
>
> On 19.09.2022 08:22, Andreas Lehmkuehler wrote:
> > 1.8.17 is out the door and I guess it is time for a 2.0.27 release.
> >
> > @Tim or @Tilman
> > Is there any chance to run the
[
https://issues.apache.org/jira/browse/PDFBOX-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602789#comment-17602789
]
Tim Allison commented on PDFBOX-5501:
-
Thank you!
> Jempbox is slow on xmp with large ev
[
https://issues.apache.org/jira/browse/PDFBOX-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved PDFBOX-5501.
-
Resolution: Not A Problem
Y. I just also confirmed that this is fixed in 1.8.17-SNAPSHOT
Tim Allison created PDFBOX-5501:
---
Summary: Jempbox is slow on xmp with large event histories
Key: PDFBOX-5501
URL: https://issues.apache.org/jira/browse/PDFBOX-5501
Project: PDFBox
Issue Type
All,
This issue is ringing a bell. I'm sorry if there's an open issue or
you/we've decided long ago that this is not an issue.
One of the timeouts in the most recent run was caused by Jempbox's
handling of the history in the media management schema. There are 32000
elements in the history.
[
https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578904#comment-17578904
]
Tim Allison commented on PDFBOX-5490:
-
Y. Completely understand. I don't want to impede 3.0.0
[
https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578510#comment-17578510
]
Tim Allison commented on PDFBOX-5490:
-
My initial request would be for whether or not the xref table
[
https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578129#comment-17578129
]
Tim Allison commented on PDFBOX-5490:
-
Oh, that looks great.
> Add reconstruction informat
[
https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578055#comment-17578055
]
Tim Allison commented on PDFBOX-5490:
-
A Listener would be great. Any mechanism that would allow
[
https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5490:
Component/s: Parsing
> Add reconstruction information to the PDDocum
Tim Allison created PDFBOX-5490:
---
Summary: Add reconstruction information to the PDDocument
Key: PDFBOX-5490
URL: https://issues.apache.org/jira/browse/PDFBOX-5490
Project: PDFBox
Issue Type
I wouldn't. :D
On Thu, Jun 16, 2022 at 12:16 PM Tilman Hausherr
wrote:
> On 15.06.2022 at 12:19, Tim Allison wrote:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
>
> govdocs1/372/372582.pdf
> commoncrawl3/KH/KHDACXIP
fetched/OL/OLZ5TAS53B4BDC673OFMWZE5DDZ7ZGIN
On Wed, Jun 15, 2022 at 6:49 AM Tim Allison wrote:
> I had a chance to look at new_catastrophic_exceptions_in_b, and the three
> files in there take roughly the same amount of time and resources. I think
> they failed on trunk only b
AM Tim Allison wrote:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
>
> On Mon, Jun 13, 2022 at 4:54 PM Tim Allison wrote:
>
>> Just seeing this now. Y. I'll kick off the tests tomorrow morning (ET).
>>
>> On Sat
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
On Mon, Jun 13, 2022 at 4:54 PM Tim Allison wrote:
> Just seeing this now. Y. I'll kick off the tests tomorrow morning (ET).
>
> On Sat, Jun 11, 2022 at 8:09 AM Andreas Lehmkuehler
> wrote:
>
like there are some regressions, see PDFBOX-5444 and PDFBOX-5447.
> >>
> >> Maybe there are more to come
> >>
> >> Andreas
> >>
> >>
> >> On 26.05.22 at 15:04, Tim Allison wrote:
> >>> Apologies for my delay. I ran trunk/3.x
...
>
> Andreas
>
>
> On 26.05.22 at 15:04, Tim Allison wrote:
> > Apologies for my delay. I ran trunk/3.x on May 12 against 2.0.26. The
> > reports are here:
> >
> https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz
> >
> > Hap
:
> On 06.05.22 at 14:30, Tim Allison wrote:
> > All,
> > Let me know when it makes sense to run the text extraction regression
> Yes, it'd be useful to have some updated results.
>
> How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs.
> 3.0.0-alpha3
[
https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5431:
Description:
I noticed a new NPE in one of our test files on Tika when I recently built
PDFBox's
[
https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5431:
Component/s: XmpBox
> New NPE in xmpbox parser in tr
[
https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5431:
Affects Version/s: 3.0.0 PDFBox
> New NPE in xmpbox parser in tr
Tim Allison created PDFBOX-5431:
---
Summary: New NPE in xmpbox parser in trunk
Key: PDFBOX-5431
URL: https://issues.apache.org/jira/browse/PDFBOX-5431
Project: PDFBox
Issue Type: Task
All,
Let me know when it makes sense to run the text extraction regression
tests for 3.x. I regret I haven't been following our mailing list as
closely as I should be.
Cheers,
Tim
+1
Thank you!
On Mon, Apr 18, 2022 at 10:50 AM sahy...@fileaffairs.de
wrote:
>
> +1
> Maruan
>
> On Monday, 18.04.2022 at 13:14 +0200, Andreas Lehmkuehler wrote:
> > Hi,
> >
> > a candidate for the PDFBox 2.0.26 release is available at:
> >
> >
[
https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522531#comment-17522531
]
Tim Allison commented on PDFBOX-5415:
-
An answer on the Tika side. Yes, parsing is dangerous
[
https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521382#comment-17521382
]
Tim Allison commented on PDFBOX-5415:
-
Michael Demey's diagnosis:
https://twitter.com/MyMilkedEek
[
https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5415:
Affects Version/s: 2.0.26
> Infinite loop in ExtractText in 2.x branch on a specific
[
https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5415:
Component/s: Parsing
> Infinite loop in ExtractText in 2.x branch on a specific
Tim Allison created PDFBOX-5415:
---
Summary: Infinite loop in ExtractText in 2.x branch on a specific
pdf
Key: PDFBOX-5415
URL: https://issues.apache.org/jira/browse/PDFBOX-5415
Project: PDFBox
Yes. Sorry. That's my fault. I did something stupid in 2.3.0 and then
fixed it in 2.4.0-SNAPSHOT (TIKA-3711).
On Mon, Apr 11, 2022 at 1:35 PM Tilman Hausherr
wrote:
> On 11.04.2022 at 12:40, Tim Allison wrote:
>
> https://corpora.tika.apache.org/base/reports/tika-2.4-20220410.tgz
>
https://corpora.tika.apache.org/base/reports/tika-2.4-20220410.tgz
Haven't had a chance to review. Hot off the vm.
On Sun, Apr 10, 2022 at 9:58 AM Tim Allison wrote:
>
> Will try to kick off today…first thing Monday morning (EDT) at the latest.
>
> On Sun, Apr 10, 2022 at 9:0
Will try to kick off today…first thing Monday morning (EDT) at the latest.
On Sun, Apr 10, 2022 at 9:05 AM Andreas Lehmkuehler
wrote:
> On 09.04.22 at 19:00, Tilman Hausherr wrote:
> > testFlattenPDFBOX2469Filled also fails in 2.0 (it is disabled by
> default).
> I've fixed all new tickets.
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.26-snapshot-reports.tgz
I haven't had a chance to look at them yet.
On Thu, Apr 7, 2022 at 9:07 AM Andreas Lehmkühler wrote:
>
> Yes, please
>
> Thanks in advance
> Andreas
>
> 07.04.2022 11:44:38 Tim Allison :
>
>
[
https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved PDFBOX-5396.
-
Fix Version/s: 2.0.26
Resolution: Fixed
> Add maven enforcer rule to ens
Sounds great! Should I rerun the regression tests today?
On Thu, Apr 7, 2022 at 1:41 AM Andreas Lehmkuehler wrote:
> Hi,
>
> sorry for the delay. I'm planning to cut the 2.0.26 release next
> Saturday, the
> day after tomorrow, if nobody objects.
>
> Andreas
>
> P.S.: I'm targeting a new 3.0.0
[
https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512474#comment-17512474
]
Tim Allison commented on PDFBOX-5401:
-
bq. Hi, I didn't test these samples on PDFBOX 2.0
Sorry, my
[
https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512397#comment-17512397
]
Tim Allison edited comment on PDFBOX-5401 at 3/25/22, 4:38 PM:
---
I
[
https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512397#comment-17512397
]
Tim Allison edited comment on PDFBOX-5401 at 3/25/22, 2:07 PM:
---
Can
[
https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512397#comment-17512397
]
Tim Allison commented on PDFBOX-5401:
-
Can confirm behavior with the last 2.0.26-SNAPSHOT I used
that on the Tika side.
On Mon, Mar 21, 2022 at 12:53 PM Andreas Lehmkuehler wrote:
>
>
> On 21.03.22 at 12:21, Tim Allison wrote:
> > I'm happy to run the tests today if that would be of any interest.
> Yes, please.
>
> TIA
> Andreas
>
>
> >
> > On Sun
[
https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509892#comment-17509892
]
Tim Allison commented on PDFBOX-5396:
-
This is not a problem in trunk.
> Add maven enforcer r
[
https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5396:
Description:
I recently stubbed my toe on this one again. At least in the 2.x branch
Tim Allison created PDFBOX-5396:
---
Summary: Add maven enforcer rule to ensure that JAVA_HOME is set
Key: PDFBOX-5396
URL: https://issues.apache.org/jira/browse/PDFBOX-5396
Project: PDFBox
Issue
I'm happy to run the tests today if that would be of any interest.
On Sun, Mar 20, 2022 at 5:01 PM Andreas Lehmkuehler wrote:
>
> On 13.03.22 at 14:20, Tim Allison wrote:
> > From Tika's perspective, there's no rush. We're waiting for a bug fix
> > in POI (TIKA-3699).
>
troduced in 2018 with
> 2.0.12. ;-)
>
> However, I'll have a look at the proposed solution.
>
> Andreas
> >
> > Tilman
> >
> >
> >>
> >> WDYT?
> >>
> >> Andreas
> >>
> >>>
> >>> Tilman
> >
All,
I've been out of the office for a bit and haven't caught up yet.
Apologies if I've missed the discussion.
Are there plans for a 2.0.26 release? We're probably a few weeks out
from starting our next 1.x and 2.x releases on Tika, and it would be
great to incorporate 2.0.26. No problem at
Tim Allison created PDFBOX-5358:
---
Summary: Add support for UTF-8 in strings
Key: PDFBOX-5358
URL: https://issues.apache.org/jira/browse/PDFBOX-5358
Project: PDFBox
Issue Type: Improvement
Autocorrect!!! Tika
On Tue, Oct 12, 2021 at 4:42 PM Tim Allison wrote:
>
> https://www.wired.co.uk/article/pandora-papers-leak
>
> Repo:
> https://github.com/ICIJ/datashare/
https://www.wired.co.uk/article/pandora-papers-leak
Repo:
https://github.com/ICIJ/datashare/
https://stackoverflow.com/questions/68402058/tika-isnt-reading-pdf-properly
Not sure there's much we should do on the Tika side.
How hard would it be to add an "extract only text that is on the page" feature?
Best,
Tim
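
A minimal sketch of what an "only text that is on the page" filter could mean, assuming it comes down to dropping text whose position falls outside the visible page box. This is not PDFBox or Tika API: `OnPageTextFilter`, `Glyph` and `onPage` are hypothetical names used only for illustration, and a real implementation would also have to consider clipping paths, rotation and glyph extents.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class OnPageTextFilter {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox class. */
    public static final class Glyph {
        final String text;
        final double x;
        final double y;

        public Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates, in input order, only the glyphs whose origin lies inside pageBox. */
    public static String onPage(List<Glyph> glyphs, Rectangle2D pageBox) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (pageBox.contains(g.x, g.y)) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed page box: US Letter in points, origin at the lower left.
        Rectangle2D letter = new Rectangle2D.Double(0, 0, 612, 792);
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("visible", 72, 700));
        glyphs.add(new Glyph("hidden", -500, 700)); // placed left of the page
        glyphs.add(new Glyph("!", 150, 700));
        System.out.println(onPage(glyphs, letter));
    }
}
```

Testing only the glyph origin keeps the sketch short; text that straddles the page edge would need a bounding-box intersection test instead.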
+1
Thank you!
On Mon, Jun 7, 2021 at 12:52 PM Andreas Lehmkuehler wrote:
>
> Hi,
>
> a candidate for the PDFBox 2.0.24 release is available at:
>
> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.24/
>
> The release candidate is a zip archive of the sources in:
>
>
,
Tim
On Mon, May 31, 2021 at 2:20 AM Andreas Lehmkuehler wrote:
>
> On 30.05.21 at 20:13, Tim Allison wrote:
> > Will kick off tests on Tuesday, June 1 unless there are other text
> > extraction changes planned.
> Cool, I'm currently working on some 3.0 tickets so no
Will kick off tests on Tuesday, June 1 unless there are other text
extraction changes planned.
On Sun, May 30, 2021 at 12:07 PM Andreas Lehmkuehler
wrote:
> I'm targeting the 7th or 8th of June.
>
> @Tim, @Tilman, is there any chance to run a 2.0.23 vs. 2.0.24 comparison
> first?
>
> Andreas
>
>
+1 :)
On Tue, May 25, 2021 at 2:20 AM Andreas Lehmkuehler
wrote:
> Hi,
>
> how about cutting a 2.0.24 release in about 2 weeks from now?
>
> There are already a number of solved tickets, and our friends from Tika are
> interested in a new version as well, so they can cut a new release including our
>
[
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17326042#comment-17326042
]
Tim Allison commented on PDFBOX-5164:
-
Thank you, [~tilman]!
> Create portable collection
[
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325972#comment-17325972
]
Tim Allison commented on PDFBOX-5164:
-
Sorry to hijack this, but I wanted to confirm with [~zxltmj
[
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-5164:
Attachment: tika-output.json
> Create portable collection
Trust me... the doubt was on me, not you! :D
On Sat, Apr 17, 2021 at 5:15 AM Andreas Lehmkuehler
wrote:
> On 16.04.21 at 01:15, Tim Allison wrote:
> > Diffs look suspiciously small...I may have to rerun the analyses.
> We simply did a good job! ;-)
>
> Andreas
>
> >
[
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324082#comment-17324082
]
Tim Allison commented on PDFBOX-5166:
-
Ha @bitsgalore has an example of subtype=Screen. Yay
There are a handful of files that "lose" attachments going into
3.0.0-SNAPSHOT because I haven't added the richmedia handling in our
3.0.0 branch.
Best,
Tim
On Thu, Apr 15, 2021 at 7:15 PM Tim Allison wrote:
>
> Diffs look suspiciously small...I may have to rerun the anal