[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-24 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849298#comment-17849298 ] Tim Allison commented on TIKA-4260: --- That PR currently only works on tika-core. More needs to be done

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-24 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849288#comment-17849288 ] Tim Allison commented on TIKA-4243: --- [~ndipiazza], I added parseContext to fetchers and emitters

[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-05-24 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849103#comment-17849103 ] Tim Allison edited comment on TIKA-4243 at 5/24/24 1:00 PM: Proposed basic

[jira] [Created] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-05-23 Thread Tim Allison (Jira)
Tim Allison created TIKA-4260: - Summary: Add parse context to the fetcher interface in 3.x Key: TIKA-4260 URL: https://issues.apache.org/jira/browse/TIKA-4260 Project: Tika Issue Type: Task

[jira] [Created] (TIKA-4259) Decouple xml parser stuff from ParseContext

2024-05-23 Thread Tim Allison (Jira)
Tim Allison created TIKA-4259: - Summary: Decouple xml parser stuff from ParseContext Key: TIKA-4259 URL: https://issues.apache.org/jira/browse/TIKA-4259 Project: Tika Issue Type: Task

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849114#comment-17849114 ] Tim Allison commented on TIKA-4243: --- I'm going to start working on PRs that will be generally helpful

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849108#comment-17849108 ] Tim Allison commented on TIKA-4243: --- The downsides we see: a) if there's agreement to add jackson

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849103#comment-17849103 ] Tim Allison commented on TIKA-4243: --- Proposed basic roadmap: Serialize ParseContext as is... Allow

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-23 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849101#comment-17849101 ] Tim Allison commented on TIKA-4243: --- Fellow devs, in chatting with Nicholas, we're thinking

[jira] [Resolved] (TIKA-4258) Multi-arch support for docker images

2024-05-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4258. --- Resolution: Fixed Just pushed 2.9.2.1/*-latest Thank you, all! > Multi-arch support for doc

multi-arch support for tika-docker!

2024-05-21 Thread Tim Allison
All, Many thanks to the many community members who helped figure this out and get it out the door! As of tika-docker 2.9.2.1, we now have multi-arch support (and on noble!). Let us know if there are any surprises. Thank you, again! Cheers, Tim Ref:

[jira] [Commented] (TIKA-4255) TextAndCSVParser ignores Metadata.CONTENT_ENCODING

2024-05-20 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847980#comment-17847980 ] Tim Allison commented on TIKA-4255: --- Thank you for opening this PR. Are you able to add a small unit

[jira] [Resolved] (TIKA-4256) Allow inlining of ocr'd text in container document

2024-05-20 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4256. --- Fix Version/s: 3.0.0 Resolution: Fixed > Allow inlining of ocr'd text in container docum

[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847950#comment-17847950 ] Tim Allison commented on TIKA-4258: --- I'm sure I'll need to modify the PR when I actually go to run

[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847949#comment-17847949 ] Tim Allison commented on TIKA-4258: --- Let's give it a day for fellow devs to weigh

[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847943#comment-17847943 ] Tim Allison commented on TIKA-4258: --- And here's the full version: https://hub.docker.com/layers/apache

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-20 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847931#comment-17847931 ] Tim Allison commented on TIKA-4243: --- Separately, but related to this and also to TIKA-4252 -- should we

[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847883#comment-17847883 ] Tim Allison commented on TIKA-4258: --- Helpful links from #infra: https://infra.apache.org/docker-hub

[jira] [Commented] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847882#comment-17847882 ] Tim Allison commented on TIKA-4258: --- If fellow devs with better knowledge of github actions and docker

[jira] [Created] (TIKA-4258) Multi-arch support for docker images

2024-05-20 Thread Tim Allison (Jira)
Tim Allison created TIKA-4258: - Summary: Multi-arch support for docker images Key: TIKA-4258 URL: https://issues.apache.org/jira/browse/TIKA-4258 Project: Tika Issue Type: Task

[jira] [Updated] (TIKA-4256) Allow inlining of ocr'd text in container document

2024-05-16 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4256: -- Description: For legacy tika, we're inlining all content from embedded files including ocr content

[jira] [Updated] (TIKA-4256) Allow inlining of ocr'd text in container document

2024-05-16 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4256: -- Description: For legacy tika, we're inlining all content from embedded files including ocr content

[jira] [Created] (TIKA-4256) Allow inlining of ocr'd text in container document

2024-05-16 Thread Tim Allison (Jira)
Tim Allison created TIKA-4256: - Summary: Allow inlining of ocr'd text in container document Key: TIKA-4256 URL: https://issues.apache.org/jira/browse/TIKA-4256 Project: Tika Issue Type: Task

[jira] [Commented] (TIKA-4137) Building current Tika main branch fails under Java 20/21

2024-05-15 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846697#comment-17846697 ] Tim Allison commented on TIKA-4137: --- Y, done just now. > Building current Tika main branch fails un

[jira] [Updated] (TIKA-4137) Building current Tika main branch fails under Java 20/21

2024-05-15 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4137: -- Fix Version/s: 2.9.3 > Building current Tika main branch fails under Java 20

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845081#comment-17845081 ] Tim Allison commented on TIKA-4252: --- fetchRequestMetadata, fetchResponseMetadata? > PipesClient#proc

[jira] [Comment Edited] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845072#comment-17845072 ] Tim Allison edited comment on TIKA-4252 at 5/9/24 5:14 PM: --- fetcher.fetch(String

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845072#comment-17845072 ] Tim Allison commented on TIKA-4252: --- fetcher.fetch(String key, Metadata writeMetadata, Metadata

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845068#comment-17845068 ] Tim Allison commented on TIKA-4252: --- Should we add an optional Metadata object to the FetchKey. We could

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845062#comment-17845062 ] Tim Allison commented on TIKA-4252: --- K, but you don't want that coming back and being populated

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845051#comment-17845051 ] Tim Allison commented on TIKA-4252: --- Or, if you mean that metadata gathered from the fetcher isn't

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845048#comment-17845048 ] Tim Allison commented on TIKA-4252: --- My initial thought for injecting user metadata was to pass through

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845047#comment-17845047 ] Tim Allison commented on TIKA-4252: --- I opened this branch: https://github.com/apache/tika/tree/TIKA-4252

[jira] [Reopened] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-4252: --- I pointed you to the wrong part of the code ... sorry. The design goal was to overwrite the extracted

[jira] [Commented] (TIKA-4253) Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit tests

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845022#comment-17845022 ] Tim Allison commented on TIKA-4253: --- This is happening in the unit tests because there are multiple

[jira] [Created] (TIKA-4253) Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit tests

2024-05-09 Thread Tim Allison (Jira)
Tim Allison created TIKA-4253: - Summary: Duplicate parsers loaded in AutoDetectParser in 3.x at least in some unit tests Key: TIKA-4253 URL: https://issues.apache.org/jira/browse/TIKA-4253 Project: Tika

[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844998#comment-17844998 ] Tim Allison commented on TIKA-4252: --- Good catch: https://github.com/apache/tika/blob/main/tika-core/src

[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844976#comment-17844976 ] Tim Allison edited comment on TIKA-4250 at 5/9/24 12:59 PM: libpst issue

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-09 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844976#comment-17844976 ] Tim Allison commented on TIKA-4250: --- libpff issue opened: https://github.com/libyal/libpff/issues/128

3.0.0-BETA2 release?

2024-05-07 Thread Tim Allison
All, I'd like to go for another 3.x beta release and then move fairly quickly to a 3.0.0 release. I was hoping that https://issues.apache.org/jira/browse/TIKA-4221 would be wrapped up soon. It hasn't been, but I can add the workaround we did in 2.x. What do you think? Any blockers?

[jira] [Updated] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-05-06 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4251: -- Description: I was recently working a bit on incubator-stormcrawler, and I noticed that they are using

[jira] [Updated] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-05-06 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4251: -- Summary: [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

[jira] [Created] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin

2024-05-06 Thread Tim Allison (Jira)
Tim Allison created TIKA-4251: - Summary: [DISCUSS] move to cosium's git-code-format-maven-plugin Key: TIKA-4251 URL: https://issues.apache.org/jira/browse/TIKA-4251 Project: Tika Issue Type

[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843746#comment-17843746 ] Tim Allison edited comment on TIKA-4250 at 5/6/24 5:03 PM: --- Wait, so

[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843798#comment-17843798 ] Tim Allison edited comment on TIKA-4250 at 5/6/24 5:02 PM: --- So, I caught

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843798#comment-17843798 ] Tim Allison commented on TIKA-4250: --- So, I caught an example of libpst not reading an attachment in our

[jira] [Updated] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4250: -- Attachment: 8.eml > Add a libpst-based parser > - > >

[jira] [Updated] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4250: -- Attachment: 8.msg > Add a libpst-based parser > - > >

[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843740#comment-17843740 ] Tim Allison edited comment on TIKA-4250 at 5/6/24 1:02 PM: --- Wow. This is super

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-04 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843428#comment-17843428 ] Tim Allison commented on TIKA-4250: --- Given your experience, I think it would be valuable to add libpff

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843361#comment-17843361 ] Tim Allison commented on TIKA-4250: --- Hahahahaha. I figured you'd have input on this [~lfcnassif]! Y

[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 2.9.2 version

2024-05-03 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843217#comment-17843217 ] Tim Allison commented on TIKA-4249: --- > Crystal ball is murky on the timing of the next 2.x and

[jira] [Created] (TIKA-4250) Add a libpst-based parser

2024-05-02 Thread Tim Allison (Jira)
Tim Allison created TIKA-4250: - Summary: Add a libpst-based parser Key: TIKA-4250 URL: https://issues.apache.org/jira/browse/TIKA-4250 Project: Tika Issue Type: Task Reporter: Tim

[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 2.9.2 version

2024-05-01 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842745#comment-17842745 ] Tim Allison commented on TIKA-4249: --- Version numbers for the fix are noted above: 2.9.3 and 3.0.0

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-05-01 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842605#comment-17842605 ] Tim Allison commented on TIKA-4243: --- Do we put it in tika-serialization or a new module? > t

[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-05-01 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842604#comment-17842604 ] Tim Allison commented on TIKA-4249: --- The example file shared was actually kind of weird. It looked like

[jira] [Updated] (TIKA-4249) EML file is treating it as text file in 2.9.2 version

2024-05-01 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4249: -- Summary: EML file is treating it as text file in 2.9.2 version (was: EML file is treating it as text

[jira] [Resolved] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-05-01 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4249. --- Fix Version/s: 3.0.0 2.9.3 Resolution: Fixed > EML file is treat

[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842405#comment-17842405 ] Tim Allison commented on TIKA-4249: --- Files never cease to amaze! Thank you. Onwards! > EML f

[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842402#comment-17842402 ] Tim Allison commented on TIKA-4249: --- Modifying the first hit from {{offset="0"}} to {{o

[jira] [Commented] (TIKA-4249) EML file is treating it as text file in 3.9.2 version

2024-04-30 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842401#comment-17842401 ] Tim Allison commented on TIKA-4249: --- I'm guessing you mean 2.9.0->2.9.2. The challenge with this f

[jira] [Created] (TIKA-4248) Improve PST handling of attachments

2024-04-29 Thread Tim Allison (Jira)
Tim Allison created TIKA-4248: - Summary: Improve PST handling of attachments Key: TIKA-4248 URL: https://issues.apache.org/jira/browse/TIKA-4248 Project: Tika Issue Type: Task

Re: Bump dependabot to weekly?

2024-04-29 Thread Tim Allison
https://github.com/apache/tika/commit/63b7e91477d1dcdb0a5535dd4a008a3562a0609b W00t. Thank you, Tilman! On Mon, Apr 29, 2024 at 10:58 AM Tilman Hausherr wrote: > Yes! > > Tilman > > On 29.04.2024 16:55, Tim Allison wrote: > > Oh, interesting. Should we bump t

Re: Bump dependabot to weekly?

2024-04-29 Thread Tim Allison
: > The positive side is that it's less interruptions. > One negative side is that there seems to be a maximum. Today it didn't > report the AWS update, which was detected in the past. > Tilman > > On 29.04.2024 16:34, Tim Allison wrote: > > The move to weekly dependabot has been

Re: Bump dependabot to weekly?

2024-04-29 Thread Tim Allison
The move to weekly dependabot has been a bit of a relief for me personally. Our mail list isn't clogged w daily dependabot updates (and yes, I know I can apply a filter :/). How is it working for everyone else? On Wed, Apr 10, 2024 at 4:09 PM Tim Allison wrote: > >you start deletin

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841252#comment-17841252 ] Tim Allison commented on TIKA-4243: --- https://json-schema.org/learn/getting-started-step-by-step Yes

Re: How to proceed when you are getting OSS index errors?

2024-04-26 Thread Tim Allison
Worst case scenario, or if you're building older releases: mvn clean install -Dossindex.skip On Mon, Apr 22, 2024 at 10:35 AM Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote: > thanks I'll pull latest > appreciate your help. > > On Mon, Apr 22, 2024 at 9:30 AM Tilman Hausherr > wrote:

[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841242#comment-17841242 ] Tim Allison edited comment on TIKA-4243 at 4/26/24 1:32 PM: I really, really

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841243#comment-17841243 ] Tim Allison commented on TIKA-4243: --- Oh, sorry. Does this break anything? Can we add this as a new

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841242#comment-17841242 ] Tim Allison commented on TIKA-4243: --- I really, really want to clean up our configuration, and moving

[jira] [Comment Edited] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841221#comment-17841221 ] Tim Allison edited comment on TIKA-4245 at 4/26/24 1:23 PM: Oops, sorry. I

[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841221#comment-17841221 ] Tim Allison commented on TIKA-4245: --- Oops, sorry. I didn't realize you sent your tika-config.xml. Y, one

[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841220#comment-17841220 ] Tim Allison commented on TIKA-4245: --- This is an ongoing area for improvement in Tika. The algorithm

Re: Question about tika-pipes FileSystemFetcher configuration options

2024-04-26 Thread Tim Allison
That's not possible yet. Please open an issue on our JIRA...you may need to request an account(?). On Fri, Apr 26, 2024 at 6:01 AM Emil Zegers wrote: > Hi, > > I'm looking for information if it is possible to configure > FileSystemFetcher for tika-pipes to only process certain files, e.g. based

[jira] [Resolved] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html

2024-04-25 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4244. --- Fix Version/s: 3.0.0 2.9.3 Resolution: Fixed Thank you [~boomxlucifer

[jira] [Commented] (TIKA-4244) Tika idenifies MIME type of ics files with html content as text/html

2024-04-25 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840852#comment-17840852 ] Tim Allison commented on TIKA-4244: --- Thank you [~boomxlucifer] for finding this and reporting

[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0

2024-04-22 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839780#comment-17839780 ] Tim Allison commented on TIKA-4166: ---  Thank you! > dependency updates for Tika

[jira] [Resolved] (TIKA-4242) Tika depends on non-existing plexus-utils version

2024-04-17 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4242. --- Resolution: Fixed > Tika depends on non-existing plexus-utils vers

[jira] [Commented] (TIKA-4242) Tika depends on non-existing plexus-utils version

2024-04-17 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838260#comment-17838260 ] Tim Allison commented on TIKA-4242: --- Looks like the reason we haven't found this problem is that we

[jira] [Commented] (TIKA-4241) Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment embedding in PDFs

2024-04-16 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837806#comment-17837806 ] Tim Allison commented on TIKA-4241: --- They add a custom key in the trailer {{/AdditionalStreams}} whose

[jira] [Updated] (TIKA-4241) Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment embedding in PDFs

2024-04-16 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4241: -- Attachment: testPDF_additionalStreams.pdf > Consider handling LibreOffice's /AdditionalStreams "

[jira] [Created] (TIKA-4241) Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment embedding in PDFs

2024-04-16 Thread Tim Allison (Jira)
Tim Allison created TIKA-4241: - Summary: Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment embedding in PDFs Key: TIKA-4241 URL: https://issues.apache.org/jira/browse

[jira] [Updated] (TIKA-4241) Consider handling LibreOffice's /AdditionalStreams "hybrid PDF" attachment embedding in PDFs

2024-04-16 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4241: -- Description: Some info here: https://stackoverflow.com/questions/67358370/what-the-standard-used

junk cves -- rant

2024-04-11 Thread Tim Allison
. And please, oh, please don't tell me that the llms are responsible for this! I'm hoping this is a post report echo artifact and not the cause of this report. https://gist.github.com/LLM4IG/6614bfa658295d7af07a6d37e06db27f -- Forwarded message - From: Tim Allison Date: Thu, Apr 11, 2024

[jira] [Commented] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836228#comment-17836228 ] Tim Allison commented on TIKA-4240: --- Thank you, [~tilman]! Should I revert to daily? > Cha

[jira] [Resolved] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4240. --- Resolution: Fixed Let's see how this goes. Thank you! > Change dependabot to wee

[jira] [Created] (TIKA-4240) Change dependabot to weekly

2024-04-11 Thread Tim Allison (Jira)
Tim Allison created TIKA-4240: - Summary: Change dependabot to weekly Key: TIKA-4240 URL: https://issues.apache.org/jira/browse/TIKA-4240 Project: Tika Issue Type: Task Reporter: Tim

Re: Bump dependabot to weekly?

2024-04-10 Thread Tim Allison
aily because this way we can learn ASAP if there are > > troubles with new dependency versions, although I'm now too busy. > > > > Tilman > > > > > > > > -- Original-Nachricht -- > > Von: Tim Allison > > Betreff: Bump dependabot to weekly? > > Da

Bump dependabot to weekly?

2024-04-10 Thread Tim Allison
All, Tilman has been doing heroic work keeping us up to date with dependabot's PRs. Given our pace of releases, would it make sense to back off to weekly updates? Before running regression tests, we'd run the update plugin to make sure that we're up to date. What do you think? Best,

Re: Checkstyle - ignore line length or just use a bigger value

2024-04-10 Thread Tim Allison
; quality of life much better > > On Wed, Apr 10, 2024, 10:03 AM Tim Allison wrote: > > > I bumped line length to 180 from 120. Let's see if that's enough. > > > > I'm not sure what the best option is for chained method calls? > > "Chained method calls" ->

Re: Checkstyle - ignore line length or just use a bigger value

2024-04-10 Thread Tim Allison
ormatter new line settings > to allow multi-line streaming expressions > > builder() > .name("nick") > .someOtherStuff("doIt") > .build() > > right now the formatter turns that into 1 line > > On Wed, Apr 10, 2024 at 5:06 AM Tim Allison
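
The formatting question in this message can be sketched in Java. This is only an illustration: the Person and Builder classes below are hypothetical stand-ins (nothing here is Tika code), and the method names simply mirror the snippet quoted above. Both methods build the same object; the only difference is the layout that the formatter currently collapses.

```java
public class ChainedBuilderExample {

    // Hypothetical value type, used only to illustrate the formatting question.
    static final class Person {
        final String name;
        final String someOtherStuff;

        private Person(Builder b) {
            this.name = b.name;
            this.someOtherStuff = b.someOtherStuff;
        }

        static Builder builder() {
            return new Builder();
        }
    }

    // Hypothetical fluent builder: every setter returns this so calls can be chained.
    static final class Builder {
        private String name;
        private String someOtherStuff;

        Builder name(String name) {
            this.name = name;
            return this;
        }

        Builder someOtherStuff(String value) {
            this.someOtherStuff = value;
            return this;
        }

        Person build() {
            return new Person(this);
        }
    }

    // Layout requested in the thread: one chained call per line.
    static Person multiLine() {
        return Person.builder()
                .name("nick")
                .someOtherStuff("doIt")
                .build();
    }

    // Layout the formatter reportedly produces: the whole chain on one line.
    static Person singleLine() {
        return Person.builder().name("nick").someOtherStuff("doIt").build();
    }

    public static void main(String[] args) {
        Person p = multiLine();
        System.out.println(p.name + " " + p.someOtherStuff);
    }
}
```

With only two or three calls the single-line form still fits within a 120 or 180 column limit, which may be why raising the limit alone does not settle how longer chains should be wrapped.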

Re: Checkstyle - ignore line length or just use a bigger value

2024-04-10 Thread Tim Allison
Sounds good. What length? On Wed, Apr 10, 2024 at 1:18 AM Nicholas DiPiazza wrote: > > can we bump up the line break to a more reasonable number? > some of the stream expressions start to wrap and wrap and wrap forcing me > to use smaller variable names or break down into methods when i'd

Java 17 for 3.x?

2024-04-09 Thread Tim Allison
/c330b12h1fvmq8x1099mgw3tfs0gcp6q On Mon, Apr 8, 2024 at 12:09 PM Tim Allison wrote: > > From October 2023: > https://www.brilworks.com/blog/java-11-countdown-to-end-of-support/ > > Getting 3.x out has taken longer than I had anticipated. Should we > reopen the 17 vs 11 discussion given Eric

Re: Document chunking

2024-04-09 Thread Tim Allison
tps://github.com/infiniflow/ragflow which might also > > have some interesting chunking approaches. > > > > Thanks > > > > Michael > > > > Am 09.04.24 um 01:25 schrieb Nick Burch: > >> On Mon, 8 Apr 2024, Tim Allison wrote: > >>> Not sure

Document chunking

2024-04-08 Thread Tim Allison
Not sure we should jump on the bandwagon, but anything we can do to support smart chunking would benefit us. Could just be more integrations with parsers that turn out to be useful. I haven’t had much joy with some. Here’s one that I haven’t evaluated yet: https://github.com/Filimoa/open-parse

Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Tim Allison
es that are in recent Java versions that we know about? > > > On Apr 8, 2024, at 7:02 AM, Tim Allison wrote: > > > > Sorry, more correctly: > > > > OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0 > > requires Java 17 and our 3.x is still o

Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Tim Allison
Sorry, more correctly: OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0 requires Java 17 and our 3.x is still on 11. On Mon, Apr 8, 2024 at 6:30 AM Tim Allison wrote: > > All, > As Brian pointed out, optimaize is no longer maintained, and it has > some depende
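
The constraint described in this message comes down to comparing two Java feature versions. A minimal sketch follows, assuming the figures quoted above (a required level of 17 and a baseline of 11); the class and method names are hypothetical and are not part of Tika or OpenNLP.

```java
public class JavaVersionGate {

    // Figure taken from the message above: OpenNLP >= 2.3.0 is said to need Java 17.
    static final int REQUIRED_FEATURE_VERSION = 17;

    // True when a runtime at the given feature version can load the dependency.
    static boolean meetsRequirement(int runtimeFeatureVersion, int requiredFeatureVersion) {
        return runtimeFeatureVersion >= requiredFeatureVersion;
    }

    public static void main(String[] args) {
        // Runtime.version().feature() is available from Java 10 onwards.
        int feature = Runtime.version().feature();
        if (meetsRequirement(feature, REQUIRED_FEATURE_VERSION)) {
            System.out.println("Java " + feature + ": the newer dependency could be used.");
        } else {
            System.out.println("Java " + feature + ": stay on the older dependency line.");
        }
    }
}
```

A project that keeps its minimum at Java 11 has to assume the second branch, which is the sense in which the dependency is effectively end-of-life for that release line.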

Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Tim Allison
ate the tika process in its own heap space as a separate java process rather than adding it to our app, but I suppose we could work around that Thank you Brian Laskey From: Tim Allison Reply-To: "u...@tika.apache.org" Date: Friday, March 8, 2024 at 9:44 AM To: "u...@tika.ap

Tika 3.0.0-BETA2?

2024-04-08 Thread Tim Allison
All, I'm now thinking it would make sense to have one more 3.x beta release before the final 3.0.0. Are there any breaking changes that we want to get into 3.x? I'd like to wait for COMPRESS-675 to be fixed and for COMPRESS-674 to be released before we release 3.0.0-BETA2. Any other items that

[jira] [Updated] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-04-06 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-4233: -- Fix Version/s: (was: 3.0.0) > Check tika-helm for deprecated k8s A
