Question about tika-pipes FileSystemFetcher configuration options

2024-04-26 Thread Emil Zegers
Hi, I'd like to know whether it is possible to configure FileSystemFetcher for tika-pipes to process only certain files, e.g. based on extension or a pattern match on the file name/path. This way it would be possible to point to a specific root folder and only process matching files

[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841242#comment-17841242 ] Tim Allison edited comment on TIKA-4243 at 4/26/24 1:32 PM: I really, really

[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841220#comment-17841220 ] Tim Allison commented on TIKA-4245: --- This is an ongoing area for improvement in Tika. The algorithm is

Re: Question about tika-pipes FileSystemFetcher configuration options

2024-04-26 Thread Tim Allison
That's not possible yet. Please open an issue on our JIRA...you may need to request an account(?). On Fri, Apr 26, 2024 at 6:01 AM Emil Zegers wrote: > Hi, > > I'm looking for information if it is possible to configure > FileSystemFetcher for tika-pipes to only process certain files, e.g. based
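
Since the reply above says the fetcher cannot filter by pattern yet, one possible workaround is to select the matching files outside Tika and pass along only those. The following is a minimal sketch of that pre-filtering step, not a Tika API: the function name `select_fetch_keys` is hypothetical, and treating root-relative paths as the keys handed to tika-pipes is an assumption. How the resulting list is fed to the pipeline is left open here.

```python
from fnmatch import fnmatch
from pathlib import Path


def select_fetch_keys(root, patterns):
    """Return sorted, root-relative POSIX paths of files matching any pattern.

    Hypothetical helper for illustration only. Patterns use fnmatch
    semantics, where "*" also matches "/" and so spans subdirectories.
    """
    root = Path(root)
    keys = []
    for path in root.rglob("*"):
        # Skip directories; only regular files are candidates.
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if any(fnmatch(rel, pattern) for pattern in patterns):
            keys.append(rel)
    return sorted(keys)
```

For example, `select_fetch_keys("/data/docs", ["*.pdf", "reports/*"])` would list every PDF under the root plus everything under `reports/`; the directory and patterns here are placeholders.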

[jira] [Updated] (TIKA-4246) tika-pipes FileSystemFetcher configuration option for file name/path pattern selection

2024-04-26 Thread Emil Zegers (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emil Zegers updated TIKA-4246: -- Description: Would be useful to have the possibility to configure FileSystemFetcher for tika-pipes to

[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841221#comment-17841221 ] Tim Allison commented on TIKA-4245: --- Oops, sorry. I didn't realize you sent your tika-config.xml. Y, one

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841252#comment-17841252 ] Tim Allison commented on TIKA-4243: --- https://json-schema.org/learn/getting-started-step-by-step Yes,

[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Xiaohong Yang (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841209#comment-17841209 ] Xiaohong Yang commented on TIKA-4245: - [~tilman]  Can you detect the right charset (utf-8) and fix the

[jira] [Created] (TIKA-4246) tika-pipes FileSystemFetcher configuration option for file name/path pattern selection

2024-04-26 Thread Emil Zegers (Jira)
Emil Zegers created TIKA-4246: - Summary: tika-pipes FileSystemFetcher configuration option for file name/path pattern selection Key: TIKA-4246 URL: https://issues.apache.org/jira/browse/TIKA-4246

[jira] [Comment Edited] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841221#comment-17841221 ] Tim Allison edited comment on TIKA-4245 at 4/26/24 1:23 PM: Oops, sorry. I

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841243#comment-17841243 ] Tim Allison commented on TIKA-4243: --- Oh, sorry. Does this break anything? Can we add this as a new

[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841242#comment-17841242 ] Tim Allison commented on TIKA-4243: --- I really, really want to clean up our configuration, and moving to

Re: How to proceed when you are getting OSS index errors?

2024-04-26 Thread Tim Allison
Worst case scenario, or if you're building older releases: mvn clean install -Dossindex.skip On Mon, Apr 22, 2024 at 10:35 AM Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote: > thanks I'll pull latest > appreciate your help. > > On Mon, Apr 22, 2024 at 9:30 AM Tilman Hausherr > wrote: