Filter within zip file

David Pilato Fri, 13 Sep 2024 06:59:09 -0700

Hey team,


I'm wondering if there is a way to filter the content that Tika extracts, 
for example by filename.
Let's say I have a zip file containing foo.js, foo.pdf, foo.html, and foo.png, 
and I only want to extract text from the PDF and HTML files.

Also, I can see that the content of a zip file is extracted as a single String, like this:

"""
doc/ab1.js
CONTENT1
abc/abc2.pdf
CONTENT2
...
"""

Would it be possible to extract the content as separate objects, something 
like:

```
[
{ "name": "doc/ab1.js", "content": "CONTENT1", "metadata": [ /* ... */ ] },
{ "name": "abc/abc2.pdf", "content": "CONTENT2", "metadata": [ /* ... */ ] },
...
]
```
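
To make the request concrete, here is a rough sketch of the behaviour I mean. 
It uses only `java.util.zip`, not Tika, so it says nothing about what Tika 
offers. `ZipEntryFilter` and `extract` are hypothetical names, and decoding the 
raw bytes as UTF-8 is only a stand-in for real text extraction.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika: it only illustrates the desired behaviour.
public class ZipEntryFilter {

    // Returns one {"name", "content"} map per entry whose extension is allowed.
    public static List<Map<String, String>> extract(InputStream in, Set<String> allowedExtensions)
            throws IOException {
        List<Map<String, String>> results = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                int dot = name.lastIndexOf('.');
                String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
                if (entry.isDirectory() || !allowedExtensions.contains(ext)) {
                    continue; // skip directories and unwanted file types
                }
                // Read the bytes of the current entry only.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int read;
                while ((read = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, read);
                }
                Map<String, String> item = new LinkedHashMap<>();
                item.put("name", name);
                // Placeholder: a real version would pass these bytes to a parser
                // instead of decoding them as UTF-8 text.
                item.put("content", new String(buffer.toByteArray(), StandardCharsets.UTF_8));
                results.add(item);
            }
        }
        return results;
    }
}
```

Calling `ZipEntryFilter.extract(stream, extensions)` with the extensions `pdf` 
and `html` on the zip above would return entries for foo.pdf and foo.html only, 
each as its own object rather than as part of one concatenated String.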

Thanks!
