Re: Text extraction using /tika for container document only

Tim Allison Thu, 13 Jul 2023 13:30:34 -0700

Sorry for my delay.
For /tika, I thought we had a way to tell it to parse only the primary
document and skip the attachments, but I can't find how to do that right
now.  I'll look around some more.


With /rmeta, try setting a header `maxEmbeddedResources:0`
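
A minimal sketch of that /rmeta suggestion in Python, not a tested recipe.
Assumptions not stated above: the server listens on localhost:9998, the
/rmeta/text variant is used so the content comes back as plain text inside
the JSON, and the container's text sits under the "X-TIKA:content" key of
the first element of the returned list. The function names are made up for
illustration.

```python
import json
import urllib.request

# Assumption: tika-server is reachable on its usual default port.
TIKA_URL = "http://localhost:9998"


def build_rmeta_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to /rmeta/text that asks Tika to skip embedded documents."""
    return urllib.request.Request(
        url=f"{base_url}/rmeta/text",
        data=data,
        method="PUT",
        # urllib re-capitalizes header names; HTTP header names are
        # case-insensitive, so the server should still see the header.
        headers={"maxEmbeddedResources": "0"},
    )


def container_text(rmeta_json: str) -> str:
    """Return the text of the first (container) entry of an /rmeta response."""
    docs = json.loads(rmeta_json)
    if not docs:
        return ""
    # Assumption: the extracted text is stored under "X-TIKA:content".
    return docs[0].get("X-TIKA:content", "")


def extract_container_text(path: str, base_url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Send one file to the server and return only the container's text."""
    with open(path, "rb") as fh:
        request = build_rmeta_request(fh.read(), base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return container_text(response.read().decode("utf-8"))
```

Calling extract_container_text("mail.msg") (a hypothetical file name) should
then give you just the .msg body as a string that you can index directly,
without post-processing the rest of the JSON.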

On Fri, Jul 7, 2023 at 5:06 AM Willy T. Koch <[email protected]> wrote:

> Hi,
> We're using the /tika endpoint of the Docker image with text/plain to
> extract file content for indexing in Elastic.
>
> Say I have a 10 MB .msg file with 10 .docx and PDF attachments. I only need
> to extract the text from the .msg body, not any of the attachments, as
> these are extracted from the .msg and handled separately. The request
> currently times out because there is a massive amount of text to process.
>
> I can't find any good examples of this for /tika, even on the excellent
> wiki at https://cwiki.apache.org/confluence/display/TIKA/TikaServer
>
> Everything I try results in the attachments being part of the file
> extraction output.
> I see there is a POST to /tika/form/main which sounds promising, but I
> can't get that to work.
>
> Using /rmeta works, but it returns JSON/HTML, and we ideally only need the
> file content as plain text.
>
> Any ideas would be greatly appreciated!
>
> Regards,
> Willy Koch
>
>
