Text extraction using /tika for container document only

Willy T. Koch Fri, 07 Jul 2023 02:06:05 -0700

Hi,
We're using the /tika endpoint of the Tika Docker image with text/plain output 
to extract file content for indexing in Elastic.


Say I have a 10 MB .msg file with 10 .docx and PDF attachments. I only need to 
extract the text from the .msg body, not from any of the attachments, as these 
are extracted from the .msg and handled separately. The request now times out 
because there is a massive amount of text to process.

I can't find any good examples of this for /tika, even on the excellent wiki 
at https://cwiki.apache.org/confluence/display/TIKA/TikaServer

Everything I try results in the attachments being part of the file extraction 
output.
I see there is a POST to /tika/form/main which sounds promising, but I can't 
get that to work.

Using /rmeta works, but it returns the result as JSON/HTML, and we ideally only 
need the file content as plain text.
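
For reference, here is a minimal sketch of the /rmeta route, assuming the 
server listens on the default port 9998 and that the /rmeta/text variant is 
used so that each "X-TIKA:content" field holds plain text rather than HTML. 
The helper names (container_text, extract_container_text) are made up for 
illustration. The first object in the /rmeta list is the container document, 
so only its content is kept. Note that the attachments are still parsed on the 
server, so this alone is unlikely to help with the timeout.

```python
import json
import urllib.request

# Assumption: tika-server is reachable on its default port.
TIKA_URL = "http://localhost:9998"


def container_text(rmeta_json: str) -> str:
    """Return the plain text of the container document from an /rmeta response."""
    docs = json.loads(rmeta_json)
    if not docs:
        return ""
    # /rmeta returns one metadata object per document; the container comes first,
    # followed by one object per attachment.
    return (docs[0].get("X-TIKA:content") or "").strip()


def extract_container_text(path: str, timeout: float = 60.0) -> str:
    """PUT a file to /rmeta/text and keep only the container's text."""
    with open(path, "rb") as fh:
        req = urllib.request.Request(
            TIKA_URL + "/rmeta/text",
            data=fh.read(),
            method="PUT",
            headers={"Accept": "application/json"},
        )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return container_text(resp.read().decode("utf-8"))
```

Calling extract_container_text("mail.msg") (a hypothetical file name) would 
then give the .msg body as a plain string that can be indexed directly.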

Any ideas would be greatly appreciated!

Regards,
Willy Koch
