Two follow-ups:

1) TIKA-3227 was Dave Meikle's addition to skip embedded documents for /tika. Add a header X-Tika-Skip-Embedded with the value 'true'.
2) You can get just the text content from /rmeta via /rmeta/text.
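
In case it helps, here is a minimal Python sketch of both options, not a tested recipe. Assumptions on my part: the server is reachable at http://localhost:9998, requests are sent as PUT with the raw file bytes as the body, combining `maxEmbeddedResources:0` with /rmeta/text behaves the same as on plain /rmeta, and the /rmeta JSON is a list whose first entry is the container document with its text under `X-TIKA:content`. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: tika-server is running locally on its usual port.
TIKA_URL = "http://localhost:9998"


def build_tika_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Option 1: /tika as plain text, skipping embedded documents (TIKA-3227)."""
    # Note: urllib normalizes the case of header names; HTTP header names
    # are case-insensitive, so this should not matter to the server.
    return urllib.request.Request(
        f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain", "X-Tika-Skip-Embedded": "true"},
    )


def build_rmeta_text_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Option 2: /rmeta/text with embedded resources capped at zero."""
    # Assumption: the maxEmbeddedResources header is honored on /rmeta/text
    # the same way it is on /rmeta.
    return urllib.request.Request(
        f"{base_url}/rmeta/text",
        data=data,
        method="PUT",
        headers={"maxEmbeddedResources": "0"},
    )


def container_text(rmeta_json: str) -> str:
    """Pull the text of the first (container) document out of an /rmeta response."""
    documents = json.loads(rmeta_json)
    return documents[0].get("X-TIKA:content", "")


def extract_body_text(path: str) -> str:
    """Send a file to /tika and return only the primary document's text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read())
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Calling `extract_body_text("message.msg")` (the file name is just a placeholder) should then return the .msg body without the attachment text, provided the server version includes TIKA-3227.
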

On Thu, Jul 13, 2023 at 4:30 PM Tim Allison <[email protected]> wrote:

> Sorry for my delay.
> For /tika, I thought we had a way to tell it to parse only the primary
> document and skip the attachments, but I can't figure out how to do that
> quickly now. I'll look around some more.
>
> With /rmeta, try setting a header `maxEmbeddedResources:0`
>
> On Fri, Jul 7, 2023 at 5:06 AM Willy T. Koch <[email protected]> wrote:
>
>> Hi,
>> We're using /tika Docker endpoint with text/plain to extract file content
>> for indexing in Elastic.
>>
>> If I have a 10 Mb .msg file with 10 .docx and PDF attachments. I only
>> need to extract the text from the .msg body, not any of the attachments, as
>> these are extracted from the .msg and handled separately. It now times out
>> since it's massive amounts of text to process.
>>
>> I can't find any good examples for this for /tika, even on the excellent
>> wiki at https://cwiki.apache.org/confluence/display/TIKA/TikaServer
>>
>> Everything I try results in the attachments being part of the file
>> extraction output.
>> I see there is a POST to /tika/form/main which sounds promising, but I
>> can't get that to work.
>>
>> Using /rmeta does it as JSON/html, but we ideally only need the file
>> content as plain text.
>>
>> Any ideas would be greatly appreciated!
>>
>> Regards,
>> Willy Koch
