Sorry for my delay. For /tika, I thought we had a way to tell it to parse only the primary document and skip the attachments, but I can't figure out how to do that quickly now. I'll look around some more.
With /rmeta, try setting a header `maxEmbeddedResources:0`.

On Fri, Jul 7, 2023 at 5:06 AM Willy T. Koch <[email protected]> wrote:

> Hi,
> We're using /tika Docker endpoint with text/plain to extract file content
> for indexing in Elastic.
>
> If I have a 10 Mb .msg file with 10 .docx and PDF attachments. I only need
> to extract the text from the .msg body, not any of the attachments, as
> these are extracted from the .msg and handled separately. It now times out
> since it's massive amounts of text to process.
>
> I can't find any good examples for this for /tika, even on the excellent
> wiki at https://cwiki.apache.org/confluence/display/TIKA/TikaServer
>
> Everything I try results in the attachments being part of the file
> extraction output.
> I see there is a POST to /tika/form/main which sounds promising, but I
> can't get that to work.
>
> Using /rmeta does it as JSON/html, but we ideally only need the file
> content as plain text.
>
> Any ideas would be greatly appreciated!
>
> Regards,
> Willy Koch
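In case it helps, here is a rough, untested-against-your-setup sketch of the /rmeta suggestion in Python. It assumes the server listens on `http://localhost:9998`, that the `/rmeta/text` variant (plain-text content instead of XHTML) is available in your version, and that the first entry of the returned JSON list is the container document with its text under `X-TIKA:content`. The function names are made up for illustration.

```python
import json
import urllib.request

# Assumption: tika-server running locally on its usual port.
TIKA_URL = "http://localhost:9998"


def build_rmeta_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to /rmeta/text asking Tika to skip embedded resources."""
    return urllib.request.Request(
        f"{base_url}/rmeta/text",
        data=data,
        method="PUT",
        headers={
            # Header suggested above; 0 should mean "no attachments".
            "maxEmbeddedResources": "0",
            "Accept": "application/json",
        },
    )


def primary_text(rmeta_json: str) -> str:
    """Return the text of the first (container) document in an /rmeta response."""
    docs = json.loads(rmeta_json)
    if not docs:
        return ""
    # Assumption: the extracted text is stored under "X-TIKA:content".
    return (docs[0].get("X-TIKA:content") or "").strip()


def extract_body_text(path: str, base_url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Send a file to tika-server and return only the primary document's text."""
    with open(path, "rb") as fh:
        req = build_rmeta_request(fh.read(), base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return primary_text(resp.read().decode("utf-8"))
```

Calling `extract_body_text("mail.msg")` (the file name is hypothetical) should then give you the .msg body as plain text, ready to index. Taking only the first list entry also guards against attachments slipping through if the header is ignored, although in that case Tika would still spend time parsing them.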
