Re: Text extraction using /tika for container document only

Tim Allison Fri, 14 Jul 2023 08:58:13 -0700

Two follow-ups:

1) TIKA-3227 was Dave Meikle's addition to skip embedded documents for
/tika.  Add a header X-Tika-Skip-Embedded with value 'true'.
2) With /rmeta, you can get the content as plain text rather than HTML by
using /rmeta/text
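
To make the two options concrete, here is a minimal sketch in Python (standard
library only). The base URL assumes a tika-server listening on its default
port 9998 on localhost; the function names are made up for illustration. The
X-Tika-Skip-Embedded header and the /rmeta/text path are the ones named above,
and the maxEmbeddedResources header is the one suggested earlier in this
thread. I have not run this against your server, so treat it as a starting
point rather than a verified recipe.

```python
import json
import urllib.request

# Assumption: tika-server running locally on its default port.
TIKA_URL = "http://localhost:9998"


def tika_container_text_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to /tika that asks for plain text and skips embedded documents."""
    return urllib.request.Request(
        f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain", "X-Tika-Skip-Embedded": "true"},
    )


def rmeta_text_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to /rmeta/text, limiting embedded resources to zero."""
    return urllib.request.Request(
        f"{base_url}/rmeta/text",
        data=data,
        method="PUT",
        headers={"maxEmbeddedResources": "0"},
    )


def container_text(rmeta_json: str) -> str:
    """Return the text of the container document from an /rmeta response.

    /rmeta returns a JSON list of metadata objects; the first entry is the
    container document and its text is stored under "X-TIKA:content".
    """
    docs = json.loads(rmeta_json)
    return docs[0].get("X-TIKA:content", "")


def send(req: urllib.request.Request) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Usage would be along the lines of
`send(tika_container_text_request(open("message.msg", "rb").read()))` for
option 1, or `container_text(send(rmeta_text_request(...)))` for option 2,
where message.msg is a placeholder file name.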



On Thu, Jul 13, 2023 at 4:30 PM Tim Allison <[email protected]> wrote:

> Sorry for my delay.
> For /tika, I thought we had a way to tell it to parse only the primary
> document and skip the attachments, but I can't figure out how to do that
> quickly now.  I'll look around some more.
>
> With /rmeta, try setting a header `maxEmbeddedResources:0`
>
> On Fri, Jul 7, 2023 at 5:06 AM Willy T. Koch <[email protected]> wrote:
>
>> Hi,
>> We're using /tika Docker endpoint with text/plain to extract file content
>> for indexing in Elastic.
>>
>> Say I have a 10 MB .msg file with 10 .docx and PDF attachments. I only
>> need to extract the text from the .msg body, not any of the attachments, as
>> these are extracted from the .msg and handled separately. The request
>> currently times out because there is a massive amount of text to process.
>>
>> I can't find any good examples for this for /tika, even on the excellent
>> wiki at  https://cwiki.apache.org/confluence/display/TIKA/TikaServer
>>
>> Everything I try results in the attachments being part of the file
>> extraction output.
>> I see there is a POST to /tika/form/main which sounds promising, but I
>> can't get that to work.
>>
>> Using /rmeta does it as JSON/html, but we ideally only need the file
>> content as plain text.
>>
>> Any ideas would be greatly appreciated!
>>
>> Regards,
>> Willy Koch
>>
>>
