Stay tuned! Coming soon: https://issues.apache.org/jira/browse/TIKA-4207

I think I'll wire this into the /pipes and /async endpoints. The JSON
request will specify that you want the raw bytes as well as the text and
metadata.

There will be two options:
a) you specify two emitters: one for the JSON and one for the raw bytes
b) you specify one emitter, and the JSON and raw bytes are packaged in a zip
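
To make option (b) a bit more concrete, here is a rough sketch of what
"JSON plus raw bytes in one zip" could look like. This is not the Tika
implementation and not its output format: the function names, the entry
names (metadata.json, embedded/...), and the metadata keys are all made up
for illustration, and it only uses the Python standard library.

```python
import io
import json
import zipfile


def package_results(metadata_list, embedded_bytes):
    """Bundle text+metadata (as JSON) and raw embedded bytes into one zip.

    metadata_list: list of dicts, one per document (container first).
    embedded_bytes: dict mapping a hypothetical entry name to raw bytes.
    Entry names used here are assumptions, not Tika's layout.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # One JSON entry holds text and metadata for parent and children.
        zf.writestr("metadata.json", json.dumps(metadata_list))
        # Each embedded file's raw bytes go under a common prefix.
        for name, data in embedded_bytes.items():
            zf.writestr("embedded/" + name, data)
    return buf.getvalue()


def unpack_results(zip_bytes):
    """Reverse of package_results: return (metadata_list, embedded_bytes)."""
    prefix = "embedded/"
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        metadata_list = json.loads(zf.read("metadata.json"))
        embedded = {
            name[len(prefix):]: zf.read(name)
            for name in zf.namelist()
            if name.startswith(prefix)
        }
    return metadata_list, embedded
```

With something like this, a single emitter only has to store one object per
input file, and the consumer splits it back into JSON and bytes afterwards.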

I'd really appreciate feedback on the design of this feature and any help
finding bugs!

Best,

        Tim

On Tue, Mar 12, 2024 at 3:17 AM Zig Zag <[email protected]> wrote:

> Hi All,
>
> I am trying to build a pipeline that needs to process content recursively
> and store the binary bytes of all embedded children in addition to their
> text and other metadata.
>
> I was looking at two options:
>
> 1. Using Tika's /rmeta API and having my code just call it synchronously.
> Is there a way for me to get the bytes for embedded children when doing
> this? Basically, some way to smoosh together what /unpack/all does into
> /rmeta.
> -   If it's not built in, any guidance on extending my own recursive
> handler to do this? I'd like to keep tika-server as is and just configure
> this extension so I can keep up with updates.
>
> 2. Using /async or /pipes. With this I had two questions:
> - Is there an emitter configuration to commit both bytes and text for all
> children?
> - Is there a way for me to pass in input with my HTTP request, and use an
> emitter only for storage? (Basically, some sort of fetcher that uses the
> input request stream; this would help me avoid one external request.)
>
> Thank you for any help!
> Samuel
>
