Thank you, Tim! Very kind of you to share. This gives us a good place to
start. We will try the /async model, and I'll let you know if we end
up needing a synchronous call at all.
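
P.S. For the interim /rmeta prototype asked about earlier in the thread, one
stopgap is two synchronous calls: /rmeta for text and metadata, and
/unpack/all for the raw bytes. The sketch below is only a rough
illustration, not the TIKA-4207 design. It assumes a tika-server on the
default http://localhost:9998; the helper names (put, embedded_entries,
read_zip_entries, prototype) are made up for this example, and matching zip
entry names back to /rmeta entries is left to the caller because the naming
is not guaranteed to line up.

```python
import io
import json
import urllib.request
import zipfile

TIKA_URL = "http://localhost:9998"  # assumption: default tika-server address


def put(endpoint, data, accept=None):
    """PUT raw document bytes to a tika-server endpoint; return the body."""
    headers = {"Accept": accept} if accept else {}
    req = urllib.request.Request(
        TIKA_URL + endpoint, data=data, headers=headers, method="PUT"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def embedded_entries(rmeta_json):
    """Return metadata objects for embedded children.

    /rmeta returns a JSON list whose first element describes the
    container document; everything after it is an embedded child.
    """
    return json.loads(rmeta_json)[1:]


def read_zip_entries(zip_bytes):
    """Map each entry name in an /unpack/all zip to its raw bytes."""
    if not zip_bytes:  # empty body: nothing was unpacked
        return {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


def prototype(doc_bytes):
    """Two synchronous calls: text+metadata, then raw bytes."""
    children = embedded_entries(
        put("/rmeta", doc_bytes, accept="application/json")
    )
    raw = read_zip_entries(put("/unpack/all", doc_bytes))
    return children, raw
```

The obvious cost is that the document is uploaded and parsed twice, which is
why a single /async request that emits both looks like the better long-term
fit.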


On Tue, Mar 12, 2024 at 5:14 PM Tim Allison <[email protected]> wrote:

> Still haven’t opened PR, but working on the TIKA-4207 branch:
> https://github.com/apache/tika/tree/TIKA-4207
>
> The initial integration will be with /pipes and /async
>
> I’ll try to add something to /v2/unpack (?) later?
>
> On Tue, Mar 12, 2024 at 4:49 PM Zig Zag <[email protected]> wrote:
>
>> That's great to hear, Tim, thank you! Will definitely provide feedback.
>>
>> Until this officially lands in 3.0, is there something I can prototype
>> with /rmeta to help me get my other stuff working? Any suggestions on
>> approach, or a draft PR for the official feature, would be very helpful.
>>
>> On Tue, Mar 12, 2024 at 5:53 AM Tim Allison <[email protected]> wrote:
>>
>>> Stay tuned! Coming soon: https://issues.apache.org/jira/browse/TIKA-4207
>>>
>>> I think I'll be wiring this into the /pipes and /async endpoints. The
>>> json request will specify that you want bytes AND text+metadata.
>>>
>>> There will be two options:
>>> a) you specify two emitters: one for json and one for raw bytes
>>> b) you specify one emitter, and the json and raw bytes are packaged in a
>>> zip
>>>
>>> I'd really appreciate feedback on the design of this feature and any
>>> help finding bugs!
>>>
>>> Best,
>>>
>>>         Tim
>>>
>>> On Tue, Mar 12, 2024 at 3:17 AM Zig Zag <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am trying to build a pipeline that needs to process content
>>>> recursively and store the binary bytes of all embedded children in addition
>>>> to their text and other metadata.
>>>>
>>>> I was looking at two options:
>>>>
>>>> 1. Using Tika's /rmeta API and having my code just call it
>>>> synchronously. Is there a way for me to get bytes for embedded children
>>>> when doing this? Basically, some way to smoosh together what /unpack/all
>>>> does into /rmeta.
>>>> -   If it's not built in, is there any guidance on extending my own
>>>> recursive handler to do this? I'd like to keep tika-server as is and just
>>>> configure this extension so I can keep up with updates.
>>>>
>>>> 2. Using /async or /pipes. With this I had two questions:
>>>> - Is there an emitter configuration to commit both bytes and text for all
>>>> children?
>>>> - Is there a way for me to pass in input with my HTTP request, and use
>>>> an emitter only for storage? (Basically, some sort of fetcher that uses
>>>> the input request stream. This would help me avoid one external request.)
>>>>
>>>> Thank you for any help!
>>>> Samuel
>>>>
>>>
