Thank you, Tim! Very kind of you to share. This gives us a good starting point: we will try the /async model, and I'll let you know if we end up needing a synchronous call at all.

On Tue, Mar 12, 2024 at 5:14 PM Tim Allison <[email protected]> wrote:

> Still haven’t opened a PR, but working on the TIKA-4207 branch:
> https://github.com/apache/tika/tree/TIKA-4207
>
> The initial integration will be with /pipes and /async
>
> I’ll try to add something to /v2/unpack (?) later?
>
> On Tue, Mar 12, 2024 at 4:49 PM Zig Zag <[email protected]> wrote:
>
>> That's great to hear, Tim, thank you! Will definitely provide feedback.
>>
>> While this gets into 3.0 officially, is there something I can prototype
>> with /rmeta to help me get my other stuff working? Any suggestions on
>> approach, or a draft PR for the official feature, would be very helpful.
>>
>> On Tue, Mar 12, 2024 at 5:53 AM Tim Allison <[email protected]> wrote:
>>
>>> Stay tuned! Coming soon: https://issues.apache.org/jira/browse/TIKA-4207
>>>
>>> I think I'll be wiring this into the /pipes and /async endpoints. The
>>> json request will specify that you want bytes AND text+metadata.
>>>
>>> There will be two options:
>>> a) you specify two emitters: one for json and one for raw bytes
>>> b) you specify one emitter, and the json and raw bytes are packaged in a
>>> zip
>>>
>>> I'd really appreciate feedback on the design of this feature and any
>>> help finding bugs!
>>>
>>> Best,
>>>
>>> Tim
>>>
>>> Cheers,
>>>
>>> Tim
>>>
>>> On Tue, Mar 12, 2024 at 3:17 AM Zig Zag <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am trying to build a pipeline that needs to process content
>>>> recursively and store the binary bytes of all embedded children in addition
>>>> to their text and other metadata.
>>>>
>>>> I was looking at two options:
>>>>
>>>> 1. Using Tika's /rmeta API and having my code just call it
>>>> synchronously. Is there a way for me to get bytes for embedded children
>>>> when doing this? Basically, some way to smoosh together what /unpack/all
>>>> does into /rmeta.
>>>> - If it's not built in, any guidance on extending my own recursive
>>>> handler to do this? I'd like to keep tika-server as is and just configure
>>>> this extension so I can keep up with updates.
>>>>
>>>> 2. Using /async or /pipes. With this I had 2 questions:
>>>> - Is there an emitter configuration to commit both bytes and text for all
>>>> children?
>>>> - Is there a way for me to pass in input with my HTTP request, and use
>>>> an emitter only for storage (basically some sort of fetcher that uses the
>>>> input request stream)? This will help me avoid one external request.
>>>>
>>>> Thank you for any help!
>>>> Samuel
