I’m making progress on TIKA-4207, which will allow you to specify one emitter for the /rmeta-like output and a separate emitter for the raw bytes from all embedded files.
That uses the /pipes or /async endpoints. After I finish that, I’ll try to add another endpoint that returns a zip with embedded raw bytes and the rmeta content. Not sure what to call that endpoint. Recommendations?

On Thu, Mar 21, 2024 at 6:10 PM Tim Allison <talli...@apache.org> wrote:

> If rmeta/text is not returning text extracted from embedded files that’s a
> bug.
>
> I don’t think /rmeta/all is a thing.
>
> On Thu, Mar 21, 2024 at 5:21 PM Zig Zag <ziganda...@gmail.com> wrote:
>
>> Thanks Josh, thats correct but rmeta/text allows you to control this but
>> it only returns one level of text (not documents embedded within others) -
>> when you use the recursive interface rmeta/all it always returns content as
>> HTML and similarly unpack/all returns meta as CSV.
>>
>> On Thu, Mar 21, 2024 at 1:40 PM Josh Burchard <burch...@pnp-hcl.com>
>> wrote:
>>
>>> Samuel - Well, I use Tika server and I get my data back in JSON format
>>> because I use the /rmeta/text endpoint and send the HTTP header
>>> Accept:application/json. If you were to send Accept:text/plain would that
>>> work for you? I've only done that in the context of the /tika endpoint and
>>> that was long ago. Not sure how to do anything similar in the app because
>>> I never use that. By the way, in the context of using the server I find
>>> this table very helpful:
>>>
>>> https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared
>>>
>>> From: "Zig Zag" <ziganda...@gmail.com>
>>> To: user@tika.apache.org
>>> Date: 03/21/2024 03:49 PM
>>> Subject: Re: Meta output format of tika server /unpack/all
>>> ------------------------------
>>>
>>> Similarly is it possible to have /rmeta/all format content/text as text
>>> instead of HTML?
>>>
>>> On Thu, Mar 21, 2024 at 9:50 AM Zig Zag <ziganda...@gmail.com> wrote:
>>> Hi All,
>>>
>>> Is there a way to get the __META__ output of /unpack/all in a JSON
>>> rather than CSV ?
>>>
>>> Thank you,
>>> Samuel
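
For anyone following along, the request Josh describes in the thread (the /rmeta/text endpoint with an Accept: application/json header) might look roughly like the sketch below. It assumes a Tika server listening at http://localhost:9998; the function names `build_rmeta_text_request` and `fetch_rmeta_text` are made up for illustration and are not part of Tika.

```python
import json
import urllib.request


def build_rmeta_text_request(data: bytes,
                             base_url: str = "http://localhost:9998") -> urllib.request.Request:
    """Build a PUT request for the /rmeta/text endpoint.

    base_url is an assumption: adjust it to wherever your Tika server runs.
    """
    return urllib.request.Request(
        url=f"{base_url}/rmeta/text",
        data=data,                                # raw bytes of the file to parse
        method="PUT",
        headers={"Accept": "application/json"},   # ask for JSON, as Josh suggests
    )


def fetch_rmeta_text(path: str, base_url: str = "http://localhost:9998"):
    """Send a file to the server and return the decoded JSON response."""
    with open(path, "rb") as f:
        request = build_rmeta_text_request(f.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example usage (needs a running Tika server, so it is left commented out):
# for entry in fetch_rmeta_text("example.docx"):
#     print(entry.get("X-TIKA:content"))
```

If the endpoint behaves as Tim describes, the response should contain one metadata object for the container document and one for each embedded file, with the extracted text under the content key.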