I’m making progress on TIKA-4207, which will let you specify one emitter
for /rmeta-like output and a separate emitter for the raw bytes from all
embedded files.

That uses the /pipes or /async endpoints.

After I finish that, I’ll try to add another endpoint that returns a zip
with embedded raw bytes and the rmeta content.

Not sure what to call that endpoint. Recommendations?

On Thu, Mar 21, 2024 at 6:10 PM Tim Allison <talli...@apache.org> wrote:

> If rmeta/text is not returning text extracted from embedded files, that’s a
> bug.
>
> I don’t think /rmeta/all is a thing.
>
> On Thu, Mar 21, 2024 at 5:21 PM Zig Zag <ziganda...@gmail.com> wrote:
>
>> Thanks Josh, that's correct: rmeta/text lets you control this, but it
>> only returns one level of text (not documents embedded within others).
>> When you use the recursive interface rmeta/all, it always returns content
>> as HTML, and similarly unpack/all returns meta as CSV.
>>
>> On Thu, Mar 21, 2024 at 1:40 PM Josh Burchard <burch...@pnp-hcl.com>
>> wrote:
>>
>>> Samuel - Well, I use Tika server and I get my data back in JSON format
>>> because I use the /rmeta/text endpoint and send the HTTP header
>>> Accept:application/json.  If you were to send Accept:text/plain would that
>>> work for you? I've only done that in the context of the /tika endpoint and
>>> that was long ago.  Not sure how to do anything similar in the app because
>>> I never use that.  By the way, in the context of using the server I find
>>> this table very helpful:
>>>
>>> https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared
>>>
>>> From:        "Zig Zag" <ziganda...@gmail.com>
>>> To:        user@tika.apache.org
>>> Date:        03/21/2024 03:49 PM
>>> Subject:        Re: Meta output format of tika server /unpack/all
>>> ------------------------------
>>>
>>> Similarly, is it possible to have /rmeta/all format content/text as text
>>> instead of HTML?
>>>
>>> On Thu, Mar 21, 2024 at 9:50 AM Zig Zag <ziganda...@gmail.com> wrote:
>>> Hi All,
>>>
>>> Is there a way to get the __META__ output of /unpack/all as JSON
>>> rather than CSV?
>>>
>>> Thank you,
>>> Samuel
>>>
>>>
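For reference, the request Josh describes in the thread (a PUT to /rmeta/text
with Accept: application/json) might be sketched in Python roughly as below.
This is only an illustration: it assumes a tika-server listening on its
default local address (http://localhost:9998), the function names are made up
for the example, and it assumes the extracted text sits under the
X-TIKA:content key of each metadata object in the returned JSON list.

```python
import json
import urllib.request

# Assumption: a tika-server running locally on its default port.
TIKA_URL = "http://localhost:9998"


def build_rmeta_text_request(data, base_url=TIKA_URL):
    """Build (but do not send) a PUT to /rmeta/text asking for JSON."""
    return urllib.request.Request(
        url=base_url + "/rmeta/text",
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def extract_texts(rmeta_json):
    """Return the text of each document in an /rmeta-style JSON list.

    Assumes one metadata object per document (container first, then
    embedded documents), with the text under "X-TIKA:content".
    """
    docs = json.loads(rmeta_json)
    return [doc.get("X-TIKA:content", "") for doc in docs]
```

With a server actually running, the request could be sent with
urllib.request.urlopen(build_rmeta_text_request(file_bytes)) and the decoded
response body passed to extract_texts.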
