Re: Tika server mode - /async task tracking

Tim Allison Thu, 05 Oct 2023 06:47:41 -0700

Sounds good.  If you want to get started with logging, use this as a
template:
https://github.com/tballison/tika-gui-v2/blob/main/tika-gui-app/src/main/resources/templates/log4j2/log4j2-async.xml
Replace the {LOGS_PATH} stub.  This will be useful for pipes as well.


You can see how to get pipes to use this in the spawned processes here:
https://github.com/tballison/tika-gui-v2/blob/main/tika-gui-app/src/main/resources/templates/config/async.xml

Replace <async> with <pipes> and edit as appropriate.

Great to hear that someone is using pipes.  Please let us know if you have
any other questions.

Best,

      Tim

On Wed, Oct 4, 2023 at 5:09 AM Georgi Nikolov <[email protected]> wrote:

> Hello Tim,
>
> Thank you for the clarification. I have decided to stick to the pipes
> endpoint for now, because I can use the JSON response from the Tika API to
> at least detect when file X did not process properly. I will definitely
> explore the Pipes reporter and the unit tests; if I end up with something
> worth documenting, I will give you a shout on here.
>
> Many thanks,
> Georgi
>
> On Tue, 3 Oct 2023, 20:38 Tim Allison, <[email protected]> wrote:
>
>> I'm sorry for my delay.
>>
>> At some point, I was thinking about implementing: /async/<task_id>  but I
>> gave up. The problem was that I didn't want to have to tie caching/storing
>> status info into tika-server or the async processor -- so I created a
>> configurable PipesReporter class...see below.
>>
>> If you set up logging carefully, and I need to document this better, you
>> can get a log per async subprocess which will include that "id" key and any
>> assorted stacktraces that were caught during the parse.  Those logs will
>> not tell you when a subprocess timed out or crashed, but they do offer rich
>> information.
>>
>> The other method (and I also have to document this better) is to specify
>> a PipesReporter in the async config section of the tika-config.xml file.
>> The pipes-reporter sits in the root process and is aware of both parse
>> exceptions and fatal crashes/timeouts.  I've implemented a couple:
>> JDBCPipesReporter (which I used quite a bit on some large processing jobs
>> with postgresql) and there's the LoggingPipesReporter.
>>
>> If you let me know which direction you'd like to head, I can focus on
>> documenting that portion. :D
>>
>> Or if you figure out what you need from our repo's unit tests and want to
>> document your findings, we'd be happy to have help with the documentation!
>>
>> Best,
>>
>>             Tim
>>
>>
>>
>> On Thu, Sep 28, 2023 at 7:08 AM Georgi Nikolov <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> I feel I have been stuck on this for too long now, so I decided to try to
>>> get some information here.
>>>
>>> I need Tika to extract content and metadata from thousands of files in
>>> S3. What I wanted to do is run Tika as a standalone server and use S3
>>> fetchers and emitters in conjunction with /async.
>>>
>>> However, I am having difficulty tracking what is going on on the server
>>> side. My client code is in Python, using `python-tika`.
>>>
>>> A payload is built programmatically and sent to the /async endpoint, but
>>> I need to be able to track either the whole async task or the individual
>>> tuples in it: have they failed, succeeded, or are they still running? I
>>> am struggling to achieve that and could not find any information on
>>> whether it is possible. I came across some information that it can be
>>> achieved by checking `/tika/async/<task_id>` and that when you send a PUT
>>> the response should contain an X-Tika-id header, but neither of these
>>> seems to work. Additionally, from Confluence:
>>>
>>>> As default the fetchKey is used as the id for logging.  However, if
>>>> users need a distinct task id for the request, they may add an id
>>>> element:
>>>>
>>>> {
>>>>     "id": "myTaskId",
>>>>     "fetcher": "fsf",
>>>>     "fetchKey": "hello_world.pdf",
>>>>     "emitter": "fse",
>>>>     "emitKey": "hello_world.pdf.json"
>>>> }
>>>>
>>>>
>>>>
>>> Is there a way to track the task id when running from /async? From all
>>> that I have seen so far it looks like there is, but I can't figure out how
>>> to actually achieve it. If I try a `GET` on /async/<task_id>, nothing
>>> happens (no resource); if I use /tika/async/<task_id>, I get a 405.
>>>
>>> I have tried using /pipes, which captures errors etc. and is handy, but
>>> what about async?
>>> /async doesn't seem to throw any errors no matter what actually happens
>>> in Tika, as long as the payload is valid; e.g. with processing errors or
>>> bad AWS credentials, everything just gets skipped.
>>>
>>> Any pointers in the right direction will be welcome.
>>>
>>> Thanks,
>>> Georgi
>>>
>>
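
Since the thread turns on the per-tuple "id" key (it is what shows up in
the per-subprocess logs), here is a minimal Python sketch of building a
batch of fetch-emit tuples with distinct ids on the client side. The field
names come from the Confluence snippet quoted above. Everything else is an
assumption for illustration: the fetcher/emitter names "s3f" and "s3e"
(they must match whatever is declared in your tika-config.xml), the
hypothetical build_tuples helper, the "<batch>-<index>" id scheme, and
sending the tuples as a JSON list. Check these against your Tika version.

```python
import json


def build_tuples(fetch_keys, batch_id, fetcher="s3f", emitter="s3e"):
    """Build one fetch-emit tuple per key, each with a distinct "id".

    "s3f"/"s3e" are placeholder names, not Tika defaults; use the
    fetcher and emitter names from your own tika-config.xml.
    """
    tuples = []
    for i, key in enumerate(fetch_keys):
        tuples.append({
            "id": f"{batch_id}-{i:06d}",   # hypothetical id scheme
            "fetcher": fetcher,
            "fetchKey": key,
            "emitter": emitter,
            "emitKey": key + ".json",
        })
    return tuples


tuples = build_tuples(["docs/a.pdf", "docs/b.docx"], batch_id="run42")

# Keep a client-side map so an id seen later in a log line can be
# traced back to the file it refers to.
id_to_key = {t["id"]: t["fetchKey"] for t in tuples}

# Request body, assuming the endpoint accepts a JSON list of tuples.
payload = json.dumps(tuples, indent=2)
print(payload)
```

With ids assigned this way, the client can search the logs Tim describes
for a given id and map it back to the original S3 key via id_to_key.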
