I'm sorry for my delay.

At some point, I was thinking about implementing /async/<task_id>, but I
gave up. The problem was that I didn't want to tie caching/storing
status info into tika-server or the async processor -- so I created a
configurable PipesReporter class -- see below.

If you set up logging carefully (and I need to document this better), you
can get a log per async subprocess that will include that "id" key and any
stack traces that were caught during the parse. Those logs will not tell
you when a subprocess timed out or crashed, but they do offer rich
information.
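
A rough sketch of the logging side (I'm going from memory here -- the exact
param names are worth double-checking against the tika-pipes unit tests, and
log4j2-async.xml is just a placeholder file name): you can hand the forked
subprocesses their own log4j2 config via the forkedJvmArgs in the async
params of tika-config.xml, something like:

    <async>
      <params>
        <numClients>4</numClients>
        <timeoutMillis>300000</timeoutMillis>
        <forkedJvmArgs>
          <arg>-Xmx1g</arg>
          <!-- each forked parse subprocess starts with this system property
               and logs according to its own log4j2 config, which is where
               the "id" key and any caught stack traces end up -->
          <arg>-Dlog4j.configurationFile=log4j2-async.xml</arg>
        </forkedJvmArgs>
      </params>
    </async>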

The other method (and I also have to document this better) is to specify a
PipesReporter in the async config section of the tika-config.xml file. The
PipesReporter sits in the root process and is aware of both parse
exceptions and fatal crashes/timeouts. I've implemented a couple:
JDBCPipesReporter (which I used quite a bit on some large processing jobs
with PostgreSQL) and LoggingPipesReporter.
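
Until the docs catch up, here's a rough sketch of what the reporter config
looks like (again from memory -- the element and param names may differ
slightly, the unit tests in tika-pipes are the source of truth, and the jdbc
connection string is just a placeholder):

    <async>
      <params>
        <numClients>4</numClients>
        <timeoutMillis>300000</timeoutMillis>
      </params>
      <pipesReporter class="org.apache.tika.pipes.reporters.jdbc.JDBCPipesReporter">
        <params>
          <!-- placeholder connection string; the reporter writes per-task
               status (the "id"/fetchKey plus success, parse exception,
               timeout or crash) to a table in this database -->
          <connection>jdbc:postgresql://localhost:5432/tika_status</connection>
        </params>
      </pipesReporter>
    </async>

You can then poll that table from your Python client to see which tasks
succeeded, hit parse exceptions, timed out or crashed.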

If you let me know which direction you'd like to head, I can focus on
documenting that portion. :D

Or if you figure out what you need from our repo's unit tests and want to
document your findings, we'd be happy to have help with the documentation!

Best,

            Tim



On Thu, Sep 28, 2023 at 7:08 AM Georgi Nikolov <[email protected]>
wrote:

> Hello,
>
> I feel like I have been stuck on this for too long now, so I decided to
> try to get some information here.
>
> I need Tika to extract content and metadata from thousands of files in
> S3. What I wanted to do is have Tika running as a standalone server and
> use the S3 fetchers and emitters in conjunction with /async.
>
> However, I am having some difficulties tracking what is going on on the
> server side. My client code is in Python, using `python-tika`.
>
> A payload is built programmatically and sent to the /async endpoint, but I
> need to be able to track either the whole async task or the individual
> tuples in it - have they failed, succeeded, or are they still running? I am
> struggling to achieve that and could not find any related information on
> whether this is possible. I came across some information that it can be
> achieved by checking `/tika/async/<task_id>` and that, when you send a PUT,
> the response should contain an X-Tika-id header, but neither of these seems
> to work. Additionally, from Confluence:
>
> As default the fetchKey is used as the id for logging.  However, if users
>> need a distinct task id for the request, they may add an id element:
>>
>> {
>>     "id": "myTaskId",
>>     "fetcher": "fsf",
>>     "fetchKey": "hello_world.pdf",
>>     "emitter": "fse",
>>     "emitKey": "hello_world.pdf.json"
>> }
>>
>>
>>
> Is there a way to track the task id when running from /async? It looks like
> there is from all that I have seen so far, but I can't seem to figure out
> how to actually achieve it. If I try to do a `GET` on /async/<task_id>,
> nothing happens - no resource; if I try /tika/async/<task_id>, I get a 405.
>
> I have tried using /pipes, which captures errors etc. and is handy, but
> what about async?
> /async doesn't seem to return any errors no matter what actually happens in
> Tika as long as the payload is valid - e.g. processing errors or bad
> credential errors for AWS - everything just gets skipped.
>
> Any pointers in the right direction will be welcome.
>
> Thanks,
> Georgi
>
