Sounds good. If you want to get started with logging, use this as a template: https://github.com/tballison/tika-gui-v2/blob/main/tika-gui-app/src/main/resources/templates/log4j2/log4j2-async.xml . Replace the {LOGS_PATH} stub. This will be useful for pipes as well.
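For anyone scripting that step, here is a minimal sketch of filling in the {LOGS_PATH} stub with plain string replacement. The helper name `fill_logs_path`, the sample XML fragment, and the `/var/log/tika/` directory are made up for illustration; the fragment is not copied from the linked template.

```python
def fill_logs_path(template_text: str, logs_dir: str) -> str:
    """Return template_text with every {LOGS_PATH} stub replaced by logs_dir."""
    if "{LOGS_PATH}" not in template_text:
        # Fail loudly instead of silently writing an unchanged config.
        raise ValueError("template contains no {LOGS_PATH} stub")
    # Drop a trailing slash so "{LOGS_PATH}/name.log" does not become "//".
    return template_text.replace("{LOGS_PATH}", logs_dir.rstrip("/"))


# Illustrative fragment only, NOT the real template's contents.
sample = '<File name="file" fileName="{LOGS_PATH}/tika-async.log"/>'
filled = fill_logs_path(sample, "/var/log/tika/")
print(filled)
```

In practice you would read the downloaded template from disk, pass its text through the helper, and write the result wherever your log4j2 configuration is loaded from.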
You can see how to get pipes to use this in the spawned processes here: https://github.com/tballison/tika-gui-v2/blob/main/tika-gui-app/src/main/resources/templates/config/async.xml Replace <async> with <pipes> and edit as appropriate.

Great to hear that someone is using pipes. Please let us know if you have any other questions.

Best,
Tim

On Wed, Oct 4, 2023 at 5:09 AM Georgi Nikolov <[email protected]> wrote:

> Hello Tim,
>
> Thank you for the clarification. I have decided to stick to the pipes
> endpoint for now, because I can use the json response from tika api to at
> least detect when file X did not process properly. Will definitely explore
> the Pipes reporter and the unit tests, if I have something worthy
> documenting in the end I will give you a shout on here.
>
> Many thanks,
> Georgi
>
> On Tue, 3 Oct 2023, 20:38 Tim Allison, <[email protected]> wrote:
>
>> I'm sorry for my delay.
>>
>> At some point, I was thinking about implementing: /async/<task_id> but I
>> gave up. The problem was that I didn't want to have to tie caching/storing
>> status info into tika-server or the async processor -- so I created a
>> configurable PipesReporter class...see below.
>>
>> If you set up logging carefully, and I need to document this better, you
>> can get a log per async subprocess which will include that "id" key and any
>> assorted stacktraces that were caught during the parse. Those logs will
>> not tell you when a subprocess timed out or crashed, but they do offer rich
>> information.
>>
>> The other method (and I also have to document this better) is to specify
>> a PipesReporter in the async config section of the tika-config.xml file.
>> The pipes-reporter sits in the root process and is aware of both parse
>> exceptions and fatal crashes/timeouts. I've implemented a couple:
>> JDBCPipesReporter (which I used quite a bit on some large processing jobs
>> with postgresql) and there's the LoggingPipesReporter.
>>
>> If you let me know which direction you'd like to head, I can focus on
>> documenting that portion. :D
>>
>> Or if you figure out what you need from our repo's unit tests and want to
>> document your findings, we'd be happy to have help with the documentation!
>>
>> Best,
>>
>> Tim
>>
>>
>>
>> On Thu, Sep 28, 2023 at 7:08 AM Georgi Nikolov <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> I have been stuck on this for too long now I feel like, so I decided to
>>> try to get some information here.
>>>
>>> I would need Tika to extract content and metadata from thousands of
>>> files from S3. What I wanted to do is, have Tika running as a standalone
>>> server and use S3 fetchers and emitters in conjunction with /async
>>>
>>> However I am having some difficulties to track what is going on the
>>> server side, my client code is in python using `python-tika`
>>>
>>> A payload is built programmatically and sent to /async endpoint, but I
>>> need to be able to track either the whole async task or individual tuples
>>> from it - have they failed, succeeded, or still running but I am
>>> struggling to achieve that, could not found any related information on
>>> whether this is possible, came across some information that it can be
>>> achieved by checking `/tika/async/<task_id> and that when you send a put
>>> the response should contain X-Tika-id header, but none of these seem to
>>> work. Additionally from confluence:
>>>
>>> As default the fetchKey is used as the id for logging. However, if
>>>> users need a distinct task id for the request, they may add an id
>>>> element:
>>>>
>>>> {
>>>> "id": "myTaskId",
>>>> "fetcher": "fsf",
>>>> "fetchKey": "hello_world.pdf",
>>>> "emitter": "fse",
>>>> "emitKey": "hello_world.pdf.json"
>>>> }
>>>>
>>>>
>>>>
>>> Is there a way to track the task id when running from /async looks like
>>> there is from all that I have seen so far but can't seem to figure out how
>>> to actually achieve it, if i try to do `GET` on /async/<task_id> nothing
>>> happens - no resource, if i try to use /tika/async/<task_id> I get a 405.
>>>
>>> I have tried using /pipes which would capture errors etc and is handy
>>> but what about async ?
>>> /async doesn't seem to throw any errors no matter what actually happens
>>> in tika as long the payload is valid e.g processing errors, or bad cred
>>> errors for aws everything just gets skipped.
>>>
>>> Any pointers in the right direction will be welcome.
>>>
>>> Thanks,
>>> Georgi
>>>
>>
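
For readers building such payloads from Python, as in the original question, here is a rough sketch that generates one tuple per file with a distinct "id", following the field names in the quoted Confluence example. The helper name `build_tuples` and the `task-N` id scheme are assumptions made for illustration; the defaults "fsf" and "fse" are just the fetcher and emitter names from that example. How the list is sent to the server is deliberately left out.

```python
import json


def build_tuples(fetch_keys, fetcher="fsf", emitter="fse", id_prefix="task"):
    """Build one fetch/emit tuple per key, each with a distinct id.

    The id scheme (prefix plus running index) is an arbitrary choice for
    this sketch; any unique string would do.
    """
    tuples = []
    for index, key in enumerate(fetch_keys):
        tuples.append({
            "id": f"{id_prefix}-{index}",
            "fetcher": fetcher,
            "fetchKey": key,
            "emitter": emitter,
            # Mirror the example above: emit next to the source name.
            "emitKey": key + ".json",
        })
    return tuples


payload = build_tuples(["hello_world.pdf", "report.docx"])
# Map each id back to its file so log lines or reporter rows that carry
# the id can be traced to the original fetchKey.
by_id = {t["id"]: t["fetchKey"] for t in payload}
print(json.dumps(payload, indent=2))
```

Keeping the `by_id` lookup on the client side is what makes the id useful: per the replies above, the id shows up in the per-subprocess logs, so a log line can be matched back to its file.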
