Re: Tika server mode - /async task tracking

Georgi Nikolov Fri, 20 Oct 2023 03:47:45 -0700

Hello,

I have been struggling a bit with setting up reporters. I suspect there are
either code issues or I am misunderstanding something. Among other things, I
use S3 fetchers and emitters, but there seems to be no reporter
implementation for them (or at least I couldn't find any).


The following two configurations fail with different exceptions:
<async>
  <params>
    <forkedJvmArgs>
      <arg>{XMX}</arg>
      <arg>-Dlog4j.configurationFile={PIPE_PARAMS_LOG_FILE}</arg>
      <arg>-cp</arg>
    </forkedJvmArgs>
  </params>
  <pipesReporter class="org.apache.tika.pipes.CompositePipesReporter">
    <params>
      <pipesReporters class="org.apache.tika.pipes.PipesReporter">
        <pipesReporter class="org.apache.tika.pipes.reporters.fs.FileSystemStatusReporter">
          <params>
            <statusFile>{STATUS_FILE}</statusFile>
            <reportUpdateMillis>1000</reportUpdateMillis>
          </params>
        </pipesReporter>
      </pipesReporters>
    </params>
  </pipesReporter>
</async>

This fails with:
apache-tika-server  | ERROR [main] 15:46:37,536
org.apache.tika.server.core.TikaServerProcess Can't start:
apache-tika-server  | org.apache.tika.exception.TikaConfigException:
pipesReporters with class name org.apache.tika.pipes.PipesReporter must be
of type 'java.util.List'


Then, if I try the exact same configuration but under <pipes>, I get a
different failure (the log follows the config):

<pipes>
  <params>
    <forkedJvmArgs>
      <arg>{XMX}</arg>
      <arg>-Dlog4j.configurationFile={PIPE_PARAMS_LOG_FILE}</arg>
      <arg>-cp</arg>
    </forkedJvmArgs>
  </params>
  <pipesReporter class="org.apache.tika.pipes.CompositePipesReporter">
    <params>
      <pipesReporters class="org.apache.tika.pipes.PipesReporter">
        <pipesReporter class="org.apache.tika.pipes.reporters.fs.FileSystemStatusReporter">
          <!-- TODO make this a composite reporter and add emitter specific reporters -->
          <params>
            <statusFile>{STATUS_FILE}</statusFile>
            <reportUpdateMillis>1000</reportUpdateMillis>
          </params>
        </pipesReporter>
      </pipesReporters>
    </params>
  </pipesReporter>
</pipes>


apache-tika-server  | WARN  [main] 15:55:55,260
org.apache.tika.pipes.fetcher.fs.FileSystemFetcher 'basePath' has not been
set. This means that client code or clients can read from any file that
this process has permissions to read. If you are running tika-server, make
absolutely certain that you've locked down access to tika-server and
file-permissions for the tika-server process.
apache-tika-server  | ERROR [main] 15:55:55,314
org.apache.tika.server.core.TikaServerProcess Can't start:
apache-tika-server  | org.apache.tika.exception.TikaConfigException:
Couldn't find setter 'setPipesReporter' or adder 'addPipesReporter' for
pipesReporter of class: class org.apache.tika.pipes.PipesConfig


My next logical question: since I want to explicitly use S3 fetchers and
emitters, should I even try using pipes reporters with them if none are
implemented for S3? Would the default pipes reporters produce any valuable
output? Sorry if what I am saying makes no sense, but none of this is
documented.

In terms of functionality, I want to be able to detect errors when running
Tika in server mode from the application that calls the `/pipes/` endpoint,
which is configured to use fetchers and emitters. Presumably, if any failure
occurs, the status code would be non-200.
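
To make that concrete, here is a minimal sketch of the client-side check I
have in mind. It only assumes what is stated above: one fetch-emit tuple is
POSTed to /pipes, and a failure is presumed (not confirmed) to show up as a
non-200 status. The base URL, the fetcher/emitter names "s3f" and "s3e", and
all function names are placeholders of mine, not anything defined by Tika.

```python
import json
import urllib.error
import urllib.request

# Assumption: tika-server is reachable locally on its usual port.
TIKA_URL = "http://localhost:9998"


def build_tuple(fetcher, fetch_key, emitter, emit_key, task_id=None):
    """Build one fetch-emit tuple in the JSON shape quoted from the wiki."""
    payload = {
        "fetcher": fetcher,
        "fetchKey": fetch_key,
        "emitter": emitter,
        "emitKey": emit_key,
    }
    if task_id is not None:
        payload["id"] = task_id  # optional distinct task id
    return payload


def summarize_response(http_status, body_text):
    """Return (ok, detail): ok is True only for HTTP 200; detail is the
    body, decoded as JSON when possible and left as raw text otherwise."""
    try:
        detail = json.loads(body_text)
    except ValueError:
        detail = body_text
    return http_status == 200, detail


def post_to_pipes(payload, base_url=TIKA_URL, timeout=60):
    """POST one tuple to /pipes and return (ok, detail)."""
    request = urllib.request.Request(
        base_url + "/pipes",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return summarize_response(response.status, body)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the error still carries status and body.
        return summarize_response(err.code, err.read().decode("utf-8", "replace"))
```

Usage would be something like
post_to_pipes(build_tuple("s3f", "docs/a.pdf", "s3e", "docs/a.pdf.json")),
with the caller logging or retrying whenever ok is False. Whether the
response body also carries a more detailed status is something I have not
verified, so the sketch just hands the body back for inspection.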

I would be more than happy to contribute, but I am time-limited at the
moment. It would be fun to have a go at Java after so many years. Once I get
what I am currently working on out of the way, I will have more time to
address the issues in this email.

On Thu, Sep 28, 2023 at 12:07 PM Georgi Nikolov <[email protected]>
wrote:

> Hello,
>
> I feel I have been stuck on this for too long now, so I decided to try to
> get some information here.
>
> I need Tika to extract content and metadata from thousands of files in
> S3. What I wanted to do is have Tika running as a standalone server and
> use S3 fetchers and emitters in conjunction with /async.
>
> However, I am having difficulty tracking what is going on on the server
> side. My client code is in Python, using `python-tika`.
>
> A payload is built programmatically and sent to the /async endpoint, but I
> need to be able to track either the whole async task or the individual
> tuples in it: have they failed, succeeded, or are they still running? I am
> struggling to achieve that and could not find any information on whether
> it is possible. I came across some information saying it can be achieved
> by checking `/tika/async/<task_id>`, and that when you send a PUT the
> response should contain an X-Tika-id header, but neither of these seems to
> work. Additionally, from Confluence:
>
>> As default the fetchKey is used as the id for logging.  However, if users
>> need a distinct task id for the request, they may add an id element:
>>
>> {
>>     "id": "myTaskId",
>>     "fetcher": "fsf",
>>     "fetchKey": "hello_world.pdf",
>>     "emitter": "fse",
>>     "emitKey": "hello_world.pdf.json"
>> }
>>
>>
>>
> Is there a way to track the task id when running from /async? From
> everything I have seen so far it looks like there is, but I can't figure
> out how to actually achieve it. If I try a `GET` on /async/<task_id>,
> nothing happens (no such resource); if I try /tika/async/<task_id>, I get
> a 405.
>
> I have tried using /pipes, which captures errors and is handy, but what
> about /async? As long as the payload is valid, /async doesn't seem to
> return any errors no matter what actually happens in Tika; e.g. with
> processing errors or bad AWS credentials, everything just gets skipped.
>
> Any pointers in the right direction will be welcome.
>
> Thanks,
> Georgi
>
