I used a recent tika.jar on the Windows 10
command line to extract text from some 30 PDF files,
with a makefile converting one file per command.
That worked well, but it took some time,
and the approach will probably not scale
to 300 or 1000 PDFs.

The tika.jar is larger than 54 MB, and I suspect that
loading such a big jar (under Windows) for every file
is hurting the performance. I should perhaps move to Linux,
or try the Tika server.

Performance is not the only reason, though: I would like
to have a tiny Tika subset without autodetection,
converting (for example) only PDF or EPUB to XML or text,
which should be much smaller and therefore faster to load (?).

I imagine various uses of the Tika converters,
but always in a very specialized context.

I'm a new user, and I have not built Tika myself so far.
Is it possible to build such subset versions easily,
and if so, how would you advise me to proceed?

Regards - Georg
