I used a recent tika.jar on the Windows 10 command line to extract text from about 30 PDF files, with a makefile that converts one file per command. This worked well, but it took a while, and the approach will probably not scale to 300 or 1000 PDFs.
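One alternative I can imagine to starting a new Java process for every file is a single process that loops over all PDFs in a directory. Below is a rough sketch of that idea, using only the Java standard library. The names `BatchExtract`, `Extractor` and `convertAll` are made up for illustration, and the actual text extraction is left as a hypothetical hook; I have not wired Tika into it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BatchExtract {

    /** Hypothetical hook: the real PDF-to-text call would be plugged in here. */
    public interface Extractor {
        String extract(Path pdf) throws IOException;
    }

    /**
     * Converts every *.pdf file in inDir to a .txt file in outDir,
     * all within one JVM, and returns the number of files converted.
     */
    public static int convertAll(Path inDir, Path outDir, Extractor extractor)
            throws IOException {
        Files.createDirectories(outDir);
        int count = 0;
        try (DirectoryStream<Path> pdfs = Files.newDirectoryStream(inDir, "*.pdf")) {
            for (Path pdf : pdfs) {
                String name = pdf.getFileName().toString();
                // Strip the 4-character ".pdf" suffix and append ".txt".
                String base = name.substring(0, name.length() - ".pdf".length());
                byte[] text = extractor.extract(pdf).getBytes(StandardCharsets.UTF_8);
                Files.write(outDir.resolve(base + ".txt"), text);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java BatchExtract <inDir> <outDir>");
            return;
        }
        // Placeholder extractor: writes only the file name, not real PDF text.
        Extractor placeholder = pdf -> "placeholder for " + pdf.getFileName();
        int n = convertAll(Paths.get(args[0]), Paths.get(args[1]), placeholder);
        System.out.println(n + " file(s) converted");
    }
}
```

With a real extractor in place of the placeholder, the cost of starting Java and loading the jar would be paid once per run instead of once per PDF.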
The tika.jar is larger than 54 MB, and I suspect that loading such a big jar (under Windows) for every file is hurting performance. Perhaps I should move to Linux or try the Tika server. But performance is not my only reason: I would like to have a tiny Tika subset without auto-detection that converts (for example) only PDF or EPUB to XML or text, which should be much smaller and therefore faster to load (?). I can imagine various uses of the Tika converters, but always in a very specialized context.

I am a new user, and I have not built Tika myself so far. Is it possible to build such subset versions easily, and if so, how would you advise me to proceed?

Regards - Georg
