On Thu, 5 Jan 2023, Georg.Fischer wrote:
> The tika.jar has >54 MB, and I suspect that the loading of the big jar (under Windows) is hindering the performance. I should perhaps move to Linux, or try the Tika server.
The Tika App jar has always been the "kitchen sink included quickstart" option.
The Tika Java library and the Tika Server both support including or excluding groups of file format parsers, so you are not forced to load everything.
> I used a recent tika.jar on the Windows 10 commandline to extract text from some 30 PDF files, with a makefile converting one file per command. That was quite successful, but it took some time, and the approach will perhaps not be appropriate for 300 or 1000 PDFs.
For a folder of files, you might be better off with Tika Batch, which is aimed at processing a large number of files. It can respawn failed child processes, doesn't require starting a new JVM for every file, etc.
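As a rough sketch of what that could look like from a script: the Tika App jar switches to batch mode when given an input directory with -i and an output directory with -o. The jar location and the two directory names below are made up for illustration, and the helper names are mine rather than anything Tika provides.

```python
import subprocess


def batch_command(jar, input_dir, output_dir):
    """Build the command line for one Tika App batch run over a whole folder."""
    # -i / -o select batch mode: every file under input_dir is processed
    # and the results are written under output_dir.
    return ["java", "-jar", str(jar), "-i", str(input_dir), "-o", str(output_dir)]


def run_batch(jar, input_dir, output_dir):
    """Start the batch run and return the exit code of the java process."""
    return subprocess.run(batch_command(jar, input_dir, output_dir)).returncode


if __name__ == "__main__":
    # Hypothetical paths, adjust to your own layout.
    print(run_batch("tika-app.jar", "pdfs", "extracted"))
```

That is one JVM start for the whole folder, instead of one per PDF as with the makefile approach.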
Otherwise, the Tika Server is a good option. If you're doing everything locally, start it with "-enableUnsecureFeatures -enableFileUrl" and then you can pass it a file path to process (but don't do that on a publicly accessible machine!)
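A minimal client sketch for that setup, with some assumptions on my side: the server listens on its default http://localhost:9998, it was started with the two flags above, and it accepts the file location in a "fileUrl" request header on the /tika endpoint (that is how I remember the 1.x server behaving, so check it against your version). The function names are only illustrative.

```python
import sys
import urllib.request
from pathlib import Path

TIKA_URL = "http://localhost:9998/tika"  # assumed default server address


def build_request(pdf_path, server_url=TIKA_URL):
    """Build a PUT request that asks the server to read a local file itself."""
    # The server wants a URL, so turn the path into an absolute file:// URI.
    file_url = Path(pdf_path).resolve().as_uri()
    return urllib.request.Request(
        server_url,
        method="PUT",
        # "fileUrl" is assumed to be the header the server looks for.
        headers={"fileUrl": file_url, "Accept": "text/plain"},
    )


def extract_text(pdf_path, server_url=TIKA_URL):
    """Send the request and return the extracted plain text."""
    with urllib.request.urlopen(build_request(pdf_path, server_url)) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(extract_text(name))
```

Since only the path goes over the wire, the PDFs are never uploaded, and the server JVM stays warm between files.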
Nick
