Hi Nick and Georg,

On Thu, Jan 5, 2023 at 9:34 AM Nick Burch <[email protected]> wrote:
> On Thu, 5 Jan 2023, Georg.Fischer wrote:
> > The tika.jar has >54 MB, and I suspect that the loading of the big jar
> > (under Windows) is hindering the performance. I should perhaps move to
> > Linux, or try the Tika server.
>
> The Tika App jar has always been the "kitchen sink included quickstart"
> option
>
> The Tika java library, and the Tika Server both support including or
> excluding groups of file format parsers
>
> > I used a recent tika.jar on the Windows 10 commandline to extract text
> > from some 30 PDF files, with a makefile converting one file per command.
> > That was quite successful, but it took some time, and the approach will
> > perhaps not be appropriate for 300 or 1000 PDFs.
>
> For a folder of files, you might be better off with Tika Batch, which is
> aimed at batch processing a large number of files. It can respawn failed
> child processes, doesn't require starting a JVM every file etc
>
> Otherwise, the Tika Server is a good option. If you're doing everything
> locally, turn on "-enableUnsecureFeatures -enableFileUrl" and then you can
> pass it a file path to process (but not on a publicly available
> machine!)
>

Now that's a neat trick - I was just going to suggest the Server but those
switches are definitely something to add to my notes. Also, thanks for
suggesting Tika Batch - I didn't know about that either.

> Nick
>

Best,
Bridger
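
P.S. For anyone finding this thread later, here is a minimal sketch of the
Tika Server route for a folder of PDFs, so that only one JVM is started for
the whole run. It is not the file-path approach Nick describes; it simply
uploads each file's bytes. Assumptions: a Tika Server is already running
locally on its default port 9998, and it accepts a PUT to /tika with
"Accept: text/plain". The names build_request and extract_folder are made
up for this example.

```python
import urllib.request
from pathlib import Path

# Assumption: Tika Server is already running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_request(pdf_path, tika_url=TIKA_URL):
    """Build a PUT request that uploads one file and asks for plain text back."""
    data = Path(pdf_path).read_bytes()
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def extract_folder(folder, tika_url=TIKA_URL):
    """Send every *.pdf in `folder` to the server and save <name>.txt beside it."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        with urllib.request.urlopen(build_request(pdf, tika_url)) as resp:
            pdf.with_suffix(".txt").write_bytes(resp.read())
```

Calling extract_folder("C:/some/folder") (a hypothetical path) would then
process the PDFs one after another against the same running server. A
failed request raises an exception and stops the loop, so for 300 or 1000
files some error handling around urlopen would probably be worth adding.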
