Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-26 Thread Luís Filipe Nassif
Yes, tika-server is the long-term choice, as discussed in a recent thread on the user list. I hope I will have time in the future to migrate to it and get rid of the jar hell problems for good... On Thu, Nov 26, 2020 at 14:32, Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote: > I created a

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-26 Thread Nicholas DiPiazza
I created a Tika ForkParser example that I also want to add to the documentation: https://github.com/nddipiazza/tika-fork-parser-example Once your fixes are submitted, we should update this example to use multi-threading. On Thu, Nov 26, 2020 at 11:28 AM Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote:

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-26 Thread Nicholas DiPiazza
Hey Luis, It is related, because after your fixes I might be able to gain a significant performance advantage by switching to ForkParser. I would make great use of an example from someone else who has set up a multi-threaded ForkParser processing program that can gracefully handle the huge

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-25 Thread Luís Filipe Nassif
Not what you asked, but related :) Luis On Wed, Nov 25, 2020 at 23:20, Luís Filipe Nassif wrote: > I've made a few improvements to ForkParser performance in an internal > fork. Will try to contribute upstream... > > On Mon, Nov 23, 2020 at 12:05, Nicholas DiPiazza < >

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-25 Thread Luís Filipe Nassif
I've made a few improvements to ForkParser performance in an internal fork. Will try to contribute upstream... On Mon, Nov 23, 2020 at 12:05, Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote: > I am attempting to Tika parse tens of millions of office documents. PDFs, > docs,

How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-23 Thread Nicholas DiPiazza
I am attempting to parse tens of millions of office documents with Tika: PDFs, docs, Excel files, XML files, etc. A wide assortment of types. Throughput is very important. I need to be able to parse these files in a reasonable amount of time, but at the same time, accuracy is also pretty important. I hope to have