Re: About text extraction for index

2019-08-23 Thread Vikas Saurabh
> but I am having a problem: the thread that processes the PDF file keeps running, creating images and performing OCR. Is this supposed to happen? TL;DR: yes, because there is no safe way to kill a thread. Yes, that's supposed to happen. The reason this feature was implemented is that in most
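
The behaviour described in this reply (the caller times out, but the worker thread keeps going) can be reproduced in plain Java without Oak. This is only a minimal sketch, assuming the work is submitted to an executor and awaited with a timeout; the class and method names are hypothetical and are not taken from Oak's code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class TimeoutDemo {

    // Hypothetical stand-in for a long extraction that never checks for interruption.
    static boolean workerOutlivesTimeout() throws Exception {
        AtomicBoolean finished = new AtomicBoolean(false);
        ExecutorService pool = Executors.newSingleThreadExecutor();

        Future<?> task = pool.submit(() -> {
            long end = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(500);
            while (System.nanoTime() < end) {
                // Busy work; the interrupt flag is never inspected here.
            }
            finished.set(true);
        });

        boolean timedOut = false;
        boolean unfinishedAtTimeout = false;
        try {
            task.get(100, TimeUnit.MILLISECONDS); // the caller gives up after 100 ms
        } catch (TimeoutException e) {
            timedOut = true;
            unfinishedAtTimeout = !finished.get();
            task.cancel(true); // only requests interruption; it cannot force a stop
        }

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);

        // True when the caller timed out, yet the worker still ran to completion.
        return timedOut && unfinishedAtTimeout && finished.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("worker outlived the timeout: " + workerOutlivesTimeout());
    }
}
```

In this sketch `cancel(true)` merely sets the interrupt flag on the worker; code that does not cooperate with interruption carries on until it finishes by itself.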

Re: About text extraction for index

2019-08-23 Thread jorgeeflorez .
Hi, I increased the maximum time for the text extraction (I set it to 300) and tested it using a PDF file with many pages. I get the timeout in the log at the expected time: 2019-08-23 09:02:38,380 DEBUG [org.apache.jackrabbit.oak.plugins.index.search.spi.binary.FulltextBinaryTextExtractor]

Re: About text extraction for index

2019-08-23 Thread jorgeeflorez .
Hi Vikas, thank you for your reply. I will try to change those parameters and see what happens. To answer one of my questions, I found that text is extracted from PDF files only if I add application/pdf to DefaultParser in the index Tika config file. Regards. Jorge Flórez On Thu, 22 Aug 2019 at

Re: About text extraction for index

2019-08-22 Thread Vikas Saurabh
Hi, > Is it possible to change the maximum time for that text extraction You should be able to configure the timeout by setting -Doak.extraction.timeoutSeconds=120 [0] on the JVM command line. Alternatively, you could also disable running in a separate thread by setting
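
For readers unfamiliar with -D flags: a value passed this way becomes a JVM system property that code can read at runtime. The following is a minimal sketch of that mechanism, not Oak's actual implementation; the property name is the one quoted above, while the class name and the fallback value of 60 are hypothetical choices made only for illustration.

```java
public class ExtractionTimeout {

    // Property name as quoted in the reply above.
    static final String PROP = "oak.extraction.timeoutSeconds";

    // Hypothetical fallback for this sketch; not necessarily Oak's real default.
    static final int FALLBACK_SECONDS = 60;

    // Returns the value given via -Doak.extraction.timeoutSeconds=..., or the fallback.
    static int timeoutSeconds() {
        return Integer.getInteger(PROP, FALLBACK_SECONDS);
    }

    public static void main(String[] args) {
        System.out.println("extraction timeout (s): " + timeoutSeconds());
    }
}
```

Running this class with `java -Doak.extraction.timeoutSeconds=120 ExtractionTimeout` would make `timeoutSeconds()` return 120; without the flag it returns the fallback.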

About text extraction for index

2019-08-22 Thread jorgeeflorez .
Hi all, I have a question regarding text extraction when nt:file nodes are indexed (I am using Oak 1.12.0 and tika-parsers 1.20). Is the text contained in a PDF file I attach to a file node extracted and included in the index by default (when using the default Tika config)? Or should I explicitly