BUILD FAILURE: Jackrabbit Oak - Build # 2341 - Still Failing
The Apache Jenkins build system has built Jackrabbit Oak (build #2341)

Status: Still Failing

Check console output at https://builds.apache.org/job/Jackrabbit%20Oak/2341/ to view the results.

Changes:
[mattryan] OAK-8298: Add tracking of blob ids added via direct upload

Test results:
All tests passed
Intent to backport OAK-8298 to 1.10
Hi,

I propose to backport the fix for OAK-8298 to the 1.10 branch. This is a bug fix for direct binary access which ensures that binaries added via direct upload are also tracked by the blob id tracker. The fix is low risk in my view.

-MR
Re: About text extraction for index
> but I am having a problem: the thread that processes the pdf file keeps running, creating images and performing OCR. Is this supposed to happen?

TL;DR: yes, because there is no safe way to kill a thread.

Yes, that's supposed to happen. This feature was implemented because in most cases text extraction should finish within a reasonable time. At times, though, a bad file or a bug in the parser keeps the extraction process running, and that used to hold up indexing for the whole setup.

The assumption with a timed-out extraction is that Tika, or whichever parser is in play, might be stuck. Thread.stop could leave things in an incorrect state and potentially affect subsequent operations, so the extraction thread is left running.

-Vikas
(sent from mobile)
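The behaviour described in this reply can be reproduced with plain java.util.concurrent, independent of Oak. The following is a minimal sketch, not Oak's actual code: the class name TimedExtractionSketch, the helper timedOut and the sleeping task are hypothetical stand-ins for the extractor and a slow parser. It shows that a caller can stop waiting after a timeout while the worker thread keeps running to completion.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; this is not how Oak structures its extractor.
class TimedExtractionSketch {

    /**
     * Submits the task and waits at most timeoutMillis for it.
     * Returns true if the wait timed out. On timeout the task is
     * deliberately neither cancelled nor stopped, so it keeps running.
     */
    static boolean timedOut(ExecutorService executor, Runnable task, long timeoutMillis) {
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            // The caller gives up waiting; the worker thread is left alone.
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        AtomicBoolean finished = new AtomicBoolean(false);

        // Stand-in for a slow parser run, e.g. OCR on a large PDF.
        Runnable slowExtraction = () -> {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            finished.set(true);
        };

        boolean timeout = timedOut(executor, slowExtraction, 100);
        System.out.println("caller timed out: " + timeout);
        System.out.println("worker finished at that point: " + finished.get());

        executor.shutdown(); // no new tasks; the running one is allowed to complete
        executor.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("worker finished later: " + finished.get());
    }
}
```

Calling future.cancel(true) on timeout would only deliver an interrupt, which a task has to cooperate with; a parser stuck in a loop that never checks the interrupt flag would keep running anyway, which is consistent with the reasoning above.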
Re: About text extraction for index
Hi,

I increased the maximum time for text extraction (I set it to 300 seconds) and tested it using a pdf file with many pages. The timeout shows up in the log at the expected time:

2019-08-23 09:02:38,380 DEBUG [org.apache.jackrabbit.oak.plugins.index.search.spi.binary.FulltextBinaryTextExtractor] (async-index-update-async) Extracting /repo1/Carpeta1/File1/jcr:content@jcr:data, 4332681 bytes
2019-08-23 09:07:38,389 WARN [org.apache.jackrabbit.oak.plugins.index.search.spi.binary.FulltextBinaryTextExtractor] (async-index-update-async) [/oak:index/LuceneFullText] Failed to extract text from a binary property due to timeout: /repo1/Carpeta1/File1/jcr:content@jcr:data.

But I am having a problem: the thread that processes the pdf file keeps running, creating images and performing OCR. Is this supposed to happen? Should I check for something in that thread? (BTW, my application server is WildFly 10; I don't know if that matters.)

I will try again with oak.extraction.inCallerThread=true to see what happens.

Regards,
Jorge Flórez

On Fri, Aug 23, 2019 at 7:13, jorgeeflorez . (<jorgeeduardoflo...@gmail.com>) wrote:

> Hi Vikas,
>
> thank you for your reply. I will try to change those parameters and see
> what happens.
> To answer one of my questions, I found that text is extracted from pdf
> files only if I add application/pdf to DefaultParser in the index
> Tika config file.
>
> Regards.
> Jorge Flórez
>
> On Thu, Aug 22, 2019 at 12:43, Vikas Saurabh wrote:
>
>> Hi,
>>
>> > Is it possible to change the maximum time for that text extraction
>>
>> You should be able to configure the timeout by setting
>> -Doak.extraction.timeoutSeconds=120
>> [0] on the JVM command line.
>>
>> Alternatively, you could also disable running in a different thread by
>> setting -Doak.extraction.inCallerThread=true
>>
>> Hope that helps.
>>
>> [0]:
>> http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/ExtractedTextCache.java?view=markup=1814745#l61
>>
>> --Vikas
>> (sent from mobile)
Re: About text extraction for index
Hi Vikas,

thank you for your reply. I will try to change those parameters and see what happens.
To answer one of my questions, I found that text is extracted from pdf files only if I add application/pdf to DefaultParser in the index Tika config file.

Regards.
Jorge Flórez

On Thu, Aug 22, 2019 at 12:43, Vikas Saurabh wrote:

> Hi,
>
> > Is it possible to change the maximum time for that text extraction
>
> You should be able to configure the timeout by setting
> -Doak.extraction.timeoutSeconds=120
> [0] on the JVM command line.
>
> Alternatively, you could also disable running in a different thread by
> setting -Doak.extraction.inCallerThread=true
>
> Hope that helps.
>
> [0]:
> http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/ExtractedTextCache.java?view=markup=1814745#l61
>
> --Vikas
> (sent from mobile)
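For readers unfamiliar with -D flags: they become JVM system properties that code can read at runtime. The sketch below shows the standard JDK way to read the two property names quoted in this thread. The class name ExtractionSettingsSketch and the fallback of 60 seconds are assumptions made for illustration; the actual default and lookup code live in the ExtractedTextCache source linked at [0] and should be checked there.

```java
// Hypothetical illustration only; see ExtractedTextCache for Oak's real handling.
class ExtractionSettingsSketch {

    // Property names as quoted in the thread.
    static final String TIMEOUT_PROPERTY = "oak.extraction.timeoutSeconds";
    static final String CALLER_THREAD_PROPERTY = "oak.extraction.inCallerThread";

    // Assumed fallback used only by this sketch, not confirmed by the thread.
    static final int ASSUMED_DEFAULT_TIMEOUT_SECONDS = 60;

    /** Reads the timeout from the system property, or falls back to the assumed default. */
    static int timeoutSeconds() {
        return Integer.getInteger(TIMEOUT_PROPERTY, ASSUMED_DEFAULT_TIMEOUT_SECONDS);
    }

    /** True only if the property is set to "true" (case-insensitive). */
    static boolean inCallerThread() {
        return Boolean.getBoolean(CALLER_THREAD_PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println(TIMEOUT_PROPERTY + " = " + timeoutSeconds());
        System.out.println(CALLER_THREAD_PROPERTY + " = " + inCallerThread());
    }
}
```

Running it as `java -Doak.extraction.timeoutSeconds=120 ExtractionSettingsSketch` is a quick way to confirm that a flag actually reaches the JVM, which can be worth checking when an application server such as WildFly assembles the Java command line from its own configuration.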
Re: Oak 1.8.16 release plan
On 23.08.2019 06:55, Nitin Gupta wrote:
> Hello Team,
>
> I am planning to cut 1.8.16 for Oak on Monday (26th Aug) or Tuesday (27th Aug), depending on my availability.
>
> This is the only issue in progress for 1.8.16 as of now - https://issues.apache.org/jira/browse/OAK-8560.
> ...

Yes. We either need to wait for jackson-databind 2.10.0, or at a minimum use the latest patch release of -databind (2.9.9.3).

Best regards, Julian