BUILD FAILURE: Jackrabbit Oak - Build # 2341 - Still Failing

2019-08-23 Thread Apache Jenkins Server
The Apache Jenkins build system has built Jackrabbit Oak (build #2341)

Status: Still Failing

Check console output at https://builds.apache.org/job/Jackrabbit%20Oak/2341/ to 
view the results.

Changes:
[mattryan] OAK-8298: Add tracking of blob ids added via direct upload

 

Test results:
All tests passed


Intent to backport OAK-8298 to 1.10

2019-08-23 Thread Matt Ryan
Hi,

I propose to backport the fix to OAK-8298 to 1.10.  This is a bug fix for
direct binary access to ensure that binaries added via direct upload are
also tracked via the blob id tracker.
The fix is low risk in my view.


-MR


Re: About text extraction for index

2019-08-23 Thread Vikas Saurabh
>  but I am having a problem: the thread that processes the pdf file keeps
running, creating images and performing OCR. Is this supposed to happen?

TL;DR: yes, because there is no safe way to kill a thread

Yes, that's supposed to happen. This feature was implemented because in
most cases text extraction should finish within a reasonable time. But, at
times, due to a bad file or a bug in the parser, the extraction process
keeps on running, and that used to hold up indexing for the whole setup.
With a timed-out extraction, the assumption is that Tika or whichever
parser is in play might be stuck. Thread.stop could leave things in an
incorrect state, potentially affecting subsequent operations, so the stuck
thread is left running while indexing moves on without it.
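
To illustrate the point about timeouts and threads, here is a minimal,
self-contained sketch in plain java.util.concurrent. It is not Oak's actual
code: the class name, the stuckExtraction helper and the timings are all
hypothetical. It only shows that when a caller gives up on a Future after a
timeout, the worker keeps running unless the task itself cooperates with
interruption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class ExtractionTimeoutSketch {

    // Hypothetical stand-in for a parser that is busy and never checks
    // the thread's interrupt flag.
    static String stuckExtraction(AtomicBoolean running, long workMillis) {
        running.set(true);
        long end = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(workMillis);
        while (System.nanoTime() < end) {
            // busy work that ignores interruption
        }
        running.set(false);
        return "extracted text";
    }

    // Returns true if the worker is still running shortly after the
    // caller has timed out and asked for cancellation.
    static boolean workerOutlivesTimeout(long timeoutMillis, long workMillis)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true); // do not keep the JVM alive for the demo
            return t;
        });
        AtomicBoolean running = new AtomicBoolean(false);
        Callable<String> task = () -> stuckExtraction(running, workMillis);
        Future<String> future = pool.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false; // finished in time, nothing left running
        } catch (TimeoutException e) {
            // cancel(true) only requests an interrupt; it cannot force
            // the task to stop.
            future.cancel(true);
            Thread.sleep(50);
            return running.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("worker still running after timeout: "
                + workerOutlivesTimeout(200, 3000));
    }
}
```

The caller gets control back at the timeout, which is what keeps indexing
from being held up; the cost is that the abandoned task continues to use
CPU until it finishes on its own.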

-Vikas
(sent from mobile)


Re: About text extraction for index

2019-08-23 Thread jorgeeflorez .
Hi,

I increased the maximum time (I set 300) for the text extraction and tested
it using a pdf file with many pages. I get the timeout in the log in the
expected time:
2019-08-23 09:02:38,380 DEBUG [org.apache.jackrabbit.oak.plugins.index.search.spi.binary.FulltextBinaryTextExtractor] (async-index-update-async) Extracting /repo1/Carpeta1/File1/jcr:content@jcr:data, 4332681 bytes
2019-08-23 09:07:38,389 WARN [org.apache.jackrabbit.oak.plugins.index.search.spi.binary.FulltextBinaryTextExtractor] (async-index-update-async) [/oak:index/LuceneFullText] Failed to extract text from a binary property due to timeout: /repo1/Carpeta1/File1/jcr:content@jcr:data.

but I am having a problem: the thread that processes the pdf file keeps
running, creating images and performing OCR. Is this supposed to happen?
Should I check for something in that thread? (BTW, my application server is
WildFly 10; I don't know if that matters.)

I will try again with oak.extraction.inCallerThread=true to see what
happens.

Regards,

Jorge Flórez

On Fri, 23 Aug 2019 at 7:13, jorgeeflorez . (<
jorgeeduardoflo...@gmail.com>) wrote:

> Hi Vikas,
>
> thank you for your reply. I will try to change those parameters and see
> what happens.
> To answer one of my questions, I found that text is extracted only from
> pdf if I add application/pdf to DefaultParser in the index
> Tika config file.
>
> Regards.
> Jorge Flórez
>
>
> On Thu, 22 Aug 2019 at 12:43, Vikas Saurabh wrote:
>
>> Hi,
>>
>> > Is it possible to change the maximum time for that text extraction
>>
>> You should be able to configure the timeout by setting
>> -Doak.extraction.timeoutSeconds=120
>> [0] on the JVM command line.
>>
>> Alternatively, you could also disable running in a different thread by
>> setting -Doak.extraction.inCallerThread=true
>>
>> Hope that helps.
>>
>> [0]:
>>
>> http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/ExtractedTextCache.java?view=markup=1814745#l61
>>
>> --Vikas
>> (sent from mobile)
>>
>


Re: About text extraction for index

2019-08-23 Thread jorgeeflorez .
Hi Vikas,

thank you for your reply. I will try to change those parameters and see
what happens.
To answer one of my questions, I found that text is extracted only from pdf
if I add application/pdf to DefaultParser in the index Tika
config file.

Regards.
Jorge Flórez


On Thu, 22 Aug 2019 at 12:43, Vikas Saurabh wrote:

> Hi,
>
> > Is it possible to change the maximum time for that text extraction
>
> You should be able to configure the timeout by setting
> -Doak.extraction.timeoutSeconds=120
> [0] on the JVM command line.
>
> Alternatively, you could also disable running in a different thread by
> setting -Doak.extraction.inCallerThread=true
>
> Hope that helps.
>
> [0]:
>
> http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/ExtractedTextCache.java?view=markup=1814745#l61
>
> --Vikas
> (sent from mobile)
>
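
For readers unfamiliar with -D flags: they set JVM system properties, which
Java code can read at runtime. The sketch below shows one plausible way the
two properties quoted above could be read. It is an illustration only, not
a copy of ExtractedTextCache; the class name, the method names and the
fallback value of 60 seconds are assumptions.

```java
public class ExtractionSettingsSketch {

    // Assumed fallback for illustration; check the Oak source linked
    // above for the real default.
    static final int ASSUMED_DEFAULT_TIMEOUT_SECONDS = 60;

    // Reads -Doak.extraction.timeoutSeconds=<n>, falling back to the
    // assumed default when the property is absent or not a number.
    static int timeoutSeconds() {
        return Integer.getInteger("oak.extraction.timeoutSeconds",
                ASSUMED_DEFAULT_TIMEOUT_SECONDS);
    }

    // Reads -Doak.extraction.inCallerThread=true; any other value, or
    // no value at all, yields false.
    static boolean inCallerThread() {
        return Boolean.getBoolean("oak.extraction.inCallerThread");
    }

    public static void main(String[] args) {
        System.out.println("timeoutSeconds = " + timeoutSeconds());
        System.out.println("inCallerThread = " + inCallerThread());
    }
}
```

Running it as `java -Doak.extraction.timeoutSeconds=300 ExtractionSettingsSketch`
would report the value passed on the command line instead of the fallback.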


Re: Oak 1.8.16 release plan

2019-08-23 Thread Julian Reschke

On 23.08.2019 06:55, Nitin Gupta wrote:

> Hello Team,
>
> I am planning to cut 1.8.16 for Oak on Monday (26th Aug) or Tuesday (27th
> Aug), depending on my availability.
>
> This is the only issue in progress for 1.8.16 as of now:
> https://issues.apache.org/jira/browse/OAK-8560
> ...


Yes. We either need to wait for jackson databind 2.10.0, or minimally
use the latest patch release of -databind (2.9.9.3).

Best regards, Julian