RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
Not sure what to do with this one.

The triggering document has a run of ~50 [...] starts and then ~50+ [...]
starts.  So, yes, Tika limits nested elements to 100.

Tika's DefaultHtmlMapper only passes through a few handfuls of elements 
(SAFE_ELEMENTS), not including [...] or [...].

Solr's MostlyPassThroughHtmlMapper passes through, well, mostly everything.
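
To make the interaction concrete, here is a rough sketch in Python (not Tika's or Solr's actual code) of a handler that counts nesting depth only for the elements a mapper lets through. The names NestingGuard, check and the element names foo/bar are placeholders invented for this illustration, and the small allow-list is not Tika's real SAFE_ELEMENTS. Only the limit of 100 comes from the discussion.

```python
from html.parser import HTMLParser

MAX_DEPTH = 100  # the nesting limit mentioned in this thread


class DepthLimitExceeded(Exception):
    """Raised when element nesting goes past the configured limit."""


class NestingGuard(HTMLParser):
    """Tracks nesting depth of the elements that are passed through.

    passthrough=None counts every element (a pass-through style mapper);
    a set counts only the listed element names (an allow-list style mapper).
    Simplification: void elements such as <br> are not special-cased.
    """

    def __init__(self, passthrough=None, max_depth=MAX_DEPTH):
        super().__init__()
        self.passthrough = passthrough
        self.max_depth = max_depth
        self.depth = 0

    def _counts(self, tag):
        return self.passthrough is None or tag in self.passthrough

    def handle_starttag(self, tag, attrs):
        if not self._counts(tag):
            return
        self.depth += 1
        if self.depth > self.max_depth:
            raise DepthLimitExceeded(
                f"nesting depth {self.depth} exceeds {self.max_depth}")

    def handle_endtag(self, tag):
        if self._counts(tag) and self.depth > 0:
            self.depth -= 1


def check(html, passthrough=None):
    """Return True if the document stays within the nesting limit."""
    guard = NestingGuard(passthrough)
    try:
        guard.feed(html)
        guard.close()
    except DepthLimitExceeded:
        return False
    return True


# Placeholder element names: a long run of unclosed start tags.
doc = "<p>" + "<foo>" * 50 + "<bar>" * 55 + "text"
print(check(doc))                     # False: every element is counted
print(check(doc, passthrough={"p"}))  # True: only the allow-list is counted
```

With everything passed through, the unclosed runs add up to more than 100 open elements and the guard trips. With a small allow-list the same document never gets past depth 1.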


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, September 22, 2016 12:47 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Disabling Zip bomb detection in Tika

So far a Tika JIRA seems like the right thing. Tim is "a well known entity"
in Solr though so I'm sure he'll move it over to Solr if appropriate.

Erick

On Thu, Sep 22, 2016 at 9:43 AM, Rodrigo Rosenfeld Rosas 
<rr_ro...@yahoo.com.br.invalid> wrote:
> Here it is. Not sure if it's clear enough though:
>
> https://issues.apache.org/jira/browse/TIKA-2091
>
> Or should I have created the ticket in the Solr project instead?
>
>
> On 22-09-2016 13:32, Rodrigo Rosenfeld Rosas wrote:
>>
>> This is one of the documents:
>>
>>
>> https://www.sec.gov/Archives/edgar/data/1472033/000119380513001310/e611133_f6ef-eutelsat.htm
>>
>> I'll try to create a ticket for this on Jira if I find its location 
>> but feel free to open it yourself if you prefer, just let me know.
>>
>> On 22-09-2016 12:33, Allison, Timothy B. wrote:
>>>>
>>>> I'll try to get a sample HTML yielding to this problem and attach 
>>>> it to Jira.
>>>
>>> Great!  Tika 1.14 is around the corner...if this is an easy fix ... :)
>>>
>>> Thank you.
>>>
>>
>>
>


Re: Disabling Zip bomb detection in Tika

2016-09-22 Thread Erick Erickson
So far a Tika JIRA seems like the right thing. Tim is "a well known entity"
in Solr though so I'm sure he'll move it over to Solr if appropriate.

Erick



Re: Disabling Zip bomb detection in Tika

2016-09-22 Thread Rodrigo Rosenfeld Rosas

Here it is. Not sure if it's clear enough though:

https://issues.apache.org/jira/browse/TIKA-2091

Or should I have created the ticket in the Solr project instead?


RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
Tika might be overkill for you (no one can hear us, right?).  


One thing that Tika buys you is fairly smart encoding detection for HTML pages.
Looks like Nokogiri does do some kind of encoding detection, but it may only
read the meta-headers.  I haven't used Nokogiri, but if you're happy with the
results of that, go for it.
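
For illustration only, this is roughly what "only read the meta-headers" would amount to, sketched in Python. It is not Nokogiri's or Tika's algorithm; the function names, the 4096-byte window and the UTF-8 fallback are assumptions made for the example. A detector of this kind has nothing to go on when the meta tag is missing or wrong, which is where smarter detection helps.

```python
import re

# Matches both <meta charset="..."> and the http-equiv Content-Type form.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def sniff_meta_charset(raw, window=4096):
    """Return the charset declared in a <meta> tag near the top, or None."""
    match = _META_CHARSET.search(raw[:window])
    return match.group(1).decode("ascii").lower() if match else None


def decode_html(raw, fallback="utf-8"):
    """Decode bytes using the declared charset, else a fallback (assumed)."""
    charset = sniff_meta_charset(raw) or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back rather than fail.
        return raw.decode(fallback, errors="replace")


page = b'<html><head><meta charset="ISO-8859-1"></head><body>caf\xe9</body></html>'
print(sniff_meta_charset(page))
print(decode_html(page))
```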


-Original Message-
From: Rodrigo Rosenfeld Rosas [mailto:rr_ro...@yahoo.com.br.INVALID] 
Sent: Thursday, September 22, 2016 12:27 PM
To: solr-user@lucene.apache.org
Subject: Re: Disabling Zip bomb detection in Tika

Great, thanks for the URL, I'll check that.

I was wondering if maybe Tika would be an overkill solution to my specific 
case. We don't index PDF, DOC or anything like that, just plain HTML.

I mean, if everything Tika does is to extract text from HTML, maybe I could get 
the same result using Nokogiri directly in Ruby and send it as plain text to 
Solr? Am I missing something? What would Tika do besides extracting the text 
from the HTML?

Thanks in advance,
Rodrigo.


Re: Disabling Zip bomb detection in Tika

2016-09-22 Thread Rodrigo Rosenfeld Rosas

This is one of the documents:

https://www.sec.gov/Archives/edgar/data/1472033/000119380513001310/e611133_f6ef-eutelsat.htm

I'll try to create a ticket for this on Jira if I find its location but 
feel free to open it yourself if you prefer, just let me know.



Re: Disabling Zip bomb detection in Tika

2016-09-22 Thread Rodrigo Rosenfeld Rosas

Great, thanks for the URL, I'll check that.

I was wondering whether Tika might be overkill for my specific case. We 
don't index PDF, DOC or anything like that, just plain HTML.

I mean, if all Tika does is extract text from HTML, maybe I could get 
the same result using Nokogiri directly in Ruby and send it as plain 
text to Solr? Am I missing something? What would Tika do besides 
extracting the text from the HTML?
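
A minimal sketch of that client-side approach, written in Python with only the standard library rather than Ruby/Nokogiri: strip the markup, then build a JSON update request for Solr. The core URL, the field names "id" and "content", and the helper names are assumptions for illustration. The request is only constructed here, not sent.

```python
import json
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def build_update_request(solr_core_url, doc_id, html):
    """Build (but do not send) a JSON update request for one document."""
    # "id" and "content" are assumed field names; adjust to your schema.
    doc = {"id": doc_id, "content": html_to_text(html)}
    return urllib.request.Request(
        solr_core_url.rstrip("/") + "/update?commit=true",
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


sample = ("<html><head><style>p{}</style></head><body>"
          "<p>Hello &amp; <b>world</b></p><script>var x=1;</script>"
          "</body></html>")
req = build_update_request("http://localhost:8983/solr/mycore/", "doc-1", sample)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Sending it would be a call to urllib.request.urlopen(req) against a running Solr. Note that this path does no encoding detection and no nesting checks, so it avoids the zip bomb error but also skips whatever else the extraction handler did.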


Thanks in advance,
Rodrigo.

On 22-09-2016 12:11, Erick Erickson wrote:

> Tika was upgraded from 1.7 to 1.13 in Solr 6.2 so this is likely a
> change in Tika.
>
> You could _try_ downgrading Tika, but that's chancy and I have no guarantee
> that it'll work.
>
> Or use a SolrJ client to use an older version of Tika and transmit it
> to Solr, here's an example:
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick


RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
> I'll try to get a sample HTML yielding to this problem and attach it to Jira.

Great!  Tika 1.14 is around the corner...if this is an easy fix ... :)

Thank you.



Re: Disabling Zip bomb detection in Tika

2016-09-22 Thread Erick Erickson
Tika was upgraded from 1.7 to 1.13 in Solr 6.2 so this is likely a
change in Tika.

You could _try_ downgrading Tika, but that's chancy and I have no guarantee
that it'll work.

Or use a SolrJ client to use an older version of Tika and transmit it
to Solr, here's an example:

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

Best,
Erick

On Thu, Sep 22, 2016 at 8:01 AM, Rodrigo Rosenfeld Rosas
<rr_ro...@yahoo.com.br.invalid> wrote:
> I forgot to mention that this problem just happened after I upgraded to a
> recent version of Solr and tried to reindex all documents. Some documents
> that had previously succeeded now failed with this error.
>
> On 22-09-2016 11:58, Rodrigo Rosenfeld Rosas wrote:
>>
>> Hi, thanks. I was talking to @elyograg over freenode#solr and he (or she,
>> can't know by the nickname) recommended me to create a Java app integrating
>> SolrJ and Tika to perform the indexing. Is this the only way to achieve that
>> with Solr? Since I'm not usually a Java developer, I'd prefer another kind
>> of solution, but if there isn't, I'll have to look at the Java API and
>> examples for SolrJ and Tika to achieve that...
>>
>> Just wanted to confirm. I'll try to get a sample HTML yielding to this
>> problem and attach it to Jira.
>>
>> Thanks,
>> Rodrigo.
>>
>> On 22-09-2016 11:48, Allison, Timothy B. wrote:
>>>
>>> Y, looks like Nick (gagravarr) has answered on SO -- can't do it in Tika
>>> currently.
>>>
>>> -Original Message-
>>> From: Allison, Timothy B. [mailto:talli...@mitre.org]
>>> Sent: Thursday, September 22, 2016 10:42 AM
>>> To: solr-user@lucene.apache.org
>>> Cc: 'u...@tika.apache.org' <u...@tika.apache.org>
>>> Subject: RE: Disabling Zip bomb detection in Tika
>>>
>>> I don't think that's configurable at the moment.
>>>
>>> Tika-colleagues, any recommendations?
>>>
>>> If you're able to share the file on Tika's jira, we'd be happy to take a
>>> look.  You shouldn't be getting the zip bomb unless there is a mismatch
>>> between opening and closing tags (which could point to a bug in Tika).
>>>
>>> -Original Message-
>>> From: Rodrigo Rosenfeld Rosas [mailto:rr_ro...@yahoo.com.br.INVALID]
>>> Sent: Thursday, September 22, 2016 10:06 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Disabling Zip bomb detection in Tika
>>>
>>> Hi, this is my first message in this list.
>>>
>>> Is it possible to disable Zip bomb detection in the Tika handler?
>>>
>>> I've also described the problem here:
>>>
>>>
>>> http://stackoverflow.com/questions/39628519/how-to-disable-or-increase-limit-zip-bomb-detection-in-tika-with-solr-config?noredirect=1#comment66575342_39628519
>>>
>>> Basically, I get this error when trying to process some big valid HTML
>>> documents:
>>>
>>> RSolr::Error::Http - 500 Internal Server Error
>>> Error:
>>>
>>> {'responseHeader'=>{'status'=>500,'QTime'=>76},'error'=>{'metadata'=>['error-class','org.apache.solr.common.SolrException','root-error-class','org.apache.tika.sax.SecureContentHandler$SecureSAXException'],'msg'=>'org.apache.tika.exception.TikaException: Zip bomb detected!','trace'=>'org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Zip bomb detected!
>>>   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
>>>   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:154)
>>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2089)
>>>   at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)
>>>   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:459)
>>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
>>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
>>>
>>> I need to index those documents. Is it possible to disable Zip bomb
>>> detection or to increase the limit using configuration files? I noticed it's
>>> possible to add a tika.config file but I have no idea on how to specify what
>>> I want in such Tika configuration files.
>>>
>>> Any help is appreciated!
>>>
>>> Thanks in advance,
>>> Rodrigo.
>>
>>
>>
>>
>


Re: Disabling Zip bomb detection in Tika

2016-09-22 Thread Rodrigo Rosenfeld Rosas
Hi, thanks. I was talking to @elyograg over freenode#solr and he (or 
she, I can't tell from the nickname) recommended that I create a Java app 
integrating SolrJ and Tika to perform the indexing. Is this the only way 
to achieve that with Solr? Since I'm not usually a Java developer, I'd 
prefer another kind of solution, but if there isn't one, I'll have to look 
at the Java API and examples for SolrJ and Tika to achieve that...

Just wanted to confirm. I'll try to get a sample HTML that triggers this 
problem and attach it to Jira.


Thanks,
Rodrigo.


Re: Disabling Zip bomb detection in Tika

2016-09-22 Thread Rodrigo Rosenfeld Rosas
I forgot to mention that this problem just happened after I upgraded to 
a recent version of Solr and tried to reindex all documents. Some 
documents that had previously succeeded now failed with this error.



RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
Yes, looks like Nick (gagravarr) has answered on SO -- can't do it in Tika 
currently.



RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
I don't think that's configurable at the moment.  

Tika-colleagues, any recommendations?

If you're able to share the file on Tika's jira, we'd be happy to take a look.  
You shouldn't be getting the zip bomb unless there is a mismatch between 
opening and closing tags (which could point to a bug in Tika).
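
One way to check a document for that kind of mismatch before filing a report is to count start and end tags per element. The sketch below uses Python's standard html.parser, which is unrelated to the parser Tika uses, and the names TagBalance and unbalanced_tags are made up for this example. HTML allows some end tags (for example on p and li) to be omitted, so a nonzero count is a hint rather than proof of a malformed document.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never take an end tag, so they are left out of the count.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalance(HTMLParser):
    """Counts start tags and end tags per element name."""

    def __init__(self):
        super().__init__()
        self.opened = Counter()
        self.closed = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.opened[tag] += 1

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tag such as <foo/>: nothing is left open

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.closed[tag] += 1


def unbalanced_tags(html):
    """Map element name -> (start tags minus end tags), omitting zeros."""
    parser = TagBalance()
    parser.feed(html)
    parser.close()
    names = set(parser.opened) | set(parser.closed)
    return {name: parser.opened[name] - parser.closed[name]
            for name in names
            if parser.opened[name] != parser.closed[name]}


# foo is a placeholder element name.
print(unbalanced_tags("<div><foo><foo><foo></foo></div><br>"))  # {'foo': 2}
```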

-Original Message-
From: Rodrigo Rosenfeld Rosas [mailto:rr_ro...@yahoo.com.br.INVALID] 
Sent: Thursday, September 22, 2016 10:06 AM
To: solr-user@lucene.apache.org
Subject: Disabling Zip bomb detection in Tika

Hi, this is my first message in this list.

Is it possible to disable Zip bomb detection in the Tika handler?

I've also described the problem here:

http://stackoverflow.com/questions/39628519/how-to-disable-or-increase-limit-zip-bomb-detection-in-tika-with-solr-config?noredirect=1#comment66575342_39628519

Basically, I get this error when trying to process some big valid HTML
documents:

RSolr::Error::Http - 500 Internal Server Error
Error: 
{'responseHeader'=>{'status'=>500,'QTime'=>76},'error'=>{'metadata'=>['error-class','org.apache.solr.common.SolrException','root-error-class','org.apache.tika.sax.SecureContentHandler$SecureSAXException'],'msg'=>'org.apache.tika.exception.TikaException: Zip bomb detected!','trace'=>'org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Zip bomb detected!
 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
 at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:154)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:2089)
 at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)
 at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:459)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)

I need to index those documents. Is it possible to disable Zip bomb detection 
or to increase the limit using configuration files? I noticed it's possible to 
add a tika.config file but I have no idea on how to specify what I want in such 
Tika configuration files.

Any help is appreciated!

Thanks in advance,
Rodrigo.