Hello Jukka,

The document is a customer's output from a WYSIWYG editor, and in my opinion it is an error that the editor generated such a deeply nested structure. So no, this is not a valid document. A nesting depth of 100 could well occur in real documents, but I have never seen it before, and I have seen thousands of unique sites.

I think 100 is fine, I just looked for a way to work around it by configuration, without bothering the customer's customer with it.

Thanks,
Markus

-----Original message-----
> From: Jukka Zitting <[email protected]>
> Sent: Monday 26th August 2019 19:48
> To: Tika Users <[email protected]>; [email protected]
> Subject: Re: How to increase ZIP bomb maximum depth
>
> Hi,
>
> I wonder if we should just increase the default thresholds to allow deeper
> nesting before the exception gets thrown. The defaults should be tuned to
> make the false-positive rate as low as possible without opening the door for
> false negatives that could result in denial of service attacks.
>
> The package-entry depth limit added in
> https://issues.apache.org/jira/browse/TIKA-741 should make it OK to
> increase the default maxDepth from 100 to say 200 if people are hitting this
> limit with valid documents.
>
> Markus, what kind of documents are triggering the exception for you? What
> would be a good maxDepth setting for your case?
>
> Best,
>
> Jukka
>
>
> On Mon, Aug 26, 2019 at 1:40 PM Tim Allison <[email protected]> wrote:
> Oh, ok. This is helpful. Got it. The AutoDetectParser automatically
> wraps the incoming handler in a SecureContentHandler. Some options...
>
> 1) We could have the AutoDetectParser skip wrapping a
> SecureContentHandler around the incoming handler if the user calls
> parse with a SecureContentHandler...
> 2) We could add SecureContentHandler parameter settings to the
> AutoDetectParser, and it would configure the SecureContentHandler
> accordingly...I think there are a few subtleties, but this might get
> you configurability via tika-config.xml.
>
> I'm not offering static thresholds on the SecureContentHandler. :D
>
> Fellow devs, how else might we make this work and make it configurable
> via tika-config.xml?
>
> Cheers,
>
> Tim
>
>
> On Mon, Aug 26, 2019 at 1:24 PM Markus Jelsma
> <[email protected]> wrote:
> >
> > Hello Tim,
> >
> > I use Tika embedded in another Java application, passing it a custom
> > ContentHandler which collects interesting stuff, which we, after the parse,
> > use to construct meaningful text.
> >
> > ReadableContentHandler handler = new ReadableContentHandler(url, config);
> >
> > AutoDetectParser parser = new AutoDetectParser(tikaConfig);
> > parser.parse(stream, handler, new Metadata(), context);
> >
> > My ContentHandler does not extend SecureContentHandler so I never have a
> > chance to pass some different value for the nesting limit check.
> >
> > Many thanks,
> > Markus
> >
> > -----Original message-----
> > > From: Tim Allison <[email protected]>
> > > Sent: Monday 26th August 2019 19:11
> > > To: [email protected]
> > > Subject: Re: How to increase ZIP bomb maximum depth
> > >
> > > Hi Markus,
> > >
> > > This requires some work...the zip bomb protections are currently
> > > handled by the handler. We allow for configuration of the parsers,
> > > detectors, charset detectors, but not yet the handlers. IIRC, we've
> > > talked a bit about specifying a custom handler via the commandline at
> > > least in tika-server. I wonder if we should allow for a default
> > > handler configuration that would specify a handler to be used by the
> > > facade Tika.parse(inputStream)?
> > >
> > > Fellow devs have any recommendations?
> > >
> > > How are you currently calling Tika? Via tika-server, Solr's DIH or
> > > something else?
> > >
> > > Best,
> > >
> > > Tim
> > >
> > > On Mon, Aug 26, 2019 at 11:20 AM Markus Jelsma
> > > <[email protected]> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I've been looking around to increase the limit, but I don't seem to be
> > > > able to find how. I know there is a setter for it, but using
> > > > AutoDetectParser, I'd like to set it via tika-config. I haven't seen a
> > > > parameter for tika-config that would set that value and the manual on
> > > > Configuring Tika doesn't mention it.
> > > >
> > > > Many thanks,
> > > > Markus
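
For readers following the thread: the limit under discussion is enforced by the content handler, which tracks how deeply elements are nested and throws once a threshold is passed. Below is a minimal sketch of that idea using only the JDK's SAX classes. It is not Tika's SecureContentHandler and does not touch the Tika API; the names DepthLimitDemo, DepthLimitingHandler, DepthLimitException and exceedsDepth are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: a generic SAX handler with a nesting limit.
// All names here are hypothetical and are not part of Apache Tika.
class DepthLimitDemo {

    /** Signals that the configured nesting limit was exceeded. */
    static class DepthLimitException extends SAXException {
        DepthLimitException(String message) {
            super(message);
        }
    }

    /** Counts currently open elements and aborts the parse past maxDepth. */
    static class DepthLimitingHandler extends DefaultHandler {
        private final int maxDepth;
        private int depth = 0;

        DepthLimitingHandler(int maxDepth) {
            this.maxDepth = maxDepth;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            depth++;
            if (depth > maxDepth) {
                throw new DepthLimitException(
                        "Nesting depth " + depth + " exceeds limit " + maxDepth);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
        }
    }

    /** Returns true if parsing the XML trips the depth limit. */
    static boolean exceedsDepth(String xml, int maxDepth) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DepthLimitingHandler(maxDepth));
            return false;
        } catch (SAXException e) {
            // The limit exception may arrive directly or wrapped by the parser.
            if (e instanceof DepthLimitException
                    || e.getException() instanceof DepthLimitException) {
                return true;
            }
            throw e;
        }
    }
}
```

Because the limit is a constructor argument of the handler in this sketch, raising it only requires control over how the handler is built. That is the part the thread identifies as missing when AutoDetectParser creates the wrapping handler itself.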
