On Sun, 18 Oct 2015, Vjeran Marcinko wrote:
Well, the problem is that I don't need to collect the raw content of every possible file type, just some predefined file types. And some parsed files can be veeeeeery large, like some big archives, and I don't want to collect the raw bytes for such files (memory issue). But the problem is that the general Tika API offers the file's "content-type" as part of Metadata, which is populated *only after Parser.parse has finished*.

Why not do detection first? Then wrap the stream if needed, set the result of detection onto the Metadata object, and finally call DefaultParser. That's basically what AutoDetectParser does internally, but this way you get to apply your own logic after detection and before parsing.
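Roughly, the "wrap if needed" step could look like the sketch below. It is only an illustration: CapturingInputStream, SelectiveCapture and the CAPTURE_TYPES whitelist are made-up names, not part of Tika, and the Tika calls appear only in comments to show where they would go. Detection is done on the original stream (Detector.detect needs mark/reset support), and the capturing wrapper is added only afterwards.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;

// Hypothetical helper (not a Tika class): copies every byte read through it
// into an in-memory buffer. Bytes passed over with skip() are not captured.
final class CapturingInputStream extends FilterInputStream {
    private final ByteArrayOutputStream captured = new ByteArrayOutputStream();

    CapturingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            captured.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            captured.write(buf, off, n);
        }
        return n;
    }

    // mark/reset is switched off so re-read bytes are never captured twice.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public void mark(int readLimit) {
        // intentionally a no-op
    }

    @Override
    public void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    byte[] capturedBytes() {
        return captured.toByteArray();
    }
}

// Hypothetical helper: decides, from the already detected type, whether the
// raw bytes are worth keeping.
final class SelectiveCapture {
    // Assumed whitelist, purely for illustration.
    static final Set<String> CAPTURE_TYPES = Set.of("application/pdf", "text/plain");

    static InputStream wrapIfWanted(String detectedType, InputStream in) {
        return CAPTURE_TYPES.contains(detectedType) ? new CapturingInputStream(in) : in;
    }
}

// Where the Tika calls would sit around the helper:
//   MediaType type = detector.detect(stream, metadata);   // Detector.detect(InputStream, Metadata)
//   InputStream in = SelectiveCapture.wrapIfWanted(type.toString(), stream);
//   metadata.set(Metadata.CONTENT_TYPE, type.toString());
//   parser.parse(in, handler, metadata, context);          // Parser.parse(...)
```

With this arrangement a large archive is never buffered, because its detected type is not in the whitelist and the stream is handed to the parser unwrapped.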

Nick
