Help needed for special byte collecting input stream

Vjeran Marcinko Sun, 18 Oct 2015 13:08:20 -0700

Hi,

Besides extracting textual content from every parsed file, I need to simultaneously collect the raw bytes of every parsed file that is of a certain predefined content type. I use a recursive parser, so a parsed file can be one of the "contained" ones that Tika extracted from some big archive. I want to use this Tika parsing/extraction process to fetch the raw bytes of every individual file that was extracted.

Collection of raw bytes is done via a special "ByteCollectingInputStream" that is consumed by Tika parsers. It is a kind of FilterInputStream that collects the raw bytes into an internal array (acting as storage) as they pass through from its decorated InputStream.
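
To make the idea concrete, here is a minimal sketch of such a stream using only java.io. The method name getCollectedBytes and the decision to disable mark/reset are assumptions made for this illustration, not part of any Tika API. Bytes consumed through skip() are not collected in this simplified version.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: copies every byte that is read through it into an in-memory buffer.
class ByteCollectingInputStream extends FilterInputStream {
    private final ByteArrayOutputStream collected = new ByteArrayOutputStream();

    ByteCollectingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            collected.write(b);
        }
        return b;
    }

    // FilterInputStream.read(byte[]) delegates here, so it is covered as well.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            collected.write(buf, off, n);
        }
        return n;
    }

    // Assumption: mark/reset is disabled so that re-read bytes are never stored twice.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public synchronized void mark(int readlimit) {
        // intentionally a no-op
    }

    @Override
    public synchronized void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    // Hypothetical accessor: returns a copy of everything read so far.
    byte[] getCollectedBytes() {
        return collected.toByteArray();
    }
}
```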


Now one can ask what's the problem then?

Well, the problem is that I don't need to collect the raw content of every possible file type, just some predefined ones. Some parsed files can be very large, such as big archives, and I don't want to collect raw bytes for those (memory issue). But the general Tika API offers the file's "content-type" as part of Metadata that is populated *only after Parser.parse has finished*, which is unacceptable for me because the content bytes are already collected by then. So I want my "ByteCollectingInputStream" to know as soon as possible whether the bytes streaming through it are of a required file type, and to stop collecting them if they are not.

I assume that means I have to pass an initially empty Metadata to the constructor of my special "ByteCollectingInputStream" before calling Parser.parse, and somehow allow a few bytes to pass through, just enough for downstream parsers to figure out the "content-type" field, and then continue collecting the bytes if the content-type has a desired value. But I am unsure how to implement this, and how to know how many bytes are enough for Metadata to be populated.
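
The stream-side half of this could look roughly like the sketch below: a collector that can be told to stop and that drops its buffer when it does. It deliberately leaves open the actual question of when the content type becomes known; the caller is assumed to invoke stopCollecting() once it learns the type is not wanted. The class name, stopCollecting, isCollecting and the maxBytes cap are all hypothetical names for this illustration. The cap is an added safeguard that bounds memory while the decision is still pending.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: collects bytes until told to stop or until a size cap is exceeded.
class SelectiveByteCollectingInputStream extends FilterInputStream {
    private final int maxBytes;
    // null means collection was abandoned and the buffer was released.
    private ByteArrayOutputStream collected = new ByteArrayOutputStream();

    SelectiveByteCollectingInputStream(InputStream in, int maxBytes) {
        super(in);
        this.maxBytes = maxBytes;
    }

    // Hypothetical hook: call this once the content type turns out to be unwanted.
    void stopCollecting() {
        collected = null;
    }

    boolean isCollecting() {
        return collected != null;
    }

    private void collect(byte[] buf, int off, int n) {
        if (collected == null) {
            return;
        }
        if ((long) collected.size() + n > maxBytes) {
            stopCollecting(); // too large: give up and free the memory
            return;
        }
        collected.write(buf, off, n);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            collect(new byte[] {(byte) b}, 0, 1);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            collect(buf, off, n);
        }
        return n;
    }

    // Assumption: mark/reset is disabled so that re-read bytes are never stored twice.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public synchronized void mark(int readlimit) {
        // intentionally a no-op
    }

    @Override
    public synchronized void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    // Returns a copy of the collected bytes, or null if collection was abandoned.
    byte[] getCollectedBytes() {
        return collected == null ? null : collected.toByteArray();
    }
}
```

Reading continues normally after collection stops, so the parser consuming the stream is unaffected either way.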


Any suggestions for an implementation are welcome.

Regards,
Vjeran
