Hi,

Besides extracting textual content from every parsed file, I need to simultaneously collect the raw bytes of every parsed file that has one of certain predefined content types. I use a recursive parser, so a parsed file can be one of the "contained" files that Tika extracts from a large archive. I want to use this Tika parsing/extraction process to fetch the raw bytes of each individual extracted file at the same time.

The raw bytes are collected by a special "ByteCollectingInputStream" that the Tika parsers consume. It is a kind of FilterInputStream that copies the bytes read from its decorated InputStream into an internal array (acting as storage) as it passes them on to the parser.
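
To make the description concrete, here is a minimal sketch of such a stream, not my exact code. The method names stopCollecting and getCollectedBytes are made up for illustration. The sketch assumes the consumer reads sequentially: skipped bytes are not collected, and mark/reset is disabled so that re-read bytes are not stored twice.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative sketch: copies every byte read from the wrapped stream into a buffer. */
class ByteCollectingInputStream extends FilterInputStream {
    private ByteArrayOutputStream collected = new ByteArrayOutputStream();
    private boolean collecting = true;

    ByteCollectingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1 && collecting) {
            collected.write(b);
        }
        return b;
    }

    // FilterInputStream.read(byte[]) delegates here, so this covers both bulk reads.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0 && collecting) {
            collected.write(buf, off, n);
        }
        return n;
    }

    // Disable mark/reset: bytes read again after a reset would be collected twice.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public void mark(int readLimit) {
        // intentionally a no-op
    }

    @Override
    public void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    /** Hypothetical hook: stop storing bytes and release what was stored so far. */
    void stopCollecting() {
        collecting = false;
        collected = new ByteArrayOutputStream();
    }

    /** Hypothetical accessor: the bytes collected so far (empty after stopCollecting). */
    byte[] getCollectedBytes() {
        return collected.toByteArray();
    }
}
```

Something has to call stopCollecting() once the content type is known to be unwanted; when and how to do that is exactly what I am missing.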

Now one might ask: what is the problem, then?

Well, the problem is that I don't need to collect the raw content of every possible file type, just of some predefined file types. Some parsed files can be very large, such as big archives, and I don't want to collect the raw bytes of such files (memory issue). However, the general Tika API offers a file's "content-type" as part of the Metadata, which is populated *only after Parser.parse has finished*. That is unacceptable for me because the content bytes have already been collected by then. I want my "ByteCollectingInputStream" to know as soon as possible whether the bytes streaming through it belong to a required file type, and to stop collecting them if they do not.

I assume this means I have to pass an initially empty Metadata to the constructor of my "ByteCollectingInputStream" before calling Parser.parse, and somehow let a few bytes pass through, just enough for the downstream parsers to determine the "content-type" field. The stream would then continue collecting bytes only if the content-type has a desired value. However, I am unsure how to implement this, and how to know how many bytes are enough for the Metadata to be populated.
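
A rough sketch of the "let a few bytes through first" idea, using only the JDK: read a bounded prefix from a mark-supporting stream, rewind, and decide from that prefix. Everything here is an assumption for illustration. PrefixSniffer, peek and looksLikePdf are made-up names, the PDF magic-byte check is only a stand-in for real content-type detection, and the right prefix length is exactly the part I do not know.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative sketch: inspect the start of a stream without consuming it. */
class PrefixSniffer {

    /** Reads up to maxBytes from a mark-supporting stream, then rewinds it. */
    static byte[] peek(InputStream in, int maxBytes) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(maxBytes);
        try {
            byte[] buf = new byte[maxBytes];
            int total = 0;
            // A single read may return fewer bytes than requested, so loop.
            while (total < maxBytes) {
                int n = in.read(buf, total, maxBytes - total);
                if (n == -1) {
                    break; // stream is shorter than maxBytes
                }
                total += n;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset(); // the caller sees the stream from its original position
        }
    }

    /** Stand-in for real detection: recognises only the PDF magic "%PDF-". */
    static boolean looksLikePdf(byte[] prefix) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        return prefix.length >= magic.length
                && Arrays.equals(Arrays.copyOf(prefix, magic.length), magic);
    }
}
```

With something like this, the decision to collect could be made before the parser reads a single byte, for example by wrapping the source in a BufferedInputStream, calling peek, and only then choosing whether to wrap it in the collecting stream. Whether this fits into the recursive parsing of embedded files is what I would like to find out.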

Any suggestions for an implementation are welcome.

Regards,
Vjeran
