Take a look at the DigestingParser, which wraps another parser.

new DigestingParser(p, new CommonsDigester(100, 

If you need modifications, please open a ticket.

From: Wshrdryr Corp [mailto:wshrd...@gmail.com]
Sent: Wednesday, February 15, 2017 7:49 PM
To: user@tika.apache.org
Subject: Re: CRC ContentHandler

Hello Markus,

Thanks again for taking the time to reply.

I guess I should be more specific: I am extending a NiFi component to add this 
CRC calculation, specifically here:


NiFi uses Tika to parse files, but doesn't do anything with the content. My 
plan is to write a Tika content handler which calculates a CRC from the content 
as it parses the file in order to get a fingerprint of the data segment so the 
tags can be modified and my system can still prove it's the same underlying 

This is why I asked the original question in the way I did.

I see the definition of a SAX ContentHandler, but I was hoping to find a way to 
get the underlying stream. Or, if I am going about this the wrong way, I'd 
appreciate any advice.


On Wed, Feb 15, 2017 at 4:03 PM, Markus Jelsma 
<markus.jel...@openindex.io> wrote:
Hello - streaming hash functions are, in general, from a cryptographic point of 
view a bad idea, but if you are just interested in checking data integrity it 
might work for you. You will either have to collect all bits of data and hash 
it in the end, or feed it to a hashing function that allows for streaming data. 
The algorithm is up to you.
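
The second option above, feeding the data to a hashing function that accepts it piece by piece, can be sketched with the JDK alone. This is only an illustration under stated assumptions: the class name StreamingChecksum and its methods are hypothetical and not part of Tika or NiFi, and CRC-32 and SHA-256 are arbitrary choices of algorithm.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

/** Hypothetical helper: keeps a running CRC-32 and SHA-256 over data fed in chunks. */
final class StreamingChecksum {
    private final CRC32 crc = new CRC32();
    private final MessageDigest sha256;

    StreamingChecksum() {
        try {
            sha256 = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    /** Folds one chunk into both checksums; nothing is retained afterwards. */
    void update(byte[] chunk, int offset, int length) {
        crc.update(chunk, offset, length);
        sha256.update(chunk, offset, length);
    }

    /** CRC-32 of everything fed in so far. */
    long crc32() {
        return crc.getValue();
    }

    /** Finishes the SHA-256 and returns it as lowercase hex; call once when done. */
    String sha256Hex() {
        StringBuilder hex = new StringBuilder();
        for (byte b : sha256.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Reads a stream through a small fixed buffer instead of buffering the whole file. */
    static StreamingChecksum of(InputStream in) throws IOException {
        StreamingChecksum sum = new StreamingChecksum();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            sum.update(buffer, 0, read);
        }
        return sum;
    }
}
```

Memory use stays at one 8 KB buffer regardless of file size, which is the point of streaming here.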

But, on the other hand, are the files you receive that large? Does your process 
at some point buffer the entire file? If so, hashing it is easy. I don't know 
if Tika supports ingesting streaming data, but in Apache Nutch we buffer the 
entire file at some point before sending it to Apache Tika, so hashing the data 
is, in this case, not a problem.


-----Original message-----
> From: Wshrdryr Corp <wshrd...@gmail.com>
> Sent: Thursday 16th February 2017 0:43
> To: user@tika.apache.org
> Subject: Re: CRC ContentHandler
> Hello Markus,
>
> Thanks for replying.
>
> I was hoping not to have to buffer entire media files due to size. Is there a 
> way to get the content segment as a stream? The internal buffering of a 
> stream might be more efficient and less prone to spikes.
>
> Java is not my native tongue. I've been able to hack through other API 
> challenges while doing this project. Googling has given me some suspicions 
> but not a clear answer.
>
> Cheers.
> On Wed, Feb 15, 2017 at 3:26 PM, Markus Jelsma 
> <markus.jel...@openindex.io> wrote:
> Hello - I don't know if media files even produce SAX events, but if they do 
> you can catch them in your startElement, characters, and endElement methods. 
> I would start collecting element names (qName and/or attribute values) and 
> the text passed to the characters method, and append those to a StringBuilder.


> In the endDocument method you have collected every piece of information the 
> ContentHandler receives. From there on you just call 
> toString().hashCode() or whatever hashing algorithm you like on the contents 
> accumulated in your StringBuilder.
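
A variation on the approach described above can update a checksum as each SAX event arrives instead of accumulating text in a StringBuilder. The sketch below is only an illustration: the class name Crc32ContentHandler is hypothetical, CRC-32 and UTF-8 are assumed choices, and it only covers text that the parser actually delivers through characters(). Whether media parsers emit the data segment that way is not established in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: folds SAX character events into a running CRC-32. */
final class Crc32ContentHandler extends DefaultHandler {
    private final CRC32 crc = new CRC32();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Encode this chunk as UTF-8 and add it to the checksum right away,
        // so no text is kept in memory between events.
        // Simplification: a surrogate pair split across two events would not
        // be encoded correctly; a CharsetEncoder could handle that case.
        byte[] bytes = new String(ch, start, length).getBytes(StandardCharsets.UTF_8);
        crc.update(bytes, 0, bytes.length);
    }

    /** CRC-32 of all character data seen so far; read it after parsing ends. */
    long getValue() {
        return crc.getValue();
    }
}
```

An instance would be passed as the ContentHandler argument of the parser's parse call, and getValue() read once parsing has finished. Element names and attribute values are ignored here; they could be folded in from startElement in the same way.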


> Regards,
> Markus

> -----Original message-----
> > From: Wshrdryr Corp <wshrd...@gmail.com>
> > Sent: Wednesday 15th February 2017 23:22
> > To: user@tika.apache.org
> > Subject: CRC ContentHandler
> >
> > Hello all,
> >
> > I need to write a Tika ContentHandler which will return a CRC and/or hash 
> > of the non-metadata part of media files.
> >
> > Can anyone point me in the right direction?
> >
> > I'm new to Tika so please forgive me if this is an obvious question.
> >
> > TIA for any help.
