Hello Tim,

Thanks for the pointer.

I've been trying to figure out how to frame a ticket for a theoretical CRCDigestingParser while looking at my project spec. I foresee a need for a few kinds of as-simultaneous-as-can-be content analysis on the same file, so I think my actual philosophical ticket is TIKA-1509, but until that happy day:

TIKA-2272 opened.

On Thu, Feb 16, 2017 at 3:54 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

> Take a look at the DigestingParser, which wraps another parser.
>
> new DigestingParser(p, new CommonsDigester(100, CommonsDigester.DigestAlgorithm.MD5))
>
> If you need modifications, please open a ticket.
>
> From: Wshrdryr Corp [mailto:wshrd...@gmail.com]
> Sent: Wednesday, February 15, 2017 7:49 PM
> To: user@tika.apache.org
> Subject: Re: CRC ContentHandler
>
> Hello Markus,
>
> Thanks again for taking the time to reply.
>
> I guess I should be more specific: I am extending a NiFi component to add this CRC calculation, specifically here:
>
> https://github.com/apache/nifi/blob/0.x/nifi-nar-bundles/nifi-media-bundle/nifi-media-processors/src/main/java/org/apache/nifi/processors/media/ExtractMediaMetadata.java#L213
>
> NiFi uses Tika to parse files, but doesn't do anything with the content. My plan is to write a Tika content handler which calculates a CRC from the content as it parses the file, in order to get a fingerprint of the data segment, so the tags can be modified and my system can still prove it's the same underlying data.
>
> This is why I asked the original question in the way I did.
>
> I see the definition of a SAX ContentHandler, but I was hoping to find a way to get the underlying stream. Or, if I am going about this the wrong way, I'd appreciate any advice.
>
> Cheers.
>
> On Wed, Feb 15, 2017 at 4:03 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> Hello - streaming hash functions are, in general, a bad idea from a cryptographic point of view, but if you are just interested in checking data integrity it might work for you. You will either have to collect all bits of data and hash them at the end, or feed them to a hashing function that allows for streaming data. The algorithm is up to you.
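The second option above (feeding data to a function that accepts it incrementally) can be sketched with the JDK alone. This is a minimal sketch, assuming CRC32 from `java.util.zip` is an acceptable integrity check; the class name `StreamingCrc` and method `crc32Of` are hypothetical names chosen for illustration. Note that it checksums every byte of the stream, tags included; isolating the data segment of a media file is format-specific and is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical helper: computes a CRC32 over a stream without buffering the whole file.
class StreamingCrc {

    // Reads the stream in fixed-size chunks; memory use stays at the buffer size.
    static long crc32Of(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        try (CheckedInputStream checked = new CheckedInputStream(in, crc)) {
            byte[] buffer = new byte[8192];
            while (checked.read(buffer) != -1) {
                // Nothing to do here: CheckedInputStream updates the checksum
                // with every byte that passes through read().
            }
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        // Prints the checksum of the sample bytes in hexadecimal.
        System.out.println(Long.toHexString(crc32Of(new ByteArrayInputStream(data))));
    }
}
```

Because the checksum is updated as a side effect of reading, the same wrapper could in principle sit in front of any consumer of the stream, so the file is read only once.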
>
> But, on the other hand, are the files you receive that large? Does your process at some point buffer the entire file? If so, hashing is easy. I don't know if Tika supports ingesting streaming data, but in Apache Nutch we buffer the entire file at some point before sending it to Apache Tika, so hashing the data is, in this case, not a problem.
>
> Markus
>
> > -----Original message-----
> > From: Wshrdryr Corp <wshrd...@gmail.com>
> > Sent: Thursday 16th February 2017 0:43
> > To: user@tika.apache.org
> > Subject: Re: CRC ContentHandler
> >
> > Hello Markus,
> >
> > Thanks for replying.
> >
> > I was hoping not to have to buffer entire media files due to size. Is there a way to get the content segment as a stream? The internal buffering of a stream might be more efficient and less prone to spikes.
> >
> > Java is not my native tongue. I've been able to hack through other API challenges while doing this project. Googling has given me some suspicions but not a clear answer.
> >
> > Cheers.
> >
> > On Wed, Feb 15, 2017 at 3:26 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> > > Hello - I don't know if media files even produce SAX events, but if they do you can catch them in your startElement, characters, and endElement methods. I would start collecting element names (qName and/or attribute values) and the text passed to the characters method, and append those to a StringBuilder.
> > >
> > > In the endDocument method you have collected every piece of information the ContentHandler receives. From there on you just call toString().hashCode(), or whatever hashing algorithm you like, on the contents accumulated in your StringBuilder.
> > >
> > > Regards,
> > > Markus
> > >
> > > > -----Original message-----
> > > > From: Wshrdryr Corp <wshrd...@gmail.com>
> > > > Sent: Wednesday 15th February 2017 23:22
> > > > To: user@tika.apache.org
> > > > Subject: CRC ContentHandler
> > > >
> > > > Hello all,
> > > >
> > > > I need to write a Tika ContentHandler which will return a CRC and/or hash of the non-metadata part of media files.
> > > >
> > > > Can anyone point me in the right direction?
> > > >
> > > > I'm new to Tika, so please forgive me if this is an obvious question.
> > > >
> > > > TIA for any help.
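
For the original question, the ContentHandler idea from earlier in the thread can be combined with an incremental checksum, so nothing has to accumulate in a StringBuilder. The following is a minimal sketch using only JDK classes; `CrcContentHandler` and `getValue` are hypothetical names, and the demo drives the handler with the JDK's own SAX parser rather than Tika. Whether Tika emits any character events for a given media format is not established in this thread, so this may not fingerprint the audio or video data itself.

```java
import java.io.StringReader;
import java.util.zip.CRC32;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: folds every characters() event into a running CRC32.
class CrcContentHandler extends DefaultHandler {
    private final CRC32 crc = new CRC32();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Feed each char as two bytes (big-endian UTF-16), so the result does
        // not depend on how the parser splits text across events.
        for (int i = start; i < start + length; i++) {
            crc.update(ch[i] >> 8);
            crc.update(ch[i] & 0xFF);
        }
    }

    // Checksum of all character data seen so far.
    long getValue() {
        return crc.getValue();
    }

    public static void main(String[] args) throws Exception {
        CrcContentHandler handler = new CrcContentHandler();
        String xml = "<doc><p>1234</p><p>56789</p></doc>";
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        // Prints the checksum of the text content in hexadecimal.
        System.out.println(Long.toHexString(handler.getValue()));
    }
}
```

With Tika, a handler like this would be passed as the ContentHandler argument of `Parser.parse(InputStream, ContentHandler, Metadata, ParseContext)`. Element names and attributes are ignored here; including them, as suggested above, would mean overriding startElement as well.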