Hello Tim,

Thanks for the pointer. I've been trying to figure out how to frame a
ticket for a theoretical CRCDigestingParser while looking at my project
spec. I foresee a need for a few kinds of as-simultaneous-as-can-be content
analysis on the same file, so I think my actual philosophical ticket is
TIKA-1509. Until that happy day, I have opened TIKA-2272.
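
To make the "several kinds of analysis on the same file" idea concrete, here is a minimal sketch using only the JDK, not Tika. It computes a CRC32 and an MD5 in one pass by chaining two stream wrappers. The class name SinglePassDigests and its Result holder are hypothetical names made up for this illustration; this is not how DigestingParser is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical helper: one read pass feeds two independent digests.
public class SinglePassDigests {

    /** Holds both results of a single pass over the stream. */
    public static final class Result {
        public final long crc32;
        public final String md5Hex;

        Result(long crc32, String md5Hex) {
            this.crc32 = crc32;
            this.md5Hex = md5Hex;
        }
    }

    /** Reads the stream to its end; the caller remains responsible for closing it. */
    public static Result digest(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        CRC32 crc = new CRC32();
        // Every byte read through 'chained' updates the MD5 first, then the CRC32.
        InputStream chained = new CheckedInputStream(new DigestInputStream(in, md5), crc);
        byte[] buf = new byte[8192];
        while (chained.read(buf) != -1) {
            // Nothing to do here: reading is what drives both digests.
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return new Result(crc.getValue(), hex.toString());
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        Result r = digest(new ByteArrayInputStream(data));
        System.out.println(Long.toHexString(r.crc32) + " " + r.md5Hex);
    }
}
```

Only the fixed 8 KB buffer is held in memory, so file size should not matter. Adding a third analysis would mean adding another wrapper to the chain.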


On Thu, Feb 16, 2017 at 3:54 AM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> Take a look at the DigestingParser, which wraps another parser.
>
>
>
> new DigestingParser(p, new CommonsDigester(100, CommonsDigester.DigestAlgorithm.MD5))
>
>
>
> If you need modifications, please open a ticket.
>
>
>
> *From:* Wshrdryr Corp [mailto:wshrd...@gmail.com]
> *Sent:* Wednesday, February 15, 2017 7:49 PM
>
> *To:* user@tika.apache.org
> *Subject:* Re: CRC ContentHandler
>
>
>
> Hello Markus,
>
>
>
> Thanks again for taking the time to reply.
>
>
>
> I guess I should be more specific: I am extending a Nifi component to add
> this CRC calculation, specifically here:
>
>
>
> https://github.com/apache/nifi/blob/0.x/nifi-nar-bundles/nifi-media-bundle/nifi-media-processors/src/main/java/org/apache/nifi/processors/media/ExtractMediaMetadata.java#L213
>
>
>
> Nifi uses Tika to parse files, but doesn't do anything with the content.
> My plan is to write a Tika content handler which calculates a CRC from the
> content as it parses the file, in order to get a fingerprint of the data
> segment, so the tags can be modified and my system can still prove it's the
> same underlying data.
>
>
>
> This is why I asked the original question in the way I did.
>
>
>
> I see the definition of a SAX ContentHandler, but I was hoping to find a
> way to get the underlying stream. Or, if I am going about this the wrong
> way, I'd appreciate any advice.
>
>
>
> Cheers.
>
>
>
> On Wed, Feb 15, 2017 at 4:03 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
> Hello - streaming hash functions are, in general, from a cryptographic
> point of view a bad idea, but if you are just interested in checking data
> integrity it might work for you. You will either have to collect all bits
> of data and hash them at the end, or feed them to a hashing function that
> allows for streaming data. The algorithm is up to you.
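
As an illustration of the second option (a hashing function that accepts streaming data), here is a minimal sketch with the JDK's java.util.zip.CRC32, which can be updated chunk by chunk. The class name StreamingCrc and the bufferSize parameter are assumptions made for this example, not something from Tika or Nifi.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: CRC32 over a stream using a small fixed buffer.
public class StreamingCrc {

    /** Computes the CRC32 of everything left in the stream, bufferSize bytes at a time. */
    public static long crc32Of(InputStream in, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        CRC32 crc = new CRC32();
        byte[] buf = new byte[bufferSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            // Only the bytes actually read in this round are added to the checksum.
            crc.update(buf, 0, n);
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        // A deliberately tiny buffer, to show that chunking does not change the result.
        long streamed = crc32Of(new ByteArrayInputStream(data), 4);
        CRC32 oneShot = new CRC32();
        oneShot.update(data, 0, data.length);
        System.out.println(streamed == oneShot.getValue());
    }
}
```

Memory use is bounded by the buffer size rather than the file size, which is the point of the streaming variant.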
>
> But, on the other hand, are the files you receive that large? Does your
> process at some point buffer the entire file? If so, hashing it is easy. I
> don't know if Tika supports ingesting streaming data, but in Apache Nutch we
> buffer the entire file at some point before sending it to Apache Tika, so
> hashing the data is, in this case, not a problem.
>
> Markus
>
>
> -----Original message-----
> > From:Wshrdryr Corp <wshrd...@gmail.com>
> > Sent: Thursday 16th February 2017 0:43
> > To: user@tika.apache.org
> > Subject: Re: CRC ContentHandler
> >
> > Hello Markus,
> >
> > Thanks for replying.
> >
> > I was hoping not to have to buffer entire media files due to size. Is
> > there a way to get the content segment as a stream? The internal buffering
> > of a stream might be more efficient and less prone to spikes.
> >
> > Java is not my native tongue. I've been able to hack through other API
> > challenges while doing this project. Googling has given me some suspicions
> > but not a clear answer.
> >
> > Cheers.
> >
> > On Wed, Feb 15, 2017 at 3:26 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > Hello - I don't know if media files even produce SAX events, but if they
> > do you can catch them in your startElement, characters, and endElement
> > methods. I would start collecting element names (qName and/or attribute
> > values) and stuff in the characters method, and append those to a
> > StringBuilder.
> >
> > In the endDocument method you have collected every piece of information
> > the ContentHandler receives. From there on you just call
> > toString().hashCode() or whatever hashing algorithm you like on the
> > contents accumulated in your StringBuilder.
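
A variation on the StringBuilder idea, sketched with the JDK's own SAX classes rather than Tika: instead of accumulating text and hashing at the end, the handler can update a CRC32 as each event arrives, so nothing is buffered. The class name CrcContentHandler is hypothetical. Note that this fingerprints the SAX events (element names and character data), not the raw bytes of the file, so it may not be the right tool if the underlying stream is what needs checking.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: folds SAX events into a running CRC32.
public class CrcContentHandler extends DefaultHandler {
    private final CRC32 crc = new CRC32();

    // Feeds each char as two bytes (high byte, then low byte), so the result
    // does not depend on how the parser splits text across characters() calls.
    private void feed(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            crc.update(ch[i] >>> 8);
            crc.update(ch[i] & 0xFF);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        char[] name = qName.toCharArray();
        feed(name, 0, name.length);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        feed(ch, start, length);
    }

    /** Current checksum; read it after parsing has finished. */
    public long getValue() {
        return crc.getValue();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<doc><p>123456789</p></doc>".getBytes(StandardCharsets.UTF_8);
        CrcContentHandler handler = new CrcContentHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(new ByteArrayInputStream(xml), handler);
        System.out.println(Long.toHexString(handler.getValue()));
    }
}
```

Attribute values could be fed the same way inside startElement if they should count toward the fingerprint.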
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From: Wshrdryr Corp <wshrd...@gmail.com>
> > > Sent: Wednesday 15th February 2017 23:22
> > > To: user@tika.apache.org
> > > Subject: CRC ContentHandler
> > >
> > > Hello all,
> > >
> > > I need to write a Tika ContentHandler which will return a CRC and/or
> > > hash of the non-metadata part of media files.
> > >
> > > Can anyone point me in the right direction?
> > >
> > > I'm new to Tika so please forgive me if this is an obvious question.
> > >
> > > TIA for any help.
