Sorry, just catching up on this. If you want the digest of the incoming
bytes and you can configure tika-server via a config file, try this as the
config (e.g. tika-config-digest.xml):

<properties>
  <server>
    <params>
      <digest>sha256</digest>
    </params>
  </server>
</properties>

Then start the server:

  java -jar tika-server-standard-xyz.jar -c tika-config-digest.xml

Then send the file:

  curl -T ~/Downloads/Get_Started_With_Smallpdf.pdf http://localhost:9998/tika

This should be in the output:

  <meta name="X-TIKA:digest:SHA256"
        content="91184c3c4db0d5d6fdac1d33a220f208e29df1b4c06daebc0591ff6447bcfed2"/>

I confirmed this value with shasum -a 256.
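
If you'd rather script that comparison than eyeball it, here is a minimal
Python sketch. The function names are my own (hypothetical, not part of
Tika), and the regex assumes the meta tag looks like the one shown above
(name attribute first, then content); adjust it if your output differs.

```python
import hashlib
import re
from typing import Optional


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tika_digest_from_xhtml(xhtml: str) -> Optional[str]:
    """Extract the X-TIKA:digest:SHA256 value from tika-server XHTML output.

    Assumption: the tag has the shape shown in this thread, i.e.
    <meta name="X-TIKA:digest:SHA256" content="<64 hex chars>"/>.
    Returns None if no such tag is found.
    """
    match = re.search(
        r'<meta\s+name="X-TIKA:digest:SHA256"\s+content="([0-9a-fA-F]{64})"\s*/?>',
        xhtml,
    )
    return match.group(1).lower() if match else None
```

Save the curl output to a file, pass its contents to
tika_digest_from_xhtml(), and compare the result with sha256_of_file() on
the original attachment. If the two differ, the bytes tika-server received
are not the bytes on disk.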



On Tue, Jul 19, 2022 at 1:11 PM PGNet Dev <[email protected]> wrote:

> On 7/19/22 12:24 PM, Tilman Hausherr wrote:
> > The checkstyle violation is about the coding style. You can delete that
> > part in the tika-parent/pom.xml if you want, or add <skip>true</skip> below
> > "<configuration>" in that plugin. Same for the ossindex-maven-plugin and
> > the forbiddenapis plugin.
>
> > If the debugger didn't stop, then the breakpoint was at the wrong place.
> > Or it's not possible to debug.
>
> I'll give the pom mod a try in a bit.
>
> As to which breakpoint, I certainly don't know the tika/java internals
> well enough to say what is/isn't correct, yet.
>
> > Re "is there anything informative in that now-more-verbose DEBUG output?"
> > well yes, the MD5 output. This proves that the file is different. (ok,
> > the different length showed that too)
>
> I've asked over at Dovecot ML what, specifically, dovecot 'sends' to the
> tika backend via their fts-tika plugin:
>
>    the original/complete/unmodified attachment, suggesting that the file
> size / MD5 hash should be the same as what tika's trapping
>
> or,
>
>    some modification to the file is made (trimmed, or add'l headers, etc
> etc), and that the size/hash are not _expected_ to be the same
>
> we'll see what i hear
>
>
