Yes, the file is deleted... Try setting a breakpoint in org.apache.tika.parser.pdf.PDFParser so that you can grab the file while it still exists.

Alternatively, grab the source code from trunk and add this line to the file
tika-main\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pdf-module\src\main\java\org\apache\tika\parser\pdf\PDFParser.java

Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));

after the line that has ", md5: ".

Then build the parser module, and then the standard server subproject with "mvn -DskipTests install".

The file tika-server-standard-2.4.2-SNAPSHOT.jar will be in

tika-main\tika-server\tika-server-standard\target

I can also do it for you and upload the jar file somewhere, but obviously that's risky.
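
If you want to try the copy step in isolation first, here is a minimal standalone sketch of the same idea. It does not use any Tika classes; the class name TempFileKeeper, the method keepCopy and the file names are made up for illustration, and the temp file here merely stands in for the one that PDFParser sees via tstream.getPath().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tika: keeps a copy of a temp file
// so that it survives after the original has been deleted.
class TempFileKeeper {

    // Copies tempFile to target, overwriting an existing target.
    static Path keepCopy(Path tempFile, Path target) throws IOException {
        return Files.copy(tempFile, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the temp file that the parser works on.
        Path temp = Files.createTempFile("apache-tika-demo-", ".tmp");
        Files.write(temp, "%PDF-1.7 demo".getBytes(StandardCharsets.US_ASCII));

        Path kept = keepCopy(temp, temp.resolveSibling("kept-copy.pdf"));

        // Simulate the cleanup that removes the temp file.
        Files.delete(temp);
        System.out.println(kept + " still exists: " + Files.exists(kept));
    }
}
```

Files.copy avoids reading the whole file into memory, but for a file of this size the Files.write / Files.readAllBytes line above works just as well.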

Tilman

On 19.07.2022 at 03:53, PGNet Dev wrote:


I've just improved the output: it now includes an MD5 checksum. This would be another indicator that something is wrong (or not).

indeed.

i now see in the logs

    Jul 18 21:28:23 mx-test tika[18970]: DEBUG [qtp977522995-24] 21:28:23,264 org.apache.tika.parser.pdf.PDFParser File: /tmp/apache-tika-9115808773791090696.tmp, length: 104932, md5: 092bf24b2cac33fac27965549c99613a

checking the original attachment

    ls -al Get_Started_With_Smallpdf.pdf
        -rw-r--r-- 1 root root 68K Jul 15 12:16 Get_Started_With_Smallpdf.pdf

    file Get_Started_With_Smallpdf.pdf
        Get_Started_With_Smallpdf.pdf: PDF document, version 1.7

    md5sum Get_Started_With_Smallpdf.pdf
        14266e428c6a5f371c5abe164026c762 Get_Started_With_Smallpdf.pdf

checking the temp file named in the log,

    ls -al /tmp/apache-tika-9115808773791090696.tmp
        ls: cannot access '/tmp/apache-tika-9115808773791090696.tmp': No such file or directory

it's not persisted.

in any case, the /tmp file's NOT the same size as the orig pdf -- oddly, it's LARGER than the original (104932 bytes vs. 68K).
dunno what to make of that yet.

fwiw, the received attachment is verified to be identical to the sent original.
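
for comparing any two files the same way, the ls / md5sum checks above can be sketched in Java. this is only an illustration: the class FileFingerprint and its methods are hypothetical names, not Tika code, and the output is merely shaped like the debug line in the log.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tika: prints length and MD5 of files
// so that two copies of a document can be compared.
class FileFingerprint {

    // Returns the MD5 digest of data as 32 lowercase hex characters.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        return String.format("%032x", new BigInteger(1, digest));
    }

    // Builds a line shaped like the log output: path, length, md5.
    static String describe(Path file) throws IOException, NoSuchAlgorithmException {
        byte[] data = Files.readAllBytes(file);
        return "File: " + file + ", length: " + data.length + ", md5: " + md5Hex(data);
    }

    public static void main(String[] args) throws Exception {
        // Pass the files to compare as command line arguments.
        for (String arg : args) {
            System.out.println(describe(Paths.get(arg)));
        }
    }
}
```

run it with the original attachment and a kept copy of the temp file as arguments; differing lengths or checksums mean the two are not the same bytes.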

