Yes, the file is deleted. Try setting a breakpoint in
org.apache.tika.parser.pdf.PDFParser so that you can copy that file
while it still exists.
Alternatively, grab the source code from the trunk and add this line to
the file
tika-main\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pdf-module\src\main\java\org\apache\tika\parser\pdf\PDFParser.java

    Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));

after the line that contains ", md5: ".
Then build the parser module, and then the standard server subproject
with "mvn -DskipTests install".
The file tika-server-standard-2.4.2-SNAPSHOT.jar will be in
tika-main\tika-server\tika-server-standard\target
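If the PDF is large, the same dump can be done without reading the whole file into memory. This is only a sketch of that idea using plain java.nio; the class name TempFileDump and the method dump are made up for illustration and are not part of Tika. Inside PDFParser the source path would be tstream.getPath(), as in the line above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not Tika code: keeps a copy of a temp file before it is deleted.
public class TempFileDump {

    // Copies source to target, overwriting an earlier dump if one exists,
    // and returns the size of the copy in bytes.
    static long dump(Path source, Path target) throws IOException {
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        return Files.size(target);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the parser's temp file, so the sketch runs on its own.
        Path source = Files.createTempFile("apache-tika-demo", ".tmp");
        Files.write(source, "%PDF-1.7 demo".getBytes(StandardCharsets.US_ASCII));

        Path target = Files.createTempFile("yourfile", ".pdf");
        System.out.println("copied " + dump(source, target) + " bytes to " + target);
    }
}
```

The returned size can be checked against the "length:" value in the debug log line.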
I can also do it for you and upload the jar file somewhere, but
obviously that's risky.
Tilman
On 19.07.2022 at 03:53, PGNet Dev wrote:
I've just improved the output, I'm adding an MD5 checksum. This would
be another indicator that something is wrong (or not).
indeed.
i now see in the logs
Jul 18 21:28:23 mx-test tika[18970]: DEBUG [qtp977522995-24] 21:28:23,264 org.apache.tika.parser.pdf.PDFParser File: /tmp/apache-tika-9115808773791090696.tmp, length: 104932, md5: 092bf24b2cac33fac27965549c99613a
checking the original attachment
ls -al Get_Started_With_Smallpdf.pdf
-rw-r--r-- 1 root root 68K Jul 15 12:16 Get_Started_With_Smallpdf.pdf
file Get_Started_With_Smallpdf.pdf
Get_Started_With_Smallpdf.pdf: PDF document, version 1.7
md5sum Get_Started_With_Smallpdf.pdf
14266e428c6a5f371c5abe164026c762 Get_Started_With_Smallpdf.pdf
checking,
ls -al /tmp/apache-tika-9115808773791090696.tmp
ls: cannot access '/tmp/apache-tika-9115808773791090696.tmp': No such file or directory
so the temp file is not persisted.
in any case, the /tmp file is NOT the same size as the orig pdf --
oddly, it is LARGER than the original file (104932 bytes logged vs. 68K),
and the md5 sums differ as well.
dunno what to make of that yet.
fwiw, the received attachment is verified to be identical to the sent
original.
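For comparing files on both ends, the length and MD5 can also be computed with a few lines of Java, printed in roughly the same shape as the debug log line above. This is a minimal sketch using only the JDK; the class name FileFingerprint and the method md5Hex are hypothetical and it is not how Tika itself computes the checksum.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not Tika code: prints length and MD5 for each file given.
public class FileFingerprint {

    // Streams the file through MD5 and returns the digest as lowercase hex.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // One output line per file passed on the command line.
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println("File: " + p + ", length: " + Files.size(p)
                    + ", md5: " + md5Hex(p));
        }
    }
}
```

Running it as "java FileFingerprint.java Get_Started_With_Smallpdf.pdf" on the original attachment, and on a dumped copy of the temp file, gives two lines that can be compared directly.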