On 7/18/22 11:05 PM, Tilman Hausherr wrote:
Yes the file is deleted...
Alternatively, grab the source code from the trunk, and add this line in the
file
tika-main\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pdf-module\src\main\java\org\apache\tika\parser\pdf\PDFParser.java
Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));
after the line that has ", md5: ".
Then build the parser module, and then the standard server subproject with
"mvn -DskipTests install".
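For reference, the suggested one-liner amounts to copying the temp file somewhere stable before it gets deleted. A minimal standalone sketch of that idea, outside of Tika, is below. The class name TempFileCapture, the helper capture() and the "captured-" destination name are made up for illustration; only the java.nio.file calls are real. Inside PDFParser.java the same java.nio.file imports would presumably be needed as well.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempFileCapture {

    /**
     * Copies src to dest, replacing any existing file at dest,
     * and returns the size in bytes of the copy.
     */
    static long capture(Path src, Path dest) throws IOException {
        Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING);
        return Files.size(dest);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the parser's temp file (hypothetical content).
        Path src = Files.createTempFile("apache-tika-", ".tmp");
        Files.write(src, "%PDF".getBytes(StandardCharsets.US_ASCII));

        // Hypothetical destination next to the temp file.
        Path dest = src.resolveSibling("captured-" + src.getFileName());
        System.out.println(capture(src, dest) + " bytes copied to " + dest);

        // Deleting the original leaves the captured copy in place.
        Files.delete(src);
        System.out.println("copy still exists: " + Files.exists(dest));
    }
}
```

Files.copy avoids reading the whole file into memory, unlike Files.readAllBytes + Files.write, though for a ~100 KB PDF either should be fine.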
1st, attempted the build; it FAILs
cd src/tika
EDIT
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
...
            if (LOG.isDebugEnabled() && tstream != null) {
                LOG.debug("File: " + tstream.getPath() + ", length: " + tstream.getLength() +
                        ", md5: " + calcMD5(tstream.getPath()));
+               Files.write(Paths.get("/tmp/yourfile.pdf"),
+                       Files.readAllBytes(tstream.getPath()));
            }
...
mvn install -pl tika-parsers -am
mvn -DskipTests install
...
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 31.493 s
[INFO] Finished at: 2022-07-19T04:48:43-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:3.1.2:check (validate) on project tika-parser-pdf-module: You have 1 Checkstyle violation. -> [Help 1]
try setting a breakpoint in org.apache.tika.parser.pdf.PDFParser so that you
get that file.
next, run in debugger instead,
sudo -u tika /usr/bin/jdb \
-classpath /srv/tika/tika-server.jar \
org.apache.tika.server.core.TikaServerCli \
-c /etc/tika/tika-server-config-custom.xml
Initializing jdb ...
set breakpoint
> stop in org.apache.tika.parser.pdf.PDFParser
Deferring breakpoint org.apache.tika.parser.pdf.PDFParser.
It will be set after the class is loaded.
run it
> run
run org.apache.tika.server.core.TikaServerCli -c /etc/tika/tika-server-config-custom.xml
Set uncaught java.lang.Throwable
Set deferred uncaught java.lang.Throwable
>
VM Started: DEBUG [pool-2-thread-1] 05:21:37,469
org.apache.tika.server.core.TikaServerWatchDog forked process commandline:
[/usr/bin/java, -Xms1g, -Xmx1g, -Dpdfbox.fontcache=/var/tika, -Dlog4j2.debug,
-Djava.awt.headless=true, -cp, /srv/tika/tika-server.jar, -Dtika.server.id=,
org.apache.tika.server.core.TikaServerProcess, -h, 127.0.0.1, -p, 9998, -i, ,
-c, /etc/tika/tika-server-config-custom.xml, -forkedStatusFile,
/tmp/apache-tika-server-forked-tmp-11335114907490900739, -numRestarts, 0]
...
DEBUG [main] 05:21:50,871 org.apache.cxf.endpoint.ServerImpl register
the server to serverRegistry
TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class
org.apache.tika.server.core.ServerStatusWatcher
INFO [main] 05:21:50,906 org.apache.tika.server.core.TikaServerProcess
Started Apache Tika server at http://127.0.0.1:9998/
receive email+attachment
*lots* of debug logs @ jdb console,
-> https://pastebin.com/HDtR9RKP
NOTE, there,
...
DEBUG [qtp485047320-31] 05:22:58,423 org.apache.tika.parser.pdf.PDFParser File: /tmp/apache-tika-11251774738482156793.tmp, length: 104932, md5: 092bf24b2cac33fac27965549c99613a
...
but, no file captured
ls -al /tmp/apache-tika*tmp
ls: cannot access '/tmp/apache-tika*tmp': No such file or directory
is there anything informative in that now-more-verbose DEBUG output?