The checkstyle violation is only about coding style. You can delete the checkstyle plugin section in tika-parent/pom.xml if you want, or add <skip>true</skip> below "<configuration>" in that plugin. The same applies to the ossindex-maven-plugin and the forbiddenapis plugin.

If the debugger didn't stop, then the breakpoint was in the wrong place, or it's not possible to debug this way.

Re "is there anything informative in that now-more-verbose DEBUG output?": well, yes, the MD5 output. It proves that the file is different (OK, the different length showed that too).
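
To make that comparison concrete, here is a minimal sketch that prints the length and MD5 of a local file, so the values can be checked against the "length: ..., md5: ..." DEBUG line. The class name Md5Check and the command-line argument are made up for illustration, and it only assumes that Tika logs a plain lowercase hex MD5 (which is what the log line looks like); it is not the parser's own calcMD5.

```java
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tika.
class Md5Check {

    // Returns the MD5 of the given bytes as 32 lowercase hex characters.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // BigInteger(1, ...) treats the digest as unsigned; %032x keeps leading zeros.
            return String.format("%032x", new BigInteger(1, md.digest(data)));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present in every Java runtime.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path of the original attachment (assumption: you still have it locally)
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("length: " + bytes.length + ", md5: " + md5Hex(bytes));
    }
}
```

If the length or the MD5 printed for the original attachment differs from the logged values, the file was already changed before it reached the PDF parser.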

Tilman


On 19.07.2022 at 11:37, PGNet Dev wrote:
On 7/18/22 11:05 PM, Tilman Hausherr wrote:
Yes the file is deleted...


Alternatively, grab the source code from the trunk, and add this line in the file tika-main\tika-parsers\tika-parsers-standard\tika-parsers-standard-modules\tika-parser-pdf-module\src\main\java\org\apache\tika\parser\pdf\PDFParser.java

Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));

after the line that has ", md5: ".

Then build the parser module, and then the standard server subproject with "mvn -DskipTests install".
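
For reference, a standalone sketch of the same capture idea, outside of Tika. The class and method names (TempFileCapture, capture, demo) are hypothetical, and it uses Files.copy instead of the Files.write/Files.readAllBytes pair from the line above, which has the same effect but does not load the whole file into memory. Inside PDFParser the source path would be tstream.getPath(), as in that line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical helper, not part of Tika.
class TempFileCapture {

    // Copies source to target, overwriting any earlier capture,
    // and returns the size of the captured copy in bytes.
    static long capture(Path source, Path target) throws IOException {
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        return Files.size(target);
    }

    // Small self-check: copy a throwaway temp file and compare the contents.
    static boolean demo() {
        try {
            Path src = Files.createTempFile("capture-src", ".tmp");
            Path dst = Files.createTempFile("capture-dst", ".tmp");
            byte[] data = {1, 2, 3, 4};
            Files.write(src, data);
            long copied = capture(src, dst);
            boolean same = copied == data.length
                    && Arrays.equals(data, Files.readAllBytes(dst));
            Files.delete(src);
            Files.delete(dst);
            return same;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // args[0]: file to capture, args[1]: where to keep the copy
            System.out.println("captured " + capture(Paths.get(args[0]), Paths.get(args[1])) + " bytes");
        } else {
            System.out.println("self-check passed: " + demo());
        }
    }
}
```

The copy has to happen while the temp file still exists, which is why the line goes right next to the DEBUG statement.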

1st, attempting the build, FAILs

    cd src/tika
    EDIT tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java

            ...
    168       if (LOG.isDebugEnabled() && tstream != null) {
                LOG.debug("File: " + tstream.getPath() + ", length: " + tstream.getLength() +
                        ", md5: " + calcMD5(tstream.getPath()));
        +        Files.write(Paths.get("/tmp/yourfile.pdf"), Files.readAllBytes(tstream.getPath()));
            }
            ...


    mvn install -pl tika-parsers -am
    mvn -DskipTests install
        ...
        [INFO] BUILD FAILURE
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time:  31.493 s
        [INFO] Finished at: 2022-07-19T04:48:43-04:00
        [INFO] ------------------------------------------------------------------------
        [ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:3.1.2:check (validate) on project tika-parser-pdf-module: You have 1 Checkstyle violation. -> [Help 1]


try setting a breakpoint in org.apache.tika.parser.pdf.PDFParser so that you get that file.

next, run in debugger instead,

    sudo -u tika /usr/bin/jdb \
     -classpath /srv/tika/tika-server.jar \
     org.apache.tika.server.core.TikaServerCli \
     -c /etc/tika/tika-server-config-custom.xml

        Initializing jdb ...

set breakpoint

    > stop in org.apache.tika.parser.pdf.PDFParser
    Deferring breakpoint org.apache.tika.parser.pdf.PDFParser.
    It will be set after the class is loaded.

run it

    > run
    run org.apache.tika.server.core.TikaServerCli -c /etc/tika/tika-server-config-custom.xml
    Set uncaught java.lang.Throwable
    Set deferred uncaught java.lang.Throwable
    >
    VM Started: DEBUG [pool-2-thread-1] 05:21:37,469 org.apache.tika.server.core.TikaServerWatchDog forked process commandline: [/usr/bin/java, -Xms1g, -Xmx1g, -Dpdfbox.fontcache=/var/tika, -Dlog4j2.debug, -Djava.awt.headless=true, -cp, /srv/tika/tika-server.jar, -Dtika.server.id=, org.apache.tika.server.core.TikaServerProcess, -h, 127.0.0.1, -p, 9998, -i, , -c, /etc/tika/tika-server-config-custom.xml, -forkedStatusFile, /tmp/apache-tika-server-forked-tmp-11335114907490900739, -numRestarts, 0]
    ...
    DEBUG [main] 05:21:50,871 org.apache.cxf.endpoint.ServerImpl register the server to serverRegistry
    TRACE StatusLogger Log4jLoggerFactory.getContext() found anchor class org.apache.tika.server.core.ServerStatusWatcher
    INFO  [main] 05:21:50,906 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server  at http://127.0.0.1:9998/

receive email+attachment

*lots* of debug logs @ jdb console,

    -> https://pastebin.com/HDtR9RKP

NOTE, there,

    ...
    DEBUG [qtp485047320-31] 05:22:58,423 org.apache.tika.parser.pdf.PDFParser File: /tmp/apache-tika-11251774738482156793.tmp, length: 104932, md5: 092bf24b2cac33fac27965549c99613a
    ...

but, no file captured

    ls -al /tmp/apache-tika*tmp
        ls: cannot access '/tmp/apache-tika*tmp': No such file or directory

is there anything informative in that now-more-verbose DEBUG output?



