On 17.07.2022 at 15:58, PGNet Dev wrote:
On 7/16/22 10:51 PM, Tilman Hausherr wrote:
You didn't get the exception I mentioned, so set the breakpoint at parse() to get the fileLen. The current error message suggests that bytes have been changed or lost.

IIRC tika saves the PDF in a file in the temp directory before parsing, maybe look there at that time and compare the length and content with your own.


i haven't managed to stop at any *.parse bkpt i set after `jdb -attach`

That is in pdfbox, not in tika.

There's also a PDFParser.parse() in tika, which then calls PDDocument.load(). However I don't know if this will use the InputStream call, or the one with File. If it uses the one with the file, then check the length and content of the file (tika does sometimes store streams into a temporary file).
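
A minimal sketch of that length-and-content check, assuming you can get hold of both the original PDF and the copy the parser received. The class name `PdfCopyCheck` and the method `report` are made up for illustration and are not part of tika or pdfbox; `Files.mismatch` needs Java 12 or later.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: compares an original PDF with the copy the parser received. */
final class PdfCopyCheck {

    /** Returns a one-line report with both lengths and the first differing byte offset. */
    static String report(Path original, Path copy) throws IOException {
        long originalLen = Files.size(original);
        long copyLen = Files.size(copy);
        // Files.mismatch returns -1 when both files are byte-for-byte identical,
        // otherwise the offset of the first differing byte (or the length of the
        // shorter file when one is a prefix of the other).
        long offset = Files.mismatch(original, copy);
        if (offset == -1L) {
            return "identical, length=" + originalLen;
        }
        return "differ at byte " + offset + ", original=" + originalLen + ", copy=" + copyLen;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PdfCopyCheck <original.pdf> <copy.pdf>");
            return;
        }
        System.out.println(report(Path.of(args[0]), Path.of(args[1])));
    }
}
```

A shorter copy would point to lost bytes; equal lengths with an early mismatch offset would point to changed bytes.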

Re the failed build: remove the segment with ossindex-maven-plugin from the parent pom.xml. That plugin (or rather, the company behind it) has gone crazy; we've partly disabled it in the current trunk.

Tilman



wondering if req'd debug info is included/complete in the runnable jar, i decided to try a clean mvn build

    git checkout 2.4.1
    mvn clean
    mvn -X compile -am -pl :tika-server-standard

which fails

    ...
    [DEBUG] 82 component-reports; 16.90 ms
    [WARNING] Excluding coordinates: com.google.guava:guava:31.1-jre
    [INFO] ------------------------------------------------------------------------
    [INFO] Reactor Summary for Apache Tika parent 2.4.1:
    [INFO]
    [INFO] Apache Tika parent ................................. SUCCESS [  0.790 s]
    [INFO] Apache Tika core ................................... SUCCESS [  4.806 s]
    [INFO] Apache Tika serialization .......................... SUCCESS [  0.698 s]
    [INFO] Apache Tika parser modules ......................... SUCCESS [  0.045 s]
    [INFO] Apache Tika standard parser modules and package .... SUCCESS [  0.033 s]
    [INFO] Apache Tika standard parser modules ................ SUCCESS [  0.030 s]
    [INFO] Apache Tika html commons ........................... SUCCESS [  0.114 s]
    [INFO] Apache Tika digest commons ......................... SUCCESS [  0.154 s]
    [INFO] Apache Tika mail commons ........................... SUCCESS [  0.078 s]
    [INFO] Apache Tika XMP commons ............................ SUCCESS [  0.120 s]
    [INFO] Apache Tika ZIP commons ............................ SUCCESS [  0.213 s]
    [INFO] Apache Tika image parser module .................... SUCCESS [  0.355 s]
    [INFO] Apache Tika OCR parser module ...................... SUCCESS [  0.302 s]
    [INFO] Apache Tika audiovideo parser module ............... SUCCESS [  0.369 s]
    [INFO] Apache Tika text parser module ..................... SUCCESS [  0.424 s]
    [INFO] Apache Tika code parser module ..................... SUCCESS [  0.205 s]
    [INFO] Apache Tika html parser module ..................... SUCCESS [  0.305 s]
    [INFO] Apache Tika font parser module ..................... SUCCESS [  0.078 s]
    [INFO] Apache Tika XML parser module ...................... SUCCESS [  0.132 s]
    [INFO] Apache Tika Microsoft parser module ................ SUCCESS [  2.600 s]
    [INFO] Apache Tika package parser module .................. SUCCESS [  0.145 s]
    [INFO] Apache Tika PDF parser module ...................... SUCCESS [  0.667 s]
    [INFO] Apache Tika Apple parser module .................... SUCCESS [  0.216 s]
    [INFO] Apache Tika cad parser module ...................... SUCCESS [  0.203 s]
    [INFO] Apache Tika mail parser module ..................... SUCCESS [  0.187 s]
    [INFO] Apache Tika miscellaneous office format parser module SUCCESS [  0.421 s]
    [INFO] Apache Tika news parser module ..................... SUCCESS [  0.163 s]
    [INFO] Apache Tika crypto parser module ................... SUCCESS [  0.106 s]
    [INFO] Apache Tika WARC parser module ..................... SUCCESS [  0.104 s]
    [INFO] Apache Tika standard parser package ................ SUCCESS [  0.565 s]
    [INFO] Apache Tika XMP .................................... SUCCESS [  0.286 s]
    [INFO] Apache Tika language detection ..................... SUCCESS [  0.021 s]
    [INFO] Apache Tika langdetect test commons ................ SUCCESS [  0.057 s]
    [INFO] Apache Tika Optimaize langdetect ................... SUCCESS [  0.108 s]
    [INFO] Apache Tika OpenNLP langdetect ..................... SUCCESS [  0.114 s]
    [INFO] Apache Tika pipes .................................. SUCCESS [  0.018 s]
    [INFO] Apache Tika emitters ............................... SUCCESS [  0.017 s]
    [INFO] Apache Tika filesystem emitter ..................... SUCCESS [  0.065 s]
    [INFO] Apache Tika translate .............................. SUCCESS [  0.446 s]
    [INFO] Apache Tika server module .......................... SUCCESS [  0.019 s]
    [INFO] Apache Tika server core ............................ FAILURE [  0.112 s]
    [INFO] Apache Tika standard server ........................ SKIPPED
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD FAILURE
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time:  16.545 s
    [INFO] Finished at: 2022-07-17T09:41:53-04:00
    [INFO] ------------------------------------------------------------------------
    [ERROR] Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit (audit-dependencies) on project tika-server-core: Detected 2 vulnerable components:
    [ERROR] org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/[email protected]?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
    [ERROR]     * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
    [ERROR] org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/[email protected]?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
    [ERROR]     * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
    [ERROR]
    [ERROR] Excluded coordinates:
    [ERROR]   - com.google.guava:guava:31.1-jre
    [ERROR]
    [ERROR] -> [Help 1]
    org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit (audit-dependencies) on project tika-server-core: Detected 2 vulnerable components:
      org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/[email protected]?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
        * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
      org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/[email protected]?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
        * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1

    Excluded coordinates:
      - com.google.guava:guava:31.1-jre

        at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
        at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
        at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
        at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
        at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
        at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
        at org.apache.maven.cli.MavenCli.execute (MavenCli.java:972)
        at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
        at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
        at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:104)
        at java.lang.reflect.Method.invoke (Method.java:577)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
        at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
        at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
    Caused by: org.apache.maven.plugin.MojoFailureException: Detected 2 vulnerable components:
      org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/[email protected]?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
        * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
      org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/[email protected]?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
        * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1

    Excluded coordinates:
      - com.google.guava:guava:31.1-jre

        at org.sonatype.ossindex.maven.plugin.AuditMojoSupport.execute (AuditMojoSupport.java:257)
        at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
        at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
        at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
        at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
        at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
        at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
        at org.apache.maven.cli.MavenCli.execute (MavenCli.java:972)
        at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
        at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
        at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:104)
        at java.lang.reflect.Method.invoke (Method.java:577)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
        at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
        at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
    [ERROR]
    [ERROR]
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
    [ERROR]
    [ERROR] After correcting the problems, you can resume the build with the command
    [ERROR]   mvn <args> -rf :tika-server-core

checking @

    https://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

        "Unlike many other errors, this exception is not generated by the Maven core itself but by a plugin. As a rule of thumb, plugins use this error to signal a failure of the build because there is something wrong with the dependencies or sources of a project, e.g. a compilation or a test failure."

in /tmp

immediately after tika-server start

    '/usr/bin/tree -Csup --timefmt "%F %R:%S %z"' /tmp | grep tika
        ├── [-rw------- tika               0 2022-07-17 09:54:08 -0400]  apache-tika-server-forked-tmp-16337036696243797817
        ├── [drwxr-xr-x tika              80 2022-07-17 09:54:08 -0400]  hsperfdata_tika
        │   ├── [-rw------- tika           32768 2022-07-17 09:54:04 -0400]  15865
        │   └── [-rw------- tika           32768 2022-07-17 09:54:08 -0400]  15902

, and, same -- i.e. nothing added -- after receipt of email with failed tika scan/parse

anyone have some explicit instructions for setting a catchable breakpoint in a jdb -attach to tika-server?
or, error-free build instructions?

