On 17.07.2022 at 15:58, PGNet Dev wrote:
On 7/16/22 10:51 PM, Tilman Hausherr wrote:
If you didn't get the exception I mentioned, then set a breakpoint at
parse() to get the fileLen. The current error message suggests that
bytes have been changed or lost.
IIRC Tika saves the PDF to a file in the temp directory before
parsing; maybe look there at that time and compare its length and
content with your own copy.
i haven't managed to stop at any *.parse breakpoint i set after `jdb -attach`
That parse() is in PDFBox, not in Tika.
There's also a PDFParser.parse() in Tika, which then calls
PDDocument.load(). However, I don't know whether it uses the InputStream
overload or the one that takes a File. If it uses the File one, then check
the length and content of that file (Tika does sometimes store streams
in a temporary file).
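A rough sketch of that length-and-content check, in case it helps. This is not part of Tika or PDFBox; the class name TempFileCompare and both file paths are placeholders, and it assumes Java 12 or later for Files.mismatch. It reports both lengths and the offset of the first differing byte, which should show whether bytes were changed or simply lost at the end.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: compares a temp-file copy with the original PDF. */
public class TempFileCompare {

    /**
     * Returns a one-line report saying whether the two files have the
     * same length and the same bytes.
     */
    public static String compare(Path original, Path tempCopy) throws IOException {
        long lenA = Files.size(original);
        long lenB = Files.size(tempCopy);
        // Files.mismatch (Java 12+) returns -1 when the contents are identical,
        // otherwise the offset of the first differing byte. If one file is a
        // prefix of the other, it returns the length of the shorter file.
        long firstDiff = Files.mismatch(original, tempCopy);
        if (firstDiff == -1L) {
            return "identical, length=" + lenA;
        }
        return "differ: lengths " + lenA + " vs " + lenB
                + ", first mismatch at byte offset " + firstDiff;
    }

    public static void main(String[] args) throws IOException {
        // Both paths are placeholders supplied on the command line.
        if (args.length != 2) {
            System.err.println("usage: java TempFileCompare.java <original.pdf> <temp-copy>");
            return;
        }
        System.out.println(compare(Path.of(args[0]), Path.of(args[1])));
    }
}
```

It can be run as a single source file, e.g. `java TempFileCompare.java original.pdf /tmp/<the temp file>`, while the temp file still exists (for instance while the JVM is suspended at a breakpoint).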
Re the failed build: remove the segment with ossindex-maven-plugin from
the parent pom.xml. That plugin (or rather, the company behind it) has
gone crazy; we've partly disabled it in the current trunk.
Tilman
wondering if the required debug info is included/complete in the runnable
jar, i decided to try a clean mvn build
git checkout 2.4.1
mvn clean
mvn -X compile -am -pl :tika-server-standard
which fails
...
[DEBUG] 82 component-reports; 16.90 ms
[WARNING] Excluding coordinates: com.google.guava:guava:31.1-jre
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Tika parent 2.4.1:
[INFO]
[INFO] Apache Tika parent ................................. SUCCESS [ 0.790 s]
[INFO] Apache Tika core ................................... SUCCESS [ 4.806 s]
[INFO] Apache Tika serialization .......................... SUCCESS [ 0.698 s]
[INFO] Apache Tika parser modules ......................... SUCCESS [ 0.045 s]
[INFO] Apache Tika standard parser modules and package .... SUCCESS [ 0.033 s]
[INFO] Apache Tika standard parser modules ................ SUCCESS [ 0.030 s]
[INFO] Apache Tika html commons ........................... SUCCESS [ 0.114 s]
[INFO] Apache Tika digest commons ......................... SUCCESS [ 0.154 s]
[INFO] Apache Tika mail commons ........................... SUCCESS [ 0.078 s]
[INFO] Apache Tika XMP commons ............................ SUCCESS [ 0.120 s]
[INFO] Apache Tika ZIP commons ............................ SUCCESS [ 0.213 s]
[INFO] Apache Tika image parser module .................... SUCCESS [ 0.355 s]
[INFO] Apache Tika OCR parser module ...................... SUCCESS [ 0.302 s]
[INFO] Apache Tika audiovideo parser module ............... SUCCESS [ 0.369 s]
[INFO] Apache Tika text parser module ..................... SUCCESS [ 0.424 s]
[INFO] Apache Tika code parser module ..................... SUCCESS [ 0.205 s]
[INFO] Apache Tika html parser module ..................... SUCCESS [ 0.305 s]
[INFO] Apache Tika font parser module ..................... SUCCESS [ 0.078 s]
[INFO] Apache Tika XML parser module ...................... SUCCESS [ 0.132 s]
[INFO] Apache Tika Microsoft parser module ................ SUCCESS [ 2.600 s]
[INFO] Apache Tika package parser module .................. SUCCESS [ 0.145 s]
[INFO] Apache Tika PDF parser module ...................... SUCCESS [ 0.667 s]
[INFO] Apache Tika Apple parser module .................... SUCCESS [ 0.216 s]
[INFO] Apache Tika cad parser module ...................... SUCCESS [ 0.203 s]
[INFO] Apache Tika mail parser module ..................... SUCCESS [ 0.187 s]
[INFO] Apache Tika miscellaneous office format parser module SUCCESS [ 0.421 s]
[INFO] Apache Tika news parser module ..................... SUCCESS [ 0.163 s]
[INFO] Apache Tika crypto parser module ................... SUCCESS [ 0.106 s]
[INFO] Apache Tika WARC parser module ..................... SUCCESS [ 0.104 s]
[INFO] Apache Tika standard parser package ................ SUCCESS [ 0.565 s]
[INFO] Apache Tika XMP .................................... SUCCESS [ 0.286 s]
[INFO] Apache Tika language detection ..................... SUCCESS [ 0.021 s]
[INFO] Apache Tika langdetect test commons ................ SUCCESS [ 0.057 s]
[INFO] Apache Tika Optimaize langdetect ................... SUCCESS [ 0.108 s]
[INFO] Apache Tika OpenNLP langdetect ..................... SUCCESS [ 0.114 s]
[INFO] Apache Tika pipes .................................. SUCCESS [ 0.018 s]
[INFO] Apache Tika emitters ............................... SUCCESS [ 0.017 s]
[INFO] Apache Tika filesystem emitter ..................... SUCCESS [ 0.065 s]
[INFO] Apache Tika translate .............................. SUCCESS [ 0.446 s]
[INFO] Apache Tika server module .......................... SUCCESS [ 0.019 s]
[INFO] Apache Tika server core ............................ FAILURE [ 0.112 s]
[INFO] Apache Tika standard server ........................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16.545 s
[INFO] Finished at: 2022-07-17T09:41:53-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit (audit-dependencies) on project tika-server-core: Detected 2 vulnerable components:
[ERROR]   org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
[ERROR]     * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
[ERROR]   org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
[ERROR]     * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
[ERROR]
[ERROR] Excluded coordinates:
[ERROR] - com.google.guava:guava:31.1-jre
[ERROR]
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.sonatype.ossindex.maven:ossindex-maven-plugin:3.2.0:audit (audit-dependencies) on project tika-server-core: Detected 2 vulnerable components:
  org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
    * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
  org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
    * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
Excluded coordinates:
- com.google.guava:guava:31.1-jre
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:972)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:104)
    at java.lang.reflect.Method.invoke (Method.java:577)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: org.apache.maven.plugin.MojoFailureException: Detected 2 vulnerable components:
  org.eclipse.jetty:jetty-server:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-server@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
    * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-server&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
  org.eclipse.jetty:jetty-http:jar:9.4.46.v20220331:compile; https://ossindex.sonatype.org/component/pkg:maven/org.eclipse.jetty/jetty-http@9.4.46.v20220331?utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
    * [CVE-2022-2047] CWE-20: Improper Input Validation (2.7); https://ossindex.sonatype.org/vulnerability/CVE-2022-2047?component-type=maven&component-name=org.eclipse.jetty%2Fjetty-http&utm_source=ossindex-client&utm_medium=integration&utm_content=1.8.1
Excluded coordinates:
- com.google.guava:guava:31.1-jre
    at org.sonatype.ossindex.maven.plugin.AuditMojoSupport.execute (AuditMojoSupport.java:257)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:972)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:104)
    at java.lang.reflect.Method.invoke (Method.java:577)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :tika-server-core
checking @
https://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
"Unlike many other errors, this exception is not generated by
the Maven core itself but by a plugin. As a rule of thumb, plugins use
this error to signal a failure of the build because there is something
wrong with the dependencies or sources of a project, e.g. a
compilation or a test failure."
in /tmp
immediately after tika-server start
'/usr/bin/tree -Csup --timefmt "%F %R:%S %z"' /tmp | grep tika
├── [-rw------- tika 0 2022-07-17 09:54:08 -0400] apache-tika-server-forked-tmp-16337036696243797817
├── [drwxr-xr-x tika 80 2022-07-17 09:54:08 -0400] hsperfdata_tika
│ ├── [-rw------- tika 32768 2022-07-17 09:54:04 -0400] 15865
│ └── [-rw------- tika 32768 2022-07-17 09:54:08 -0400] 15902
and the listing is the same, i.e. nothing added, after receipt of an email
with a failed tika scan/parse
does anyone have explicit instructions for setting a breakpoint that is
actually hit in a `jdb -attach` session against tika-server?
or error-free build instructions?