Here is the content of the metadata object:

X-TIKA:Parsed-By=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By=org.apache.tika.parser.gdal.GDALParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.gdal.GDALParser 
Content-Type=image/png
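
To make the dump above easier to read, here is a minimal sketch in plain Java (no Tika dependency) that groups the key=value lines and checks which parser class names appear under X-TIKA:Parsed-By. The class name ParsedByCheck and its two helper methods are hypothetical, made up for illustration; only the metadata lines are taken from the output above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: it only inspects a textual metadata dump.
public class ParsedByCheck {

    /** Groups "key=value" lines by key, keeping every value of a repeated key. */
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> byKey = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // not a key=value line, ignore it
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return byKey;
    }

    /** True if a class with the given simple name is listed under X-TIKA:Parsed-By. */
    static boolean parsedBy(Map<String, List<String>> metadata, String simpleName) {
        for (String cls : metadata.getOrDefault("X-TIKA:Parsed-By", List.of())) {
            if (cls.endsWith("." + simpleName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The metadata lines exactly as reported above.
        List<String> dump = List.of(
            "X-TIKA:Parsed-By=org.apache.tika.parser.DefaultParser",
            "X-TIKA:Parsed-By=org.apache.tika.parser.gdal.GDALParser",
            "X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.DefaultParser",
            "X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.gdal.GDALParser",
            "Content-Type=image/png");

        Map<String, List<String>> metadata = group(dump);
        System.out.println("GDALParser used: " + parsedBy(metadata, "GDALParser"));
        System.out.println("TesseractOCRParser used: " + parsedBy(metadata, "TesseractOCRParser"));
    }
}
```

On this dump it reports GDALParser as used and TesseractOCRParser as not used, which matches the empty handler output.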

And here is the dependency tree:

[INFO] fr.pilato.elasticsearch.crawler:fscrawler-tika:jar:2.10-SNAPSHOT
[INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-framework:jar:2.10-SNAPSHOT:compile
[INFO] | +- commons-io:commons-io:jar:2.11.0:compile
[INFO] | +- com.fasterxml.jackson.core:jackson-core:jar:2.13.3:compile
[INFO] | +- com.fasterxml.jackson.core:jackson-databind:jar:2.13.3:compile
[INFO] | +- com.fasterxml.jackson.datatype:jackson-datatype-jsr310:jar:2.13.3:compile
[INFO] | +- com.fasterxml.jackson.dataformat:jackson-dataformat-xml:jar:2.13.3:compile
[INFO] | | +- org.codehaus.woodstox:stax2-api:jar:4.2.1:compile
[INFO] | | \- com.fasterxml.woodstox:woodstox-core:jar:6.3.1:compile
[INFO] | +- com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:jar:2.13.3:compile
[INFO] | | \- org.yaml:snakeyaml:jar:1.30:compile
[INFO] | +- com.fasterxml.jackson.core:jackson-annotations:jar:2.13.3:compile
[INFO] | +- com.jayway.jsonpath:json-path:jar:2.7.0:compile
[INFO] | | \- net.minidev:json-smart:jar:2.4.7:compile
[INFO] | | \- net.minidev:accessors-smart:jar:2.4.7:compile
[INFO] | +- org.apache.logging.log4j:log4j-core:jar:2.18.0:compile
[INFO] | | \- org.apache.logging.log4j:log4j-api:jar:2.18.0:compile
[INFO] | +- org.apache.logging.log4j:log4j-1.2-api:jar:2.18.0:compile
[INFO] | +- org.apache.logging.log4j:log4j-slf4j-impl:jar:2.18.0:compile
[INFO] | +- org.apache.logging.log4j:log4j-jcl:jar:2.18.0:compile
[INFO] | | \- commons-logging:commons-logging:jar:1.2:compile
[INFO] | +- org.apache.logging.log4j:log4j-jul:jar:2.18.0:compile
[INFO] | \- org.fusesource.jansi:jansi:jar:2.4.0:compile
[INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-beans:jar:2.10-SNAPSHOT:compile
[INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-settings:jar:2.10-SNAPSHOT:compile
[INFO] +- org.apache.tika:tika-core:jar:2.4.1:compile
[INFO] | \- org.slf4j:slf4j-api:jar:1.7.36:compile
[INFO] +- org.apache.tika:tika-parsers-standard-package:jar:2.4.1:compile
[INFO] | +- org.apache.tika:tika-parser-apple-module:jar:2.4.1:compile
[INFO] | | +- org.apache.tika:tika-parser-zip-commons:jar:2.4.1:compile
[INFO] | | \- com.googlecode.plist:dd-plist:jar:1.23:compile
[INFO] | +- org.apache.tika:tika-parser-audiovideo-module:jar:2.4.1:compile
[INFO] | | \- com.drewnoakes:metadata-extractor:jar:2.18.0:compile
[INFO] | | \- com.adobe.xmp:xmpcore:jar:6.1.11:compile
[INFO] | +- org.apache.tika:tika-parser-cad-module:jar:2.4.1:compile
[INFO] | +- org.apache.tika:tika-parser-code-module:jar:2.4.1:compile
[INFO] | | +- org.codelibs:jhighlight:jar:1.1.0:compile
[INFO] | | +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] | | +- org.ow2.asm:asm:jar:9.3:compile
[INFO] | | +- com.epam:parso:jar:2.0.14:compile
[INFO] | | \- org.tallison:jmatio:jar:1.5:compile
[INFO] | +- org.apache.tika:tika-parser-crypto-module:jar:2.4.1:compile
[INFO] | | +- org.bouncycastle:bcmail-jdk15on:jar:1.70:compile
[INFO] | | | +- org.bouncycastle:bcutil-jdk15on:jar:1.70:compile
[INFO] | | | \- org.bouncycastle:bcpkix-jdk15on:jar:1.70:compile
[INFO] | | \- org.bouncycastle:bcprov-jdk15on:jar:1.70:compile
[INFO] | +- org.apache.tika:tika-parser-digest-commons:jar:2.4.1:compile
[INFO] | | \- commons-codec:commons-codec:jar:1.15:compile
[INFO] | +- org.apache.tika:tika-parser-font-module:jar:2.4.1:compile
[INFO] | | \- org.apache.pdfbox:fontbox:jar:2.0.26:compile
[INFO] | +- org.apache.tika:tika-parser-html-module:jar:2.4.1:compile
[INFO] | | \- org.apache.tika:tika-parser-html-commons:jar:2.4.1:compile
[INFO] | | \- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] | +- org.apache.tika:tika-parser-image-module:jar:2.4.1:compile
[INFO] | | +- com.github.jai-imageio:jai-imageio-core:jar:1.4.0:compile
[INFO] | | \- org.apache.pdfbox:jbig2-imageio:jar:3.0.4:compile
[INFO] | +- org.apache.tika:tika-parser-mail-module:jar:2.4.1:compile
[INFO] | | \- org.apache.tika:tika-parser-mail-commons:jar:2.4.1:compile
[INFO] | | +- org.apache.james:apache-mime4j-core:jar:0.8.4:compile
[INFO] | | \- org.apache.james:apache-mime4j-dom:jar:0.8.4:compile
[INFO] | +- org.apache.tika:tika-parser-microsoft-module:jar:2.4.1:compile
[INFO] | | +- com.pff:java-libpst:jar:0.9.3:compile
[INFO] | | +- org.apache.commons:commons-lang3:jar:3.12.0:compile
[INFO] | | +- org.apache.poi:poi:jar:5.2.2:compile
[INFO] | | | +- org.apache.commons:commons-math3:jar:3.6.1:compile
[INFO] | | | \- com.zaxxer:SparseBitSet:jar:1.2:compile
[INFO] | | +- org.apache.poi:poi-scratchpad:jar:5.2.2:compile
[INFO] | | +- org.apache.poi:poi-ooxml:jar:5.2.2:compile
[INFO] | | | +- org.apache.poi:poi-ooxml-lite:jar:5.2.2:compile
[INFO] | | | +- org.apache.xmlbeans:xmlbeans:jar:5.0.3:compile
[INFO] | | | \- com.github.virtuald:curvesapi:jar:1.07:compile
[INFO] | | +- com.healthmarketscience.jackcess:jackcess:jar:4.0.1:compile
[INFO] | | \- com.healthmarketscience.jackcess:jackcess-encrypt:jar:4.0.1:compile
[INFO] | +- org.slf4j:jcl-over-slf4j:jar:1.7.36:compile
[INFO] | +- org.apache.tika:tika-parser-miscoffice-module:jar:2.4.1:compile
[INFO] | | \- org.apache.commons:commons-collections4:jar:4.4:compile
[INFO] | +- org.apache.tika:tika-parser-news-module:jar:2.4.1:compile
[INFO] | | \- com.rometools:rome:jar:1.18.0:compile
[INFO] | | \- com.rometools:rome-utils:jar:1.18.0:compile
[INFO] | +- org.apache.tika:tika-parser-ocr-module:jar:2.4.1:compile
[INFO] | | \- org.apache.commons:commons-exec:jar:1.3:compile
[INFO] | +- org.apache.tika:tika-parser-pdf-module:jar:2.4.1:compile
[INFO] | | +- org.apache.pdfbox:pdfbox:jar:2.0.26:compile
[INFO] | | +- org.apache.pdfbox:pdfbox-tools:jar:2.0.26:compile
[INFO] | | | \- org.apache.pdfbox:pdfbox-debugger:jar:2.0.26:compile
[INFO] | | \- org.apache.pdfbox:jempbox:jar:1.8.16:compile
[INFO] | +- org.apache.tika:tika-parser-pkg-module:jar:2.4.1:compile
[INFO] | | +- org.tukaani:xz:jar:1.9:compile
[INFO] | | +- org.brotli:dec:jar:0.1.2:compile
[INFO] | | \- com.github.junrar:junrar:jar:7.5.2:compile
[INFO] | +- org.apache.tika:tika-parser-text-module:jar:2.4.1:compile
[INFO] | | \- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
[INFO] | +- org.apache.tika:tika-parser-webarchive-module:jar:2.4.1:compile
[INFO] | | +- org.netpreserve:jwarc:jar:0.18.1:compile
[INFO] | | \- org.apache.commons:commons-compress:jar:1.21:compile
[INFO] | +- org.apache.tika:tika-parser-xml-module:jar:2.4.1:compile
[INFO] | | \- xerces:xercesImpl:jar:2.12.2:compile
[INFO] | | \- xml-apis:xml-apis:jar:1.4.01:compile
[INFO] | +- org.apache.tika:tika-parser-xmp-commons:jar:2.4.1:compile
[INFO] | | \- org.apache.pdfbox:xmpbox:jar:2.0.26:compile
[INFO] | +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
[INFO] | \- org.gagravarr:vorbis-java-core:jar:0.8:compile
[INFO] +- org.apache.tika:tika-parser-scientific-module:jar:2.4.1:compile
[INFO] | +- org.apache.sis.core:sis-utility:jar:1.2:compile
[INFO] | | \- javax.measure:unit-api:jar:1.0:compile
[INFO] | +- org.apache.sis.storage:sis-netcdf:jar:1.2:compile
[INFO] | | +- org.apache.sis.storage:sis-storage:jar:1.2:compile
[INFO] | | | \- org.apache.sis.core:sis-feature:jar:1.2:compile
[INFO] | | \- org.apache.sis.core:sis-referencing:jar:1.2:compile
[INFO] | +- org.apache.sis.core:sis-metadata:jar:1.2:compile
[INFO] | | \- jakarta.xml.bind:jakarta.xml.bind-api:jar:3.0.1:compile
[INFO] | +- org.opengis:geoapi:jar:3.0.1:compile
[INFO] | +- edu.ucar:netcdf4:jar:4.5.5:compile
[INFO] | | +- edu.ucar:cdm:jar:4.5.5:compile
[INFO] | | | +- edu.ucar:udunits:jar:4.5.5:compile
[INFO] | | | +- edu.ucar:httpservices:jar:4.5.5:compile
[INFO] | | | | +- org.apache.httpcomponents:httpclient:jar:4.5.13:compile
[INFO] | | | | \- org.apache.httpcomponents:httpmime:jar:4.5.13:compile
[INFO] | | | +- org.apache.httpcomponents:httpcore:jar:4.4.15:compile
[INFO] | | | +- joda-time:joda-time:jar:2.11.1:compile
[INFO] | | | +- org.quartz-scheduler:quartz:jar:2.3.2:compile
[INFO] | | | | +- com.mchange:c3p0:jar:0.9.5.4:compile
[INFO] | | | | +- com.mchange:mchange-commons-java:jar:0.2.15:compile
[INFO] | | | | \- com.zaxxer:HikariCP-java7:jar:2.4.13:compile
[INFO] | | | \- com.beust:jcommander:jar:1.82:compile
[INFO] | | \- net.java.dev.jna:jna:jar:5.12.1:compile
[INFO] | +- edu.ucar:grib:jar:4.5.5:compile
[INFO] | | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] | | +- org.jdom:jdom2:jar:2.0.6.1:compile
[INFO] | | +- edu.ucar:jj2000:jar:5.2:compile
[INFO] | | \- org.itadaki:bzip2:jar:0.9.1:compile
[INFO] | +- net.jcip:jcip-annotations:jar:1.0:compile
[INFO] | +- org.apache.commons:commons-csv:jar:1.9.0:compile
[INFO] | \- org.glassfish.jaxb:jaxb-runtime:jar:2.3.6:compile
[INFO] | +- org.glassfish.jaxb:txw2:jar:2.3.6:compile
[INFO] | +- com.sun.istack:istack-commons-runtime:jar:3.0.12:compile
[INFO] | \- com.sun.activation:jakarta.activation:jar:2.0.1:compile
[INFO] +- org.apache.tika:tika-parser-sqlite3-module:jar:2.4.1:compile
[INFO] | +- org.apache.tika:tika-parser-jdbc-commons:jar:2.4.1:compile
[INFO] | \- org.xerial:sqlite-jdbc:jar:3.36.0.3:compile
[INFO] +- org.apache.tika:tika-langdetect-optimaize:jar:2.4.1:compile
[INFO] | \- com.optimaize.languagedetector:language-detector:jar:0.6:compile
[INFO] | +- net.arnx:jsonic:jar:1.2.11:compile
[INFO] | +- com.intellij:annotations:jar:12.0:compile
[INFO] | \- com.google.guava:guava:jar:31.1-jre:compile
[INFO] | +- com.google.guava:failureaccess:jar:1.0.1:compile
[INFO] | +- com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:compile
[INFO] | +- com.google.code.findbugs:jsr305:jar:3.0.2:compile
[INFO] | +- org.checkerframework:checker-qual:jar:3.12.0:compile
[INFO] | +- com.google.errorprone:error_prone_annotations:jar:2.11.0:compile
[INFO] | \- com.google.j2objc:j2objc-annotations:jar:1.3:compile
[INFO] +- com.jcraft:jsch:jar:0.1.55:compile
[INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-test-framework:jar:2.10-SNAPSHOT:test
[INFO] | +- org.hamcrest:hamcrest-all:jar:1.3:test
[INFO] | +- junit:junit:jar:4.13.2:test
[INFO] | | \- org.hamcrest:hamcrest-core:jar:1.3:test
[INFO] | \- com.carrotsearch.randomizedtesting:randomizedtesting-runner:jar:2.8.1:test
[INFO] \- fr.pilato.elasticsearch.crawler:fscrawler-test-documents:jar:2.10-SNAPSHOT:test

David
On 1 Sep 2022 at 11:40 +0200, Tim Allison <[email protected]> wrote:
> And what is recorded in the X-TIKA:Parsed-By value in the metadata object?
>
> > On Thu, Sep 1, 2022 at 5:36 AM Tim Allison <[email protected]> wrote:
> > > What are your dependencies? Which parsers are in AutoDetectParser?
> > >
> > > > On Thu, Sep 1, 2022 at 4:38 AM David Pilato <[email protected]> wrote:
> > > > > Hey team
> > > > >
> > > > >
> > > > > I'm wondering what's wrong with my config.
> > > > > I'm running this very basic piece of code:
> > > > > @Test
> > > > > public void testTika() throws TikaException, IOException, SAXException {
> > > > >    BodyContentHandler handler = new BodyContentHandler(new WriteOutContentHandler(1000));
> > > > >    new AutoDetectParser().parse(getBinaryContent("test-ocr.png"), handler, new Metadata(), new ParseContext());
> > > > >    System.out.println("handler = " + handler);
> > > > > }
> > > > >
> > > > > Here are my logs:
> > > > >
> > > > > 16:31:13,089 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> > > > > 16:31:13,560 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> > > > > 16:31:13,564 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
> > > > > 16:31:13,591 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> > > > > 16:31:13,595 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
> > > > > handler =
> > > > >
> > > > >
> > > > > The content is not extracted although Tesseract is detected.
> > > > >
> > > > > When I run Tesseract manually:
> > > > >
> > > > > tesseract test-ocr.png tess.out
> > > > > cat tess.out.txt
> > > > >
> > > > > I'm getting:
> > > > >
> > > > > This file contains some words.
> > > > >
> > > > > tesseract --version gives
> > > > >
> > > > > tesseract 5.2.0
> > > > >  leptonica-1.82.0
> > > > >  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.4 : libopenjp2 2.5.0
> > > > >  Found AVX2
> > > > >  Found AVX
> > > > >  Found FMA
> > > > >  Found SSE4.1
> > > > >  Found libarchive 3.6.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
> > > > >  Found libcurl/7.79.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.45.1
> > > > >
> > > > >
> > > > > What am I missing here?
> > > > >
> > > > >
> > > > >
> > > > > David
