Great to know. Thanks Tim!
David

On 1 Sep 2022 at 14:58 +0200, Tim Allison <[email protected]> wrote:
> Ugh. I think you just ran into:
> https://issues.apache.org/jira/browse/TIKA-3812
>
> This will be fixed in the next release, hopefully out next week.
>
> The problem is that gdal is taking precedence over the ImageParser, and the
> gdal parser doesn't know about OCR.
>
>
> On Thu, Sep 1, 2022 at 7:43 AM David Pilato <[email protected]> wrote:
> > > Here is the content of the metadata object:
> > >
> > > X-TIKA:Parsed-By=org.apache.tika.parser.DefaultParser
> > > X-TIKA:Parsed-By=org.apache.tika.parser.gdal.GDALParser
> > > X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.DefaultParser
> > > X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.gdal.GDALParser
> > > Content-Type=image/png
> > >
> > > And here is the dependency tree:
> > >
> > > [INFO] fr.pilato.elasticsearch.crawler:fscrawler-tika:jar:2.10-SNAPSHOT
> > > [INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-framework:jar:2.10-SNAPSHOT:compile
> > > [INFO] | +- commons-io:commons-io:jar:2.11.0:compile
> > > [INFO] | +- com.fasterxml.jackson.core:jackson-core:jar:2.13.3:compile
> > > [INFO] | +- com.fasterxml.jackson.core:jackson-databind:jar:2.13.3:compile
> > > [INFO] | +- com.fasterxml.jackson.datatype:jackson-datatype-jsr310:jar:2.13.3:compile
> > > [INFO] | +- com.fasterxml.jackson.dataformat:jackson-dataformat-xml:jar:2.13.3:compile
> > > [INFO] | | +- org.codehaus.woodstox:stax2-api:jar:4.2.1:compile
> > > [INFO] | | \- com.fasterxml.woodstox:woodstox-core:jar:6.3.1:compile
> > > [INFO] | +- com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:jar:2.13.3:compile
> > > [INFO] | | \- org.yaml:snakeyaml:jar:1.30:compile
> > > [INFO] | +- com.fasterxml.jackson.core:jackson-annotations:jar:2.13.3:compile
> > > [INFO] | +- com.jayway.jsonpath:json-path:jar:2.7.0:compile
> > > [INFO] | | \- net.minidev:json-smart:jar:2.4.7:compile
> > > [INFO] | | \- net.minidev:accessors-smart:jar:2.4.7:compile
> > > [INFO] | +- org.apache.logging.log4j:log4j-core:jar:2.18.0:compile
> > > [INFO] | | \- org.apache.logging.log4j:log4j-api:jar:2.18.0:compile
> > > [INFO] | +- org.apache.logging.log4j:log4j-1.2-api:jar:2.18.0:compile
> > > [INFO] | +- org.apache.logging.log4j:log4j-slf4j-impl:jar:2.18.0:compile
> > > [INFO] | +- org.apache.logging.log4j:log4j-jcl:jar:2.18.0:compile
> > > [INFO] | | \- commons-logging:commons-logging:jar:1.2:compile
> > > [INFO] | +- org.apache.logging.log4j:log4j-jul:jar:2.18.0:compile
> > > [INFO] | \- org.fusesource.jansi:jansi:jar:2.4.0:compile
> > > [INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-beans:jar:2.10-SNAPSHOT:compile
> > > [INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-settings:jar:2.10-SNAPSHOT:compile
> > > [INFO] +- org.apache.tika:tika-core:jar:2.4.1:compile
> > > [INFO] | \- org.slf4j:slf4j-api:jar:1.7.36:compile
> > > [INFO] +- org.apache.tika:tika-parsers-standard-package:jar:2.4.1:compile
> > > [INFO] | +- org.apache.tika:tika-parser-apple-module:jar:2.4.1:compile
> > > [INFO] | | +- org.apache.tika:tika-parser-zip-commons:jar:2.4.1:compile
> > > [INFO] | | \- com.googlecode.plist:dd-plist:jar:1.23:compile
> > > [INFO] | +- org.apache.tika:tika-parser-audiovideo-module:jar:2.4.1:compile
> > > [INFO] | | \- com.drewnoakes:metadata-extractor:jar:2.18.0:compile
> > > [INFO] | | \- com.adobe.xmp:xmpcore:jar:6.1.11:compile
> > > [INFO] | +- org.apache.tika:tika-parser-cad-module:jar:2.4.1:compile
> > > [INFO] | +- org.apache.tika:tika-parser-code-module:jar:2.4.1:compile
> > > [INFO] | | +- org.codelibs:jhighlight:jar:1.1.0:compile
> > > [INFO] | | +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
> > > [INFO] | | +- org.ow2.asm:asm:jar:9.3:compile
> > > [INFO] | | +- com.epam:parso:jar:2.0.14:compile
> > > [INFO] | | \- org.tallison:jmatio:jar:1.5:compile
> > > [INFO] | +- org.apache.tika:tika-parser-crypto-module:jar:2.4.1:compile
> > > [INFO] | | +- org.bouncycastle:bcmail-jdk15on:jar:1.70:compile
> > > [INFO] | | | +- org.bouncycastle:bcutil-jdk15on:jar:1.70:compile
> > > [INFO] | | | \- org.bouncycastle:bcpkix-jdk15on:jar:1.70:compile
> > > [INFO] | | \- org.bouncycastle:bcprov-jdk15on:jar:1.70:compile
> > > [INFO] | +- org.apache.tika:tika-parser-digest-commons:jar:2.4.1:compile
> > > [INFO] | | \- commons-codec:commons-codec:jar:1.15:compile
> > > [INFO] | +- org.apache.tika:tika-parser-font-module:jar:2.4.1:compile
> > > [INFO] | | \- org.apache.pdfbox:fontbox:jar:2.0.26:compile
> > > [INFO] | +- org.apache.tika:tika-parser-html-module:jar:2.4.1:compile
> > > [INFO] | | \- org.apache.tika:tika-parser-html-commons:jar:2.4.1:compile
> > > [INFO] | | \- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
> > > [INFO] | +- org.apache.tika:tika-parser-image-module:jar:2.4.1:compile
> > > [INFO] | | +- com.github.jai-imageio:jai-imageio-core:jar:1.4.0:compile
> > > [INFO] | | \- org.apache.pdfbox:jbig2-imageio:jar:3.0.4:compile
> > > [INFO] | +- org.apache.tika:tika-parser-mail-module:jar:2.4.1:compile
> > > [INFO] | | \- org.apache.tika:tika-parser-mail-commons:jar:2.4.1:compile
> > > [INFO] | | +- org.apache.james:apache-mime4j-core:jar:0.8.4:compile
> > > [INFO] | | \- org.apache.james:apache-mime4j-dom:jar:0.8.4:compile
> > > [INFO] | +- org.apache.tika:tika-parser-microsoft-module:jar:2.4.1:compile
> > > [INFO] | | +- com.pff:java-libpst:jar:0.9.3:compile
> > > [INFO] | | +- org.apache.commons:commons-lang3:jar:3.12.0:compile
> > > [INFO] | | +- org.apache.poi:poi:jar:5.2.2:compile
> > > [INFO] | | | +- org.apache.commons:commons-math3:jar:3.6.1:compile
> > > [INFO] | | | \- com.zaxxer:SparseBitSet:jar:1.2:compile
> > > [INFO] | | +- org.apache.poi:poi-scratchpad:jar:5.2.2:compile
> > > [INFO] | | +- org.apache.poi:poi-ooxml:jar:5.2.2:compile
> > > [INFO] | | | +- org.apache.poi:poi-ooxml-lite:jar:5.2.2:compile
> > > [INFO] | | | +- org.apache.xmlbeans:xmlbeans:jar:5.0.3:compile
> > > [INFO] | | | \- com.github.virtuald:curvesapi:jar:1.07:compile
> > > [INFO] | | +- com.healthmarketscience.jackcess:jackcess:jar:4.0.1:compile
> > > [INFO] | | \- com.healthmarketscience.jackcess:jackcess-encrypt:jar:4.0.1:compile
> > > [INFO] | +- org.slf4j:jcl-over-slf4j:jar:1.7.36:compile
> > > [INFO] | +- org.apache.tika:tika-parser-miscoffice-module:jar:2.4.1:compile
> > > [INFO] | | \- org.apache.commons:commons-collections4:jar:4.4:compile
> > > [INFO] | +- org.apache.tika:tika-parser-news-module:jar:2.4.1:compile
> > > [INFO] | | \- com.rometools:rome:jar:1.18.0:compile
> > > [INFO] | | \- com.rometools:rome-utils:jar:1.18.0:compile
> > > [INFO] | +- org.apache.tika:tika-parser-ocr-module:jar:2.4.1:compile
> > > [INFO] | | \- org.apache.commons:commons-exec:jar:1.3:compile
> > > [INFO] | +- org.apache.tika:tika-parser-pdf-module:jar:2.4.1:compile
> > > [INFO] | | +- org.apache.pdfbox:pdfbox:jar:2.0.26:compile
> > > [INFO] | | +- org.apache.pdfbox:pdfbox-tools:jar:2.0.26:compile
> > > [INFO] | | | \- org.apache.pdfbox:pdfbox-debugger:jar:2.0.26:compile
> > > [INFO] | | \- org.apache.pdfbox:jempbox:jar:1.8.16:compile
> > > [INFO] | +- org.apache.tika:tika-parser-pkg-module:jar:2.4.1:compile
> > > [INFO] | | +- org.tukaani:xz:jar:1.9:compile
> > > [INFO] | | +- org.brotli:dec:jar:0.1.2:compile
> > > [INFO] | | \- com.github.junrar:junrar:jar:7.5.2:compile
> > > [INFO] | +- org.apache.tika:tika-parser-text-module:jar:2.4.1:compile
> > > [INFO] | | \- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
> > > [INFO] | +- org.apache.tika:tika-parser-webarchive-module:jar:2.4.1:compile
> > > [INFO] | | +- org.netpreserve:jwarc:jar:0.18.1:compile
> > > [INFO] | | \- org.apache.commons:commons-compress:jar:1.21:compile
> > > [INFO] | +- org.apache.tika:tika-parser-xml-module:jar:2.4.1:compile
> > > [INFO] | | \- xerces:xercesImpl:jar:2.12.2:compile
> > > [INFO] | | \- xml-apis:xml-apis:jar:1.4.01:compile
> > > [INFO] | +- org.apache.tika:tika-parser-xmp-commons:jar:2.4.1:compile
> > > [INFO] | | \- org.apache.pdfbox:xmpbox:jar:2.0.26:compile
> > > [INFO] | +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
> > > [INFO] | \- org.gagravarr:vorbis-java-core:jar:0.8:compile
> > > [INFO] +- org.apache.tika:tika-parser-scientific-module:jar:2.4.1:compile
> > > [INFO] | +- org.apache.sis.core:sis-utility:jar:1.2:compile
> > > [INFO] | | \- javax.measure:unit-api:jar:1.0:compile
> > > [INFO] | +- org.apache.sis.storage:sis-netcdf:jar:1.2:compile
> > > [INFO] | | +- org.apache.sis.storage:sis-storage:jar:1.2:compile
> > > [INFO] | | | \- org.apache.sis.core:sis-feature:jar:1.2:compile
> > > [INFO] | | \- org.apache.sis.core:sis-referencing:jar:1.2:compile
> > > [INFO] | +- org.apache.sis.core:sis-metadata:jar:1.2:compile
> > > [INFO] | | \- jakarta.xml.bind:jakarta.xml.bind-api:jar:3.0.1:compile
> > > [INFO] | +- org.opengis:geoapi:jar:3.0.1:compile
> > > [INFO] | +- edu.ucar:netcdf4:jar:4.5.5:compile
> > > [INFO] | | +- edu.ucar:cdm:jar:4.5.5:compile
> > > [INFO] | | | +- edu.ucar:udunits:jar:4.5.5:compile
> > > [INFO] | | | +- edu.ucar:httpservices:jar:4.5.5:compile
> > > [INFO] | | | | +- org.apache.httpcomponents:httpclient:jar:4.5.13:compile
> > > [INFO] | | | | \- org.apache.httpcomponents:httpmime:jar:4.5.13:compile
> > > [INFO] | | | +- org.apache.httpcomponents:httpcore:jar:4.4.15:compile
> > > [INFO] | | | +- joda-time:joda-time:jar:2.11.1:compile
> > > [INFO] | | | +- org.quartz-scheduler:quartz:jar:2.3.2:compile
> > > [INFO] | | | | +- com.mchange:c3p0:jar:0.9.5.4:compile
> > > [INFO] | | | | +- com.mchange:mchange-commons-java:jar:0.2.15:compile
> > > [INFO] | | | | \- com.zaxxer:HikariCP-java7:jar:2.4.13:compile
> > > [INFO] | | | \- com.beust:jcommander:jar:1.82:compile
> > > [INFO] | | \- net.java.dev.jna:jna:jar:5.12.1:compile
> > > [INFO] | +- edu.ucar:grib:jar:4.5.5:compile
> > > [INFO] | | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
> > > [INFO] | | +- org.jdom:jdom2:jar:2.0.6.1:compile
> > > [INFO] | | +- edu.ucar:jj2000:jar:5.2:compile
> > > [INFO] | | \- org.itadaki:bzip2:jar:0.9.1:compile
> > > [INFO] | +- net.jcip:jcip-annotations:jar:1.0:compile
> > > [INFO] | +- org.apache.commons:commons-csv:jar:1.9.0:compile
> > > [INFO] | \- org.glassfish.jaxb:jaxb-runtime:jar:2.3.6:compile
> > > [INFO] | +- org.glassfish.jaxb:txw2:jar:2.3.6:compile
> > > [INFO] | +- com.sun.istack:istack-commons-runtime:jar:3.0.12:compile
> > > [INFO] | \- com.sun.activation:jakarta.activation:jar:2.0.1:compile
> > > [INFO] +- org.apache.tika:tika-parser-sqlite3-module:jar:2.4.1:compile
> > > [INFO] | +- org.apache.tika:tika-parser-jdbc-commons:jar:2.4.1:compile
> > > [INFO] | \- org.xerial:sqlite-jdbc:jar:3.36.0.3:compile
> > > [INFO] +- org.apache.tika:tika-langdetect-optimaize:jar:2.4.1:compile
> > > [INFO] | \- com.optimaize.languagedetector:language-detector:jar:0.6:compile
> > > [INFO] | +- net.arnx:jsonic:jar:1.2.11:compile
> > > [INFO] | +- com.intellij:annotations:jar:12.0:compile
> > > [INFO] | \- com.google.guava:guava:jar:31.1-jre:compile
> > > [INFO] | +- com.google.guava:failureaccess:jar:1.0.1:compile
> > > [INFO] | +- com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:compile
> > > [INFO] | +- com.google.code.findbugs:jsr305:jar:3.0.2:compile
> > > [INFO] | +- org.checkerframework:checker-qual:jar:3.12.0:compile
> > > [INFO] | +- com.google.errorprone:error_prone_annotations:jar:2.11.0:compile
> > > [INFO] | \- com.google.j2objc:j2objc-annotations:jar:1.3:compile
> > > [INFO] +- com.jcraft:jsch:jar:0.1.55:compile
> > > [INFO] +- fr.pilato.elasticsearch.crawler:fscrawler-test-framework:jar:2.10-SNAPSHOT:test
> > > [INFO] | +- org.hamcrest:hamcrest-all:jar:1.3:test
> > > [INFO] | +- junit:junit:jar:4.13.2:test
> > > [INFO] | | \- org.hamcrest:hamcrest-core:jar:1.3:test
> > > [INFO] | \- com.carrotsearch.randomizedtesting:randomizedtesting-runner:jar:2.8.1:test
> > > [INFO] \- fr.pilato.elasticsearch.crawler:fscrawler-test-documents:jar:2.10-SNAPSHOT:test
> > >
> > > David
> > > On 1 Sep 2022 at 11:40 +0200, Tim Allison <[email protected]> wrote:
> > > > And, what is recorded in the X-Tika-ParsedBy value in the metadata
> > > > object?
> > > >
> > > > > On Thu, Sep 1, 2022 at 5:36 AM Tim Allison <[email protected]> wrote:
> > > > > > What are your dependencies? Which parsers are in AutoDetectParser?
> > > > > >
> > > > > > > On Thu, Sep 1, 2022 at 4:38 AM David Pilato <[email protected]> wrote:
> > > > > > > > Hey team
> > > > > > > >
> > > > > > > > I'm wondering what's wrong with my config.
> > > > > > > > I'm running this very basic piece of code:
> > > > > > > >
> > > > > > > > @Test
> > > > > > > > public void testTika() throws TikaException, IOException, SAXException {
> > > > > > > >     BodyContentHandler handler = new BodyContentHandler(new WriteOutContentHandler(1000));
> > > > > > > >     new AutoDetectParser().parse(getBinaryContent("test-ocr.png"), handler, new Metadata(), new ParseContext());
> > > > > > > >     System.out.println("handler = " + handler);
> > > > > > > > }
> > > > > > > >
> > > > > > > > Here are my logs:
> > > > > > > >
> > > > > > > > 16:31:13,089 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> > > > > > > > 16:31:13,560 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> > > > > > > > 16:31:13,564 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
> > > > > > > > 16:31:13,591 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract]): true
> > > > > > > > 16:31:13,595 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: convert)
> > > > > > > > handler =
> > > > > > > >
> > > > > > > > The content is not extracted although Tesseract is detected.
> > > > > > > >
> > > > > > > > When I run Tesseract manually:
> > > > > > > >
> > > > > > > > tesseract test-ocr.png tess.out
> > > > > > > > cat tess.out.txt
> > > > > > > >
> > > > > > > > I'm getting:
> > > > > > > >
> > > > > > > > This file contains some words.
> > > > > > > >
> > > > > > > > tesseract --version gives
> > > > > > > >
> > > > > > > > tesseract 5.2.0
> > > > > > > > leptonica-1.82.0
> > > > > > > > libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.4 : libopenjp2 2.5.0
> > > > > > > > Found AVX2
> > > > > > > > Found AVX
> > > > > > > > Found FMA
> > > > > > > > Found SSE4.1
> > > > > > > > Found libarchive 3.6.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
> > > > > > > > Found libcurl/7.79.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.45.1
> > > > > > > >
> > > > > > > > What am I missing here?
> > > > > > > >
> > > > > > > > David
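
The diagnosis in this thread comes from reading the X-TIKA:Parsed-By values: GDALParser is listed and no OCR parser is. As a rough sketch of that check (not part of Tika), the helper below takes those values as plain strings and reports whether GDAL handled the file without Tesseract. `ParsedByCheck`, `handledBy` and `gdalShadowedOcr` are hypothetical names made up for illustration. In real code the list would presumably be built from `metadata.getValues("X-TIKA:Parsed-By")`, and here it is hard-coded so the snippet needs only the JDK.

```java
import java.util.List;

// Hypothetical helper for illustration only; it is not a Tika API.
final class ParsedByCheck {
    private ParsedByCheck() {}

    // True if any recorded parser class name ends with the given simple class name.
    static boolean handledBy(List<String> parsedBy, String simpleName) {
        for (String className : parsedBy) {
            if (className.equals(simpleName) || className.endsWith("." + simpleName)) {
                return true;
            }
        }
        return false;
    }

    // True when GDALParser is recorded but TesseractOCRParser is not,
    // which is the symptom described in this thread.
    static boolean gdalShadowedOcr(List<String> parsedBy) {
        return handledBy(parsedBy, "GDALParser")
                && !handledBy(parsedBy, "TesseractOCRParser");
    }
}
```

With the values posted above (`org.apache.tika.parser.DefaultParser` and `org.apache.tika.parser.gdal.GDALParser`), `ParsedByCheck.gdalShadowedOcr(...)` returns true. A list containing `org.apache.tika.parser.ocr.TesseractOCRParser` would return false.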
