solr 8.8.2 / tesseract: launched but no document indexed

Gaétan QUENTIN Tue, 18 May 2021 03:16:01 -0700

Hi,


2 Questions:

--------------

While indexing a directory full of documents (PDF, ODT, etc.), I can see with the 'top' command that tesseract / imagick are launched for some documents and not for others, even though every document contains images.

Moreover, none of the documents scanned by tesseract are indexed: the search indexes stay empty.

The ones (images too) for which tesseract has not been launched are indexed with a single line in the 'content' field: \n \n \n ....etc


The tesseract configuration I set is correctly picked up by Solr.

Launching tesseract manually works fine, though.
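
For reference, a manual check of this kind can be scripted roughly as follows. This is only a minimal sketch: it assumes tesseract is on the PATH and reuses language=fra, pageSegMode=1 and timeout=120 from the configuration listed further down. The helper names and the image path passed on the command line are hypothetical.

```python
import shutil
import subprocess
import sys


def build_tesseract_cmd(image_path, language="fra", page_seg_mode=1):
    """Build a tesseract command line that prints recognized text to stdout."""
    return [
        "tesseract",
        image_path,
        "stdout",                      # output base "stdout": print instead of writing a file
        "-l", language,                # OCR language, as in the properties file
        "--psm", str(page_seg_mode),   # page segmentation mode
    ]


def run_tesseract(image_path, timeout=120):
    """Run tesseract on a single image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    print(run_tesseract(sys.argv[1]))
```

If this prints sensible text for an image that ends up empty in the index, the problem is more likely in the Solr/Tika side of the chain than in tesseract itself.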


So:

- How can I trace what is going wrong in the whole process? I have switched the log level to debug on everything and don't see anything.

- What are the conditions for tesseract to be launched? If a PDF contains a line of text plus an image, should tesseract be launched?
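
One possible way to narrow down the first question is to ask Solr for the extracted text without indexing it, using the extractOnly parameter of the extracting request handler. The sketch below assumes the handler is mounted at /update/extract (as in sample_techproducts_configs), that Solr listens on localhost:8983, and that the core is called "mycore"; the core name and the helper names are hypothetical.

```python
import json
import mimetypes
import sys
import urllib.parse
import urllib.request


def extract_only_url(base_url, core):
    """Build an /update/extract URL that returns the extracted text without indexing."""
    params = urllib.parse.urlencode({
        "extractOnly": "true",    # extract and return, do not add to the index
        "extractFormat": "text",  # plain text instead of the default XHTML
        "wt": "json",
    })
    return f"{base_url.rstrip('/')}/{core}/update/extract?{params}"


def extract_without_indexing(path, base_url="http://localhost:8983/solr", core="mycore"):
    """POST one file to Solr and return the parsed JSON response."""
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        extract_only_url(base_url, core),
        data=body,
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Dump whatever Solr returns (extracted text and metadata) for inspection.
    print(json.dumps(extract_without_indexing(sys.argv[1]), indent=2, ensure_ascii=False))
```

Comparing the output for a document where tesseract was launched with one where it was not should at least show whether the OCR text ever reaches Solr.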



Below is my environment:

 Environment:

--------------------

ubuntu 20.04 (lxd container)

openjdk 16

solr 8.8.2

core in standalone mode, with sample_techproducts_configs config

tesseract 4.1.1 with all langs

tesseract conf:

/opt/solr/server/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties:

# Tesseract properties
tesseractPath=
language=fra
pageSegMode=1
maxFileSizeToOcr=2147483647
minFileSizeToOcr=0
timeout=120
#txt or hocr
outputType=txt
preserveInterwordSpacing=false

# properties for image processing
# to enable processing, set enableImageProcessing to 1
enableImageProcessing=1
ImageMagickPath=
density=300
depth=4
colorspace=gray
filter=triangle
resize=900
applyRotation=false

imagemagick 6.9.10



Regards,

Gaétan
