Hi,

2 Questions:

--------------


While indexing a directory full of documents (PDF, ODT, etc.), I can see with 'top' that tesseract and ImageMagick are launched for some documents but not for others, even though every document contains images.

Moreover, none of the documents scanned by tesseract are indexed: the search index stays empty.

The documents (images included) for which tesseract was not launched are indexed with a single line in the 'content' field: \n \n \n ... etc.

The tesseract configuration I set is correctly picked up by Solr.

Launching tesseract manually works fine, though.
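
For reference, a manual run of the kind I mean can be sketched as below. This is only an illustration: the file name scan-page-1.png is made up, and I am assuming that the language and pageSegMode properties from my configuration below correspond to the tesseract command-line options -l and --psm.

```python
import shlex


def tesseract_command(image_path, language="fra", page_seg_mode=1, output_base="stdout"):
    """Build a Tesseract 4 command line for a single image.

    Assumption: language -> -l and pageSegMode -> --psm.
    With output_base="stdout", tesseract prints the recognised text
    instead of writing an output file.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-l", language,
        "--psm", str(page_seg_mode),
    ]


if __name__ == "__main__":
    # scan-page-1.png is a hypothetical file name used for illustration only.
    print(shlex.join(tesseract_command("scan-page-1.png")))
```

The function only builds the argument list; passing it to subprocess.run would actually start tesseract.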


So:

- How can I trace what is going wrong in the whole process? I have switched the log level to debug on everything and don't see anything.

- What are the conditions for tesseract to be launched? If a PDF contains a line of text plus an image, should tesseract be launched?
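
One check that might help with the tracing, sketched under assumptions: Solr's extract handler accepts extractOnly=true, which returns what Tika extracted instead of indexing it. The core name mycore, the host and the port are placeholders, and I am assuming the handler is registered at /update/extract as in sample_techproducts_configs.

```python
from urllib.parse import urlencode


def extract_only_url(core, base_url="http://localhost:8983/solr"):
    """Build the URL for an extract-only request to Solr's extract handler.

    Assumptions: the handler lives at /update/extract, and base_url
    points at a local Solr on the default port.
    """
    params = {
        "extractOnly": "true",    # return the extracted content, do not index it
        "extractFormat": "text",  # plain text instead of the default XHTML
        "wt": "json",
    }
    return f"{base_url}/{core}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # "mycore" is a hypothetical core name; the document itself would then be
    # POSTed to this URL as a multipart file upload.
    print(extract_only_url("mycore"))
```

If the response for a scanned PDF contains only newlines, the problem would be on the extraction side rather than in the indexing step.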


Below is my environment:

--------------------

ubuntu 20.04 (lxd container)

openjdk 16

solr 8.8.2

core in standalone mode, with sample_techproducts_configs config

tesseract 4.1.1 with all langs

tesseract conf:

/opt/solr/server/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties:

# Tesseract properties
tesseractPath=
language=fra
pageSegMode=1
maxFileSizeToOcr=2147483647
minFileSizeToOcr=0
timeout=120
#txt or hocr
outputType=txt
preserveInterwordSpacing=false

# properties for image processing
# to enable processing, set enableImageProcessing to 1
enableImageProcessing=1
ImageMagickPath=
density=300
depth=4
colorspace=gray
filter=triangle
resize=900
applyRotation=false

imagemagick 6.9.10



Regards,

Gaétan
