Hi,
2 Questions:
--------------
While indexing a directory full of documents (PDF, ODT, etc.), I can see
with the 'top' command that tesseract / ImageMagick are launched for some
documents but not for others, even though every document contains images.
Moreover, none of the documents scanned by tesseract are indexed: the
search indexes stay empty.
The ones (with images too) for which tesseract has not been launched are
indexed with a single line in the 'content' field: \n \n \n ....etc
The tesseract configuration I set is correctly picked up by Solr.
Launching tesseract manually works well, though.
So:
- How can I trace what is going wrong in the whole process? I have
switched the log level to DEBUG on everything and don't see anything.
- What are the conditions for tesseract to be launched? If a PDF
contains a line of text plus an image, should tesseract be launched?
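To separate extraction from indexing, I assume the extract handler can be
called with extractOnly=true, which should return what Tika extracted
without adding anything to the index. A minimal Python sketch of such a
check follows; the base URL, the core name "techproducts" and the file
name are placeholders, not my actual values:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, core, extract_format="text"):
    """Build the Solr Cell URL for an extract-only request (nothing is indexed)."""
    params = urlencode({
        "extractOnly": "true",            # return the extracted content instead of indexing it
        "extractFormat": extract_format,  # "text" or "xml"
        "wt": "json",
    })
    return f"{base_url.rstrip('/')}/{core}/update/extract?{params}"


def extract_only(path, base_url="http://localhost:8983/solr", core="techproducts"):
    """POST one file to the extract handler and return the raw response body."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(base_url, core),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    # OCR can be slow, so allow a generous timeout (placeholder value).
    with urlopen(request, timeout=300) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "scanned.pdf" is a hypothetical file name.
    print(extract_only("scanned.pdf"))
```

If the response body contains only newlines here as well, the problem
would be on the Tika / tesseract side rather than in the indexing step.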
Below is my environment:
Environment:
--------------------
ubuntu 20.04 (lxd container)
openjdk 16
solr 8.8.2
core in standalone mode, with sample_techproducts_configs config
tesseract 4.1.1 with all langs
tesseract conf:
/opt/solr/server/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties:
# Tesseract properties
tesseractPath=
language=fra
pageSegMode=1
maxFileSizeToOcr=2147483647
minFileSizeToOcr=0
timeout=120
#txt or hocr
outputType=txt
preserveInterwordSpacing=false
# properties for image processing
# to enable processing, set enableImageProcessing to 1
enableImageProcessing=1
ImageMagickPath=
density=300
depth=4
colorspace=gray
filter=triangle
resize=900
applyRotation=false
imagemagick 6.9.10
Regards,
Gaétan