Cc'ing Tika...

On Tue, May 18, 2021 at 6:16 AM Gaétan QUENTIN@Work <work.gaetan.quen...@gmail.com> wrote:
> Hi,
>
> 2 Questions:
> --------------
>
> While indexing a directory full of documents (PDF, ODT, etc.), I can see
> with 'top' that tesseract / ImageMagick are launched for some documents
> but not for others, even though every document contains images.
>
> Moreover, none of the documents scanned by tesseract are indexed: the
> search indexes stay empty.
>
> The ones (images too) for which tesseract has not been launched are
> indexed with a single line in the 'content' field: \n \n \n ....etc
>
> The tesseract configuration I set is correctly picked up by Solr.
>
> Launching tesseract manually works fine, though.
>
> So:
>
> - How can I trace what is going wrong in the whole process? I have
>   switched the log level to debug on everything and don't see anything.
>
> - What are the conditions for tesseract to be launched? If a PDF
>   contains a line of text + an image, should tesseract be launched?
>
> Below is my environment:
>
> Environment:
> --------------------
> ubuntu 20.04 (lxd container)
> openjdk 16
> solr 8.8.2
> core in standalone mode, with sample_techproducts_configs config
> tesseract 4.1.1 with all langs
>
> tesseract conf:
>
> /opt/solr/server/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties:
>
> # Tesseract properties
> tesseractPath=
> language=fra
> pageSegMode=1
> maxFileSizeToOcr=2147483647
> minFileSizeToOcr=0
> timeout=120
> #txt or hocr
> outputType=txt
> preserveInterwordSpacing=false
>
> # properties for image processing
> # to enable processing, set enableImageProcessing to 1
> enableImageProcessing=1
> ImageMagickPath=
> density=300
> depth=4
> colorspace=gray
> filter=triangle
> resize=900
> applyRotation=false
>
> imagemagick 6.9.10
>
> Regards,
>
> Gaétan
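One way to narrow this down outside Solr is to rebuild a tesseract command line from the same properties file and run it by hand on an extracted image. The sketch below is only an illustration: the helper names `parse_properties` and `build_tesseract_command` and the file name `page.png` are hypothetical, and the command it builds is an assumption about what is roughly equivalent to the configured settings, not a confirmed copy of what Tika executes.

```python
import os
import shlex


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def build_tesseract_command(props, image_path):
    """Build a tesseract CLI call from the parsed properties.

    Assumption: an empty tesseractPath means 'find tesseract on PATH'.
    """
    tess_dir = props.get("tesseractPath", "")
    exe = os.path.join(tess_dir, "tesseract") if tess_dir else "tesseract"
    cmd = [exe, image_path, "stdout"]  # 'stdout' prints the OCR text
    if props.get("language"):
        cmd += ["-l", props["language"]]
    if props.get("pageSegMode"):
        cmd += ["--psm", props["pageSegMode"]]
    return cmd


# Subset of the properties quoted in the message above.
SAMPLE = """
# Tesseract properties
tesseractPath=
language=fra
pageSegMode=1
timeout=120
outputType=txt
"""

if __name__ == "__main__":
    props = parse_properties(SAMPLE)
    # page.png is a hypothetical image extracted from one of the documents.
    print(shlex.join(build_tesseract_command(props, "page.png")))
```

If the printed command produces text when run as the Solr user but the index stays empty, the problem is more likely in how the parser is invoked than in tesseract itself.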