Cc’ing Tika...

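On the tracing question below: one way to see what Tika actually extracts,
without touching the index, is to call the extract handler with
extractOnly=true and read the response. The following is only a minimal
sketch. The core name "mycore" and the base URL on the default port are
placeholders I made up, not values from the original message, and I have
not run it against this setup.

```python
import json
import urllib.parse
import urllib.request


def build_extract_url(solr_base, core, extract_format="text"):
    """Build a Solr Cell URL that extracts content without indexing it."""
    params = urllib.parse.urlencode({
        "extractOnly": "true",            # return extracted content, do not index
        "extractFormat": extract_format,  # "text" or "xml"
        "wt": "json",
    })
    return f"{solr_base.rstrip('/')}/{core}/update/extract?{params}"


def extract_only(path, solr_base="http://localhost:8983/solr", core="mycore"):
    """POST one file to the extract handler and return the parsed JSON reply.

    "mycore" and the base URL are hypothetical defaults; adjust both.
    """
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        build_extract_url(solr_base, core),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling extract_only("some-scan.pdf") on a document that triggered tesseract
and on one that did not, then comparing the returned text and metadata,
should show whether the OCR output is lost before or after it reaches Solr.
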
On Tue, May 18, 2021 at 6:16 AM Gaétan QUENTIN@Work <
work.gaetan.quen...@gmail.com> wrote:

> Hi,
>
>
> 2 Questions:
>
> --------------
>
>
> While indexing a directory full of documents (PDF, ODT, etc.), I can see
> with 'top' that tesseract and ImageMagick are launched for some documents
> but not for others, even though every document contains images.
>
> Moreover, none of the documents scanned by tesseract are indexed: the
> search indexes stay empty.
>
> The ones (images too) for which tesseract has not been launched are
> indexed with a single line in the 'content' field: \n \n \n ....etc
>
> The tesseract configuration I set is correctly picked up by Solr.
>
> Launching tesseract manually works fine, though.
>
>
> So:
>
> - How can I trace what is going wrong in the whole process? I have
> switched the log level to debug on everything and don't see anything.
>
> - What are the conditions for tesseract to be launched? If a PDF
> contains a line of text plus an image, should tesseract be launched?
>
>
> Below is my environment:
>
>   Environment:
>
> --------------------
>
> ubuntu 20.04 (lxd container)
>
> openjdk 16
>
> solr 8.8.2
>
> core in standalone mode, with sample_techproducts_configs config
>
> tesseract 4.1.1 with all langs
>
> tesseract conf:
>
>
> /opt/solr/server/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties:
>
> # Tesseract properties
> tesseractPath=
> language=fra
> pageSegMode=1
> maxFileSizeToOcr=2147483647
> minFileSizeToOcr=0
> timeout=120
> #txt or hocr
> outputType=txt
> preserveInterwordSpacing=false
>
> # properties for image processing
> # to enable processing, set enableImageProcessing to 1
> enableImageProcessing=1
> ImageMagickPath=
> density=300
> depth=4
> colorspace=gray
> filter=triangle
> resize=900
> applyRotation=false
>
> imagemagick 6.9.10
>
>
>
> Regards,
>
> Gaétan
>
>

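Since running tesseract by hand works, it may help to run it with the same
settings as the properties file quoted above and compare the result with
what ends up in Solr. The sketch below reads such a properties file and
builds an equivalent tesseract command line. It is an approximation under my
own assumptions, not the exact command Tika issues: the helper names are
made up, the ImageMagick preprocessing step (enableImageProcessing=1) is not
reproduced, and the output type is only appended for hocr because plain text
is tesseract's default.

```python
import os


def load_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def tesseract_command(props, image, output_base="stdout"):
    """Build a tesseract argument list from the parsed properties.

    Hypothetical helper: it mirrors the settings, not Tika's internals.
    """
    # An empty tesseractPath means "find tesseract on the PATH".
    exe = os.path.join(props.get("tesseractPath", ""), "tesseract")
    spacing = "1" if props.get("preserveInterwordSpacing") == "true" else "0"
    cmd = [
        exe, image, output_base,
        "-l", props.get("language") or "eng",
        "--psm", props.get("pageSegMode") or "1",
        "-c", f"preserve_interword_spaces={spacing}",
    ]
    if props.get("outputType") == "hocr":
        cmd.append("hocr")  # tesseract config file that switches output to hOCR
    return cmd
```

With the values from the quoted file this yields
tesseract IMAGE stdout -l fra --psm 1 -c preserve_interword_spaces=0,
which can be passed to subprocess.run with the configured timeout of 120
seconds.
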