Ok, your last point might be the issue. If I don't set it in tesseractOCRConfig, then setting it in tika-config has no effect? I'm not sure I understand the thinking or logic behind this.
________________________________
From: Tim Allison <[email protected]>
Sent: Monday, February 8, 2021 8:47:07 PM
To: Peter Kronenberg <[email protected]>; [email protected] <[email protected]>
Subject: Re: Tika-config

I regret that I'm not able to reproduce this...that is, this works for me;

    @Test
    public void oneOff() throws Exception {
        System.setProperty("tika.config", "C:\\users\\talli\\myconfig.xml");
        TikaConfig config = new TikaConfig();
        AutoDetectParser parser = new AutoDetectParser(config);
        assertContains("quick brown fox", getXML("testOCR_spacing.png", parser).xml);
    }

where myconfig.xml is:

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>
        <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
          <params>
            <param name="tesseractPath" type="string">C:\Program Files\Tesseract-OCR2</param>
            <param name="tessdataPath" type="string">C:\Program Files\Tesseract-OCR2\tessdata</param>
          </params>
        </parser>
      </parsers>
    </properties>

Whatever you set in your tessConfig will _override_ the underlying settings of the parser...all of them. So, if you aren't setting the path there, then, y, you won't see any effect.

On Mon, Feb 8, 2021 at 5:35 PM Peter Kronenberg <[email protected]> wrote:

Like this.

    TikaConfig tikaConfig = new TikaConfig();
    final AutoDetectParser parser = new AutoDetectParser(tikaConfig);
    final ParseContext parseContext = new ParseContext();
    parseContext.set(AutoDetectParser.class, parser);
    parseContext.set(PDFParserConfig.class, pdfConfig);
    parseContext.set(TesseractOCRConfig.class, tessConfig);

-----Original Message-----
From: Tim Allison <[email protected]>
Sent: Monday, February 8, 2021 5:31 PM
To: [email protected]
Subject: Re: Tika-config

How are you using the TikaConfig?

On Mon, Feb 8, 2021 at 4:11 PM Peter Kronenberg <[email protected]> wrote:
>
> What is wrong with this?
>
> I specified the tika-config env variable. I know it works because if
> I make a syntax error in the tika-config.xml, it complains. So it’s
> finding the file. But it’s not applying the properties
>
> I have this tika-config. I tried forward slashes instead of the double
> backslashes. Same result. No errors. It’s just not applying the values.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>     </parser>
>
>     <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
>       <params>
>         <param name="tesseractPath" type="string">c:\\tesseract_config</param>
>         <param name="tessdataPath" type="string">c:\\tessdata_config</param>
>       </params>
>     </parser>
>   </parsers>
> </properties>
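
Editor's note: the override behavior Tim describes in this thread (a config object placed in the ParseContext replaces the parser's XML-derived settings wholesale rather than being merged with them) can be illustrated with a small stand-alone model. This is only a sketch under that assumption. `OcrConfig`, `Context`, and `effectiveConfig` are hypothetical names invented for illustration; they are not Tika classes and do not reproduce Tika's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class OverrideSketch {

    /** Hypothetical stand-in for an OCR config object (not a Tika class). */
    static final class OcrConfig {
        final String tesseractPath;
        final String tessdataPath;

        OcrConfig(String tesseractPath, String tessdataPath) {
            this.tesseractPath = tesseractPath;
            this.tessdataPath = tessdataPath;
        }
    }

    /** Hypothetical stand-in for a per-parse context keyed by class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        /** Returns the stored value for key, or defaultValue if none was set. */
        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /**
     * Picks the config a parse would use in this model: the one from the
     * context if present, otherwise the one built from the XML file.
     * There is no field-by-field merge; it is all or nothing.
     */
    static OcrConfig effectiveConfig(OcrConfig fromXml, Context context) {
        return context.get(OcrConfig.class, fromXml);
    }

    public static void main(String[] args) {
        // Values mirror the paths in the tika-config.xml from the thread.
        OcrConfig fromXml = new OcrConfig("c:\\tesseract_config", "c:\\tessdata_config");

        // Case 1: nothing in the context, so the XML-derived config is used.
        Context empty = new Context();
        System.out.println("no override:   [" + effectiveConfig(fromXml, empty).tesseractPath + "]");

        // Case 2: a config with unset (empty) paths is placed in the context.
        // It wins outright, so the XML paths are never consulted.
        Context withOverride = new Context();
        withOverride.set(OcrConfig.class, new OcrConfig("", ""));
        System.out.println("with override: [" + effectiveConfig(fromXml, withOverride).tesseractPath + "]");
    }
}
```

If this model matches what Tika does, the practical options are either to set the paths on the TesseractOCRConfig that goes into the ParseContext, or to leave that config out of the ParseContext so the values from tika-config.xml apply.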
