I'll have to play with this, but it seems totally counter-intuitive. If I set a parameter in tika-config, but I don't set it in tesseractConfig, then it will ignore my setting and just use the default?
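
To check that I understand the behavior you describe, here is a minimal sketch in plain Java. It does not use Tika at all: OcrConfig and OcrParser are hypothetical stand-ins I made up, and the only numbers in it are the resize values from your note (100 in the XML, 900 as the overall default). The assumption is that a config passed in through the ParseContext replaces the XML-derived one wholesale instead of being merged with it.

```java
public class OverrideSketch {
    // Overall default quoted in the thread; everything else here is illustrative.
    static final int BUILT_IN_RESIZE = 900;

    /** Hypothetical stand-in for a per-parse OCR config. Not Tika's class. */
    static class OcrConfig {
        int resize = BUILT_IN_RESIZE; // a fresh config starts from the built-in default
    }

    /** Hypothetical stand-in for a parser that was configured from the XML file. */
    static class OcrParser {
        private final OcrConfig fromXml;

        OcrParser(OcrConfig fromXml) {
            this.fromXml = fromXml;
        }

        // Replace semantics (assumed): if the caller supplies a config, it is used
        // as a whole and the XML-derived config is not consulted field by field.
        int effectiveResize(OcrConfig fromContext) {
            OcrConfig active = (fromContext != null) ? fromContext : fromXml;
            return active.resize;
        }
    }

    public static void main(String[] args) {
        OcrConfig xml = new OcrConfig();
        xml.resize = 100; // what the XML file asked for
        OcrParser parser = new OcrParser(xml);

        // No per-parse config: the XML value is used.
        System.out.println(parser.effectiveResize(null)); // 100

        // Per-parse config that never touched resize: back to the built-in default.
        System.out.println(parser.effectiveResize(new OcrConfig())); // 900
    }
}
```

If that is right, then anything I want to keep from tika-config has to be set again on the TesseractOCRConfig I pass in.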
________________________________
From: Tim Allison <[email protected]>
Sent: Monday, February 8, 2021, 8:52 PM
To: Peter Kronenberg; [email protected]
Subject: Re: Tika-config

Looks like I forgot to reply to the list on one of your earlier emails. This still holds:

bq. One thing to note is that if you set params programmatically, Tika will ignore the default settings that you made in TikaConfig. It will only read the config from what you pass in via the ParseContext. So, if in your tikaconfig.xml you set 'resize' to 100, and then you _don't_ set it in the TesseractConfig that you send in via the ParseContext, it will revert to the overall default of 900

On Mon, Feb 8, 2021 at 8:47 PM Tim Allison <[email protected]> wrote:
>
> I regret that I'm not able to reproduce this...that is, this works for me;
>
> @Test
> public void oneOff() throws Exception {
>     System.setProperty("tika.config", "C:\\users\\talli\\myconfig.xml");
>     TikaConfig config = new TikaConfig();
>     AutoDetectParser parser = new AutoDetectParser(config);
>     assertContains("quick brown fox", getXML("testOCR_spacing.png",
>             parser).xml);
> }
>
>
> where myconfig.xml is:
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>     </parser>
>
>     <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
>       <params>
>         <param name="tesseractPath" type="string">C:\Program Files\Tesseract-OCR2</param>
>         <param name="tessdataPath" type="string">C:\Program Files\Tesseract-OCR2\tessdata</param>
>       </params>
>     </parser>
>   </parsers>
> </properties>
>
> Whatever you set in your tessConfig will _override_ the underlying settings
> of the parser...all of them. So, if you aren't setting the path there, then,
> y, you won't see any effect.
>
> On Mon, Feb 8, 2021 at 5:35 PM Peter Kronenberg <[email protected]>
> wrote:
>>
>> Like this.
>>
>>
>> TikaConfig tikaConfig = new TikaConfig();
>>
>> final AutoDetectParser parser = new AutoDetectParser(tikaConfig);
>>
>> final ParseContext parseContext = new ParseContext();
>>
>> parseContext.set(AutoDetectParser.class, parser);
>> parseContext.set(PDFParserConfig.class, pdfConfig);
>> parseContext.set(TesseractOCRConfig.class, tessConfig);
>>
>> -----Original Message-----
>> From: Tim Allison <[email protected]>
>> Sent: Monday, February 8, 2021 5:31 PM
>> To: [email protected]
>> Subject: Re: Tika-config
>>
>> How are you using the TikaConfig?
>>
>> On Mon, Feb 8, 2021 at 4:11 PM Peter Kronenberg <[email protected]>
>> wrote:
>> >
>> > What is wrong with this?
>> >
>> > I specified the tika-config env variable. I know it works because if
>> > I make a syntax error in the tika-config.xml, it complains. So it’s
>> > finding the file. But it’s not applying the properties
>> >
>> >
>> >
>> > I have this tika-config. I tried forward slashes instead of the double
>> > backslashes. Same result. No errors. It’s just not applying the values.
>> >
>> >
>> >
>> > <?xml version="1.0" encoding="UTF-8"?>
>> > <properties>
>> >   <parsers>
>> >     <parser class="org.apache.tika.parser.DefaultParser">
>> >     </parser>
>> >
>> >     <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
>> >       <params>
>> >         <param name="tesseractPath" type="string">c:\\tesseract_config</param>
>> >         <param name="tessdataPath" type="string">c:\\tessdata_config</param>
>> >       </params>
>> >     </parser>
>> >   </parsers>
>> > </properties>
>> >
>> >
