Wait, no, the clone method won't work, because, again, you'd have to
find the TesseractOCRParser in order to call clone, which is possible,
but annoying.

So, I _guess_ the solution is to keep track of what values were
changed from default in the TesseractOCRConfig.  This will add code
and functionality that no one has ever requested...do we need this?
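
A minimal sketch of that idea, for illustration only. SketchOcrConfig, overlay() and the userSet field are made-up names, not the real TesseractOCRConfig API; the dpi default of 100 and the empty tesseract path are taken from the example quoted below. Each setter records that it was called, so the parser-side default can be overlaid with only the values the caller explicitly set at parse time.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for TesseractOCRConfig; not the real Tika class.
class SketchOcrConfig {
    private int dpi = 100;              // default used in the example in this thread
    private String tesseractPath = "";  // empty path is the default

    // Names of the fields whose setters were called explicitly.
    private final Set<String> userSet = new HashSet<>();

    public void setDpi(int dpi) {
        this.dpi = dpi;
        userSet.add("dpi");
    }

    public void setTesseractPath(String tesseractPath) {
        this.tesseractPath = tesseractPath;
        userSet.add("tesseractPath");
    }

    public int getDpi() {
        return dpi;
    }

    public String getTesseractPath() {
        return tesseractPath;
    }

    // Returns a copy of this config with only the explicitly set fields
    // of 'runtime' applied on top. Neither input is modified.
    public SketchOcrConfig overlay(SketchOcrConfig runtime) {
        SketchOcrConfig merged = new SketchOcrConfig();
        merged.dpi = this.dpi;
        merged.tesseractPath = this.tesseractPath;
        merged.userSet.addAll(this.userSet);
        if (runtime.userSet.contains("dpi")) {
            merged.setDpi(runtime.dpi);
        }
        if (runtime.userSet.contains("tesseractPath")) {
            merged.setTesseractPath(runtime.tesseractPath);
        }
        return merged;
    }
}
```

With a parser default of dpi=200 plus a configured tesseract path, and a runtime config on which only setDpi(100) was called, overlay() gives dpi=100 and keeps the configured path. The cost is one extra line of bookkeeping in every setter.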

I'm open to other solutions.

On Mon, Feb 8, 2021 at 9:53 PM Tim Allison <[email protected]> wrote:
>
> >property file and tika-config interact.
>
> If you use a tika-config, the parameters are set from there.  If you
> don't, we fall back to the property file.
>
> If you look in the TesseractOCRParser, there's a "defaultConfig".
> That is intended to be loaded and configured shortly after
> initialization and is to be used as the default config if the user
> does not otherwise pass in an OCRConfig at parse time.  That
> "internal" config is effectively static and can be used across threads
> because, under normal circumstances, it is never changed once it has
> been set shortly after initialization.  As above, it is either set by
> a tika-config file or by the properties shortly after initialization.
>
> > you can programmatically change parameter values
> Yes. If you programmatically call the setters on the parser, that will
> change the underlying defaultConfig...as you'd expect. And those
> changes will go into effect across all threads for that parser.  You
> will only change the values that you call. Everything else from the
> original initialization will be unchanged.  There's no great way to
> find the parser in the AutoDetectParser...  So, basically, don't do
> this.
>
> >or pass in a tika-config to the parser which is set in the parseContext, 
> >right.
> Uh, TesseractOCRConfig, right? At parse time, the OCRParser has an
> internal default config that was set as described above.  If you then
> pass in a new tessconfig at parse time via the parsecontext, it will
> use that _instead_ of the internal config that was set shortly after
> initialization.
>
> If you want to add a "clone" method or similar or a "getConfig" to
> TesseractParser, that might work.  You'd get the default tessconfig
> (which was set via the tika-config file at initialization), clone it,
> modify it and then send it into a given parse at parse time via the
> ParseContext.  Something like that should work.
>
> As our code is currently set up... (i.e. I acknowledge there is always
> room for improvement), let's say the parameter is dpi, and the default
> is 100.
>
> If you set "dpi" to 200 in your tika-config.xml file, then the
> internal tessconfig will be 200.  Now let's say at parse time, you
> want to go back to the default...so you set dpi on a new tessconfig to
> 100 and then send that in via the parsecontext.  We don't currently
> have the code in place to know that you only changed one parameter in
> the tessconfig.  So, how would we know to overwrite that one value,
> but not, say, the empty path to tesseract or any of the other default
> values?
>
>
> On Mon, Feb 8, 2021 at 9:35 PM Peter Kronenberg
> <[email protected]> wrote:
> >
> > I still don't get how the property file and tika-config interact.  When you 
> > say an internal tessConfig I assume you mean the one that is packaged with 
> > tika, which could be replaced by another file in the same package (which is 
> > essentially what I'm doing now)
> >
> > Then, at runtime, you can programmatically change parameter values or pass 
> > in a tika-config to the parser which is set in the parseContext, right?  So
> > wouldn't that simply override any values in the current config?  I don't
> > understand how this would cause the default values to re-appear.
> >
> > ________________________________
> > From: Tim Allison <[email protected]>
> > Sent: Monday, February 8, 2021 9:25 PM
> > To: Peter Kronenberg <[email protected]>
> > Cc: [email protected] <[email protected]>
> > Subject: Re: Tika-config
> >
> > sorry an "internal tessconfig"
> >
> > On Mon, Feb 8, 2021 at 9:23 PM Tim Allison <[email protected]> wrote:
> > >
> > > Let's say you have an internal tessconfig file in the parser that
> > > you've configured through a tikaconfig.  When, at runtime, you send in
> > > a new tessconfig via the parsecontext, how can we tell which
> > > parameters you want to change from the new tessconfig?
> > >
> > > Yes, I realize that it would be possible to keep track of what
> > > parameters have been changed in the runtime config and then do
> > > something smart, but this hasn't been an issue to date.
> > >
> > > On Mon, Feb 8, 2021 at 9:15 PM Peter Kronenberg
> > > <[email protected]> wrote:
> > > >
> > > > > Ok, your last point might be the issue. If I don't set it in
> > > > > tesseractOCRConfig, then setting it in tika-config has no effect? I'm
> > > > > not sure I understand the thinking or logic behind this.
> > > >
> > > >
> > > > ________________________________
> > > > From: Tim Allison <[email protected]>
> > > > Sent: Monday, February 8, 2021 8:47:07 PM
> > > > To: Peter Kronenberg <[email protected]>; [email protected] 
> > > > <[email protected]>
> > > > Subject: Re: Tika-config
> > > >
> > > > > I regret that I'm not able to reproduce this...that is, this works for
> > > > > me:
> > > >
> > > > @Test
> > > > public void oneOff() throws Exception {
> > > >     System.setProperty("tika.config", "C:\\users\\talli\\myconfig.xml");
> > > >     TikaConfig config = new TikaConfig();
> > > >     AutoDetectParser parser = new AutoDetectParser(config);
> > > >     assertContains("quick brown fox", getXML("testOCR_spacing.png", 
> > > > parser).xml);
> > > > }
> > > >
> > > >
> > > > where myconfig.xml is:
> > > > <?xml version="1.0" encoding="UTF-8"?>
> > > > <properties>
> > > >     <parsers>
> > > >         <parser class="org.apache.tika.parser.DefaultParser">
> > > >         </parser>
> > > >
> > > >         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
> > > >             <params>
> > > >                 <param name="tesseractPath" type="string">C:\Program 
> > > > Files\Tesseract-OCR2</param>
> > > >                 <param name="tessdataPath" type="string">C:\Program 
> > > > Files\Tesseract-OCR2\tessdata</param>
> > > >             </params>
> > > >         </parser>
> > > >     </parsers>
> > > > </properties>
> > > >
> > > > Whatever you set in your tessConfig will _override_ the underlying 
> > > > settings of the parser...all of them.  So, if you aren't setting the 
> > > > > path there, then, yes, you won't see any effect.
> > > >
> > > > On Mon, Feb 8, 2021 at 5:35 PM Peter Kronenberg 
> > > > <[email protected]> wrote:
> > > >
> > > > Like this.
> > > >
> > > >
> > > >         TikaConfig tikaConfig = new TikaConfig();
> > > >
> > > >         final AutoDetectParser parser = new 
> > > > AutoDetectParser(tikaConfig);
> > > >
> > > >         final ParseContext parseContext = new ParseContext();
> > > >
> > > >         parseContext.set(AutoDetectParser.class, parser);
> > > >         parseContext.set(PDFParserConfig.class, pdfConfig);
> > > >         parseContext.set(TesseractOCRConfig.class, tessConfig);
> > > >
> > > > -----Original Message-----
> > > > From: Tim Allison <[email protected]>
> > > > Sent: Monday, February 8, 2021 5:31 PM
> > > > To: [email protected]
> > > > Subject: Re: Tika-config
> > > >
> > > > How are you using the TikaConfig?
> > > >
> > > > On Mon, Feb 8, 2021 at 4:11 PM Peter Kronenberg 
> > > > <[email protected]> wrote:
> > > > >
> > > > > What is wrong with this?
> > > > >
> > > > > I specified the tika-config env variable.  I know it works because if
> > > > > I make a syntax error in the tika-config.xml, it complains.  So it’s
> > > > > finding the file.  But it’s not applying the properties
> > > > >
> > > > >
> > > > >
> > > > > I have this tika-config.  I tried forward slashes instead of the 
> > > > > double backslashes.  Same result.  No errors.  It’s just not applying 
> > > > > the values.
> > > > >
> > > > >
> > > > >
> > > > > > <?xml version="1.0" encoding="UTF-8"?>
> > > > > > <properties>
> > > > >     <parsers>
> > > > >         <parser class="org.apache.tika.parser.DefaultParser">
> > > > >         </parser>
> > > > >
> > > > >         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
> > > > >             <params>
> > > > >                 <param name="tesseractPath" 
> > > > > type="string">c:\\tesseract_config</param>
> > > > >                 <param name="tessdataPath" 
> > > > > type="string">c:\\tessdata_config</param>
> > > > >             </params>
> > > > >         </parser>
> > > > >     </parsers>
> > > > > </properties>
> > > > >
> > > > >
