I'm gonna have to read this again slowly.
But you corrected me when I said that tika-config is set in the parser 
context. But doesn't it get passed on by virtue of being set on the autodetect 
parser?

I haven't looked at the code yet to see the defaultConfig, but isn't the 
internal tesseractOCRConfig always used? Isn't that always the default?

________________________________
From: Tim Allison <[email protected]>
Sent: Monday, February 8, 2021 9:53 PM
To: [email protected] <[email protected]>
Subject: Re: Tika-config

>property file and tika-config interact.

If you use a tika-config, the parameters are set from there.  If you
don't, we fall back to the property file.

If you look in the TesseractOCRParser, there's a "defaultConfig".
That is intended to be loaded and configured shortly after
initialization and is to be used as the default config if the user
does not otherwise pass in an OCRConfig at parse time.  That
"internal" config is effectively static and can be used across threads
because, under normal circumstances, it is never changed after
initialization.  As above, it is either set by a tika-config file or
by the properties shortly after initialization.

> you can programmatically change parameter values
Yes. If you programmatically call the setters on the parser, that will
change the underlying defaultConfig...as you'd expect. And those
changes will go into effect across all threads for that parser.  You
will only change the values that you call. Everything else from the
original initialization will be unchanged.  There's no great way to
find the parser in the AutoDetectParser...  So, basically, don't do
this.

>or pass in a tika-config to the parser which is set in the parseContext, right.
Uh, TesseractOCRConfig, right? At parse time, the OCRParser has an
internal default config that was set as described above.  If you then
pass in a new tessconfig at parse time via the parsecontext, it will
use that _instead_ of the internal config that was set shortly after
initialization.

If you want to add a "clone" method or similar or a "getConfig" to
TesseractParser, that might work.  You'd get the default tessconfig
(which was set via the tika-config file at initialization), clone it,
modify it and then send it into a given parse at parse time via the
ParseContext.  Something like that should work.
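
To make the clone-and-modify idea concrete, here is a minimal sketch in
plain Java. It does not use Tika's classes: OcrConfig, copy() and
perParseConfig() are hypothetical stand-ins (no such "clone" or
"getConfig" exists on the parser today, as noted above), and dpi is
only an illustrative parameter.

```java
// Hypothetical model of "get the default config, clone it, modify it".
// None of these names are Tika APIs.
final class CloneSketch {

    // Stand-in for an OCR config object; NOT Tika's TesseractOCRConfig.
    static final class OcrConfig {
        String tesseractPath = ""; // illustrative class default: empty path
        int dpi = 100;             // illustrative class default

        // Hypothetical copy method (the "clone" suggested in the text).
        OcrConfig copy() {
            OcrConfig c = new OcrConfig();
            c.tesseractPath = tesseractPath;
            c.dpi = dpi;
            return c;
        }
    }

    // Start from the parser's default config and change only dpi.
    // The default itself is left untouched.
    static OcrConfig perParseConfig(OcrConfig parserDefault, int dpi) {
        OcrConfig c = parserDefault.copy();
        c.dpi = dpi;
        return c;
    }

    public static void main(String[] args) {
        OcrConfig parserDefault = new OcrConfig();
        parserDefault.tesseractPath = "C:\\Program Files\\Tesseract-OCR2";
        parserDefault.dpi = 200;

        OcrConfig forThisParse = perParseConfig(parserDefault, 300);
        System.out.println(forThisParse.tesseractPath + " / " + forThisParse.dpi);
    }
}
```

In real use the modified copy would be the object handed to the
ParseContext for that one parse. Because the parser's own default is
never mutated, other threads sharing the parser are unaffected.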

As our code is currently set up (i.e., I acknowledge there is always
room for improvement), let's say the parameter is dpi, and the default
is 100.

If you set "dpi" to 200 in your tika-config.xml file, then the
internal tessconfig will be 200.  Now let's say at parse time, you
want to go back to the default...so you set dpi on a new tessconfig to
100 and then send that in via the parsecontext.  We don't currently
have the code in place to know that you only changed one parameter in
the tessconfig.  So, how would we know to overwrite that one value,
but not, say, the empty path to tesseract or any of the other default
values?
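
The dpi scenario above can be sketched as follows. This is a plain-Java
model of the behavior described in this message, not Tika code:
OcrConfig and effective() are hypothetical names, and the defaults
(dpi 100, empty tesseract path) are the ones assumed in the example.

```java
// Hypothetical model of "the parse-time config is used INSTEAD of the
// internal one". None of these names are Tika APIs.
final class ReplaceSketch {

    // Stand-in for an OCR config object; NOT Tika's TesseractOCRConfig.
    static final class OcrConfig {
        String tesseractPath = ""; // assumed class default: empty path
        int dpi = 100;             // assumed class default from the example
    }

    // Wholesale replacement: if a config arrives at parse time, it wins
    // entirely. There is no per-parameter merge with the internal config.
    static OcrConfig effective(OcrConfig internal, OcrConfig fromParseContext) {
        return fromParseContext != null ? fromParseContext : internal;
    }

    public static void main(String[] args) {
        // Internal config as initialized from a tika-config file.
        OcrConfig internal = new OcrConfig();
        internal.tesseractPath = "C:\\Program Files\\Tesseract-OCR2";
        internal.dpi = 200;

        // Parse-time config where only dpi was set explicitly.
        OcrConfig atParse = new OcrConfig();
        atParse.dpi = 100;

        OcrConfig used = effective(internal, atParse);
        System.out.println("dpi=" + used.dpi + " path='" + used.tesseractPath + "'");
    }
}
```

In this model the path configured through the tika-config is lost,
because the parse-time object carries its own (empty) default for every
parameter that was not set on it, and nothing records which values were
set deliberately.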


On Mon, Feb 8, 2021 at 9:35 PM Peter Kronenberg
<[email protected]> wrote:
>
> I still don't get how the property file and tika-config interact.  When you 
> say an internal tessConfig I assume you mean the one that is packaged with 
> tika, which could be replaced by another file in the same package (which is 
> essentially what I'm doing now)
>
> Then, at runtime, you can programmatically change parameter values or pass in 
> a tika-config to the parser which is set in the parseContext, right.  So 
> wouldn't that simply override any values in the current config?  I don't 
> understand how this would cause the default values to re-appear
>
> ________________________________
> From: Tim Allison <[email protected]>
> Sent: Monday, February 8, 2021 9:25 PM
> To: Peter Kronenberg <[email protected]>
> Cc: [email protected] <[email protected]>
> Subject: Re: Tika-config
>
> sorry an "internal tessconfig"
>
> On Mon, Feb 8, 2021 at 9:23 PM Tim Allison <[email protected]> wrote:
> >
> > Let's say you have an internal tessconfig file in the parser that
> > you've configured through a tikaconfig.  When, at runtime, you send in
> > a new tessconfig via the parsecontext, how can we tell which
> > parameters you want to change from the new tessconfig?
> >
> > Yes, I realize that it would be possible to keep track of what
> > parameters have been changed in the runtime config and then do
> > something smart, but this hasn't been an issue to date.
> >
> > On Mon, Feb 8, 2021 at 9:15 PM Peter Kronenberg
> > <[email protected]> wrote:
> > >
> > > > Ok, your last point might be the issue. If I don't set it in 
> > > > tesseractOCRConfig, then setting it in tika-config has no effect? I'm not 
> > > > sure I understand the thinking or logic behind this.
> > >
> > >
> > > ________________________________
> > > From: Tim Allison <[email protected]>
> > > Sent: Monday, February 8, 2021 8:47:07 PM
> > > To: Peter Kronenberg <[email protected]>; [email protected] 
> > > <[email protected]>
> > > Subject: Re: Tika-config
> > >
> > > > I regret that I'm not able to reproduce this...that is, this works for me:
> > >
> > > @Test
> > > public void oneOff() throws Exception {
> > >     System.setProperty("tika.config", "C:\\users\\talli\\myconfig.xml");
> > >     TikaConfig config = new TikaConfig();
> > >     AutoDetectParser parser = new AutoDetectParser(config);
> > >     assertContains("quick brown fox", getXML("testOCR_spacing.png", 
> > > parser).xml);
> > > }
> > >
> > >
> > > where myconfig.xml is:
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <properties>
> > >     <parsers>
> > >         <parser class="org.apache.tika.parser.DefaultParser">
> > >         </parser>
> > >
> > >         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
> > >             <params>
> > >                 <param name="tesseractPath" type="string">C:\Program 
> > > Files\Tesseract-OCR2</param>
> > >                 <param name="tessdataPath" type="string">C:\Program 
> > > Files\Tesseract-OCR2\tessdata</param>
> > >             </params>
> > >         </parser>
> > >     </parsers>
> > > </properties>
> > >
> > > Whatever you set in your tessConfig will _override_ the underlying 
> > > settings of the parser...all of them.  So, if you aren't setting the path 
> > > > there, then, yes, you won't see any effect.
> > >
> > > On Mon, Feb 8, 2021 at 5:35 PM Peter Kronenberg 
> > > <[email protected]> wrote:
> > >
> > > Like this.
> > >
> > >
> > >         TikaConfig tikaConfig = new TikaConfig();
> > >
> > >         final AutoDetectParser parser = new AutoDetectParser(tikaConfig);
> > >
> > >         final ParseContext parseContext = new ParseContext();
> > >
> > >         parseContext.set(AutoDetectParser.class, parser);
> > >         parseContext.set(PDFParserConfig.class, pdfConfig);
> > >         parseContext.set(TesseractOCRConfig.class, tessConfig);
> > >
> > > -----Original Message-----
> > > From: Tim Allison <[email protected]>
> > > Sent: Monday, February 8, 2021 5:31 PM
> > > To: [email protected]
> > > Subject: Re: Tika-config
> > >
> > > How are you using the TikaConfig?
> > >
> > > On Mon, Feb 8, 2021 at 4:11 PM Peter Kronenberg 
> > > <[email protected]> wrote:
> > > >
> > > > What is wrong with this?
> > > >
> > > > I specified the tika-config env variable.  I know it works because if
> > > > I make a syntax error in the tika-config.xml, it complains.  So it’s
> > > > finding the file.  But it’s not applying the properties
> > > >
> > > >
> > > >
> > > > I have this tika-config.  I tried forward slashes instead of the double 
> > > > backslashes.  Same result.  No errors.  It’s just not applying the 
> > > > values.
> > > >
> > > >
> > > >
> > > > > <?xml version="1.0" encoding="UTF-8"?>
> > > > > <properties>
> > > >     <parsers>
> > > >         <parser class="org.apache.tika.parser.DefaultParser">
> > > >         </parser>
> > > >
> > > >         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
> > > >             <params>
> > > >                 <param name="tesseractPath" 
> > > > type="string">c:\\tesseract_config</param>
> > > >                 <param name="tessdataPath" 
> > > > type="string">c:\\tessdata_config</param>
> > > >             </params>
> > > >         </parser>
> > > >     </parsers>
> > > > </properties>
> > > >
> > > >
