Re: How to determine the set of all possible fields in MCF output?

Phillip Rhodes Sat, 14 Oct 2017 17:12:47 -0700

On Sat, Oct 14, 2017 at 7:17 PM, Karl Wright <[email protected]> wrote:
> Hi Phil,
>
> You are correct in asserting that in MCF it is the sum total of all the
> connections that the document passes through that determine its attribute
> set.  That includes transformation connections as well as the repository
> connection.


OK, sounds good.

> Tika is one connection that does add a lot of fields and these depend not
> only on the configuration of the Tika connection, but also on the kind of
> document being extracted.  If you want to figure out the sum total of what's
> possible, you will need to consult the Tika documentation.  And yes, the
> field names Tika generates are created based on what Tika finds in the
> document.

Gotcha.  So if I want to limit the output to *only* a specific set of
fields that is determined in advance, is there a way to accomplish that?

> Alternatively, you can configure your job to send output to a null output
> connection.  This connection records all attribute information for each
> document in the simple history, so you can get an idea what to expect.

Excellent, I'll investigate that.
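
As a rough sketch of what that investigation could look like once the
attribute information has been read out of the simple history: the snippet
below assumes the per-document attributes have already been copied by hand
into plain Python dicts.  The sample data and the function name are
hypothetical; nothing here is MCF output or an MCF API.  It only shows how to
get the union of field names and how often each one appears.

```python
from collections import Counter


def summarize_fields(documents):
    """Count how many documents carry each field name.

    `documents` is an iterable of dicts, one per document, mapping
    field name -> value (hypothetical format, transcribed by hand).
    """
    counts = Counter()
    for attributes in documents:
        # Each field name counts once per document that carries it.
        counts.update(attributes.keys())
    return counts


# Hypothetical sample data, standing in for attributes seen in the history.
sample_documents = [
    {"title": "Report", "author": "A. Writer", "pdf_version": "1.4"},
    {"title": "Photo", "exif_image_width": "1024"},
    {"title": "Notes"},
]

field_counts = summarize_fields(sample_documents)

# The union of every field name observed across the sample.
all_fields = sorted(field_counts)

for name in all_fields:
    print(f"{name}: present in {field_counts[name]} document(s)")
```

Fields that appear in only a few documents are the ones most likely to
depend on the document type rather than on the connection configuration.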

> I'm a little confused about your statement that Tika runs even when it's not
> in a job's pipeline.  That's not actually true, so I'm wondering what you
> are seeing.

It's probable that I'm wrong.  I just thought maybe there was some
default behavior, because I pointed MCF at a directory full of PDFs
without explicitly configuring Tika, and I saw fields in the output
that I thought were probably generated by Tika.  Likewise, I am now
running a pipeline with no explicit Tika step, and I see output fields
for EXIF metadata on images and the like, which I assumed came from
Tika.



Phil
