On Sat, Oct 14, 2017 at 7:17 PM, Karl Wright <[email protected]> wrote:
> Hi Phil,
>
> You are correct in asserting that in MCF it is the sum total of all the
> connections that the document passes through that determine its attribute
> set. That includes transformation connections as well as the repository
> connection.

OK, sounds good.

> Tika is one connection that does add a lot of fields and these depend not
> only on the configuration of the Tika connection, but also on the kind of
> document being extracted. If you want to figure out the sum total of what's
> possible, you will need to consult the Tika documentation. And yes, the
> field names Tika generates are created based on what Tika finds in the
> document.

Gotcha. So if I want to limit the output fields to *only* a specific set
that is determined in advance, is there a way to accomplish that?

> Alternatively, you can configure your job to send output to a null output
> connection. This connection records all attribute information for each
> document in the simple history, so you can get an idea what to expect.

Excellent, I'll investigate that.

> I'm a little confused about your statement that Tika runs even when it's not
> in a job's pipeline. That's not actually true, so I'm wondering what you
> are seeing.

It's probable that I'm wrong. I just thought maybe there was some default
behavior, because I pointed MCF at a directory full of PDFs without
explicitly configuring Tika and I saw fields in the output that I thought
were probably generated by Tika. Likewise, I am now running a pipeline with
no explicit Tika step and I see output fields for EXIF data for images and
the like, which I assumed came from Tika.

Phil
