FWIW, I now understand what I was missing that made me think Manifold was running Tika when it wasn't. It turns out that Alfresco uses Tika internally, and when you get a document from Alfresco (using the WebScripts connector, anyway) the set of fields you get includes all the image metadata and whatnot (for image files). I never realized this because I don't typically use Alfresco for images. But when I added extra logging to the Alfresco WebScripts connector code to print the incoming field set, I saw things like:

Found property exif:yResolution = 72.0
Found property cm:owner = admin
Found property exif:isoSpeedRatings = 400
Found property exif:fNumber = 3.5
Found property sys:node-uuid = 0516a5cc-fc04-4512-a4ed-b595b7c3908b
Found property exif:pixelYDimension = 2048
Found property exif:resolutionUnit = Inch
Found property exif:dateTimeOriginal = 2005-01-09T16:00:55Z
Found property sys:locale = en_GB

which explains why the Solr connector was trying to save fields like
exif_fNumber and exif_resolutionUnit. This came up because the Alfresco
instance I'm experimenting with has their default sample workspace,
which includes images and things I don't normally touch. :-)

As for managing all this so my history doesn't contain all those
failure messages, I thought about creating a "WhitelistFieldTransform"
as a transform connection to drop any fields other than the ones that
are whitelisted. Two questions:

1. Does this seem like a reasonable approach, or is there a better way?

2. If this is reasonable and I create such a filter, would there be
any interest in having it contributed back to MCF?

Cheers,

Phil

This message optimized for indexing by NSA PRISM

On Sun, Oct 15, 2017 at 10:11 AM, Karl Wright <[email protected]> wrote:
> Hi Phil,
>
> In most cases you can't modify the fields being output by the various
> connectors, but you don't have to use them. If you have an output connector
> that *insists* on using all of them in a destructive way, we'd like to know
> about that. Usually extra fields are harmless and only the ones you want in
> your schema are looked for.
>
> Karl
>
>
> On Sat, Oct 14, 2017 at 8:12 PM, Phillip Rhodes <[email protected]>
> wrote:
>>
>> On Sat, Oct 14, 2017 at 7:17 PM, Karl Wright <[email protected]> wrote:
>> > Hi Phil,
>> >
>> > You are correct in asserting that in MCF it is the sum total of all
>> > the connections that the document passes through that determine its
>> > attribute set. That includes transformation connections as well as
>> > the repository connection.
>>
>> OK, sounds good.
>>
>> > Tika is one connection that does add a lot of fields and these depend
>> > not only on the configuration of the Tika connection, but also on the
>> > kind of document being extracted. If you want to figure out the sum
>> > total of what's possible, you will need to consult the Tika
>> > documentation. And yes, the field names Tika generates are created
>> > based on what Tika finds in the document.
>>
>> Gotcha. So if I want to limit the fields output to *only* a specific
>> set that is determined in advance, is there a way to accomplish that?
>>
>> > Alternatively, you can configure your job to send output to a null
>> > output connection. This connection records all attribute information
>> > for each document in the simple history, so you can get an idea what
>> > to expect.
>>
>> Excellent, I'll investigate that.
>>
>> > I'm a little confused about your statement that Tika runs even when
>> > it's not in a job's pipeline. That's not actually true, so I'm
>> > wondering what you are seeing.
>>
>> It's probable that I'm wrong. I just thought maybe there was some
>> default behavior, because I pointed MCF at a directory full of PDF's
>> without explicitly configuring Tika and I saw fields in the output
>> that I thought were probably generated by Tika. Likewise now I am
>> running a pipeline with no explicit Tika step and I see output fields
>> for EXIF stuff for images and the like, which I assumed came from
>> Tika.
>>
>>
>>
>> Phil
>
>
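
For reference, the whitelist idea raised at the top of this message can be sketched in a few lines of plain Java. This is only an illustration of the filtering step under stated assumptions: it operates on an ordinary Map of field names to values, not on ManifoldCF's own document objects, and the names WhitelistFieldFilter and filter are hypothetical, not part of MCF. A real transformation connection would still need to be wired into the MCF connector framework.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, NOT part of ManifoldCF: keeps only whitelisted fields.
class WhitelistFieldFilter {

    // Returns a new map holding only the entries whose key is in `allowed`.
    // The input map is left untouched and insertion order is preserved.
    static Map<String, String> filter(Map<String, String> fields, Set<String> allowed) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            if (allowed.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample fields taken from the connector log output quoted above.
        Map<String, String> incoming = new LinkedHashMap<>();
        incoming.put("cm:owner", "admin");
        incoming.put("exif:fNumber", "3.5");
        incoming.put("exif:resolutionUnit", "Inch");
        incoming.put("sys:node-uuid", "0516a5cc-fc04-4512-a4ed-b595b7c3908b");

        // Assumed whitelist: only these two fields should reach the output.
        Set<String> allowed = new HashSet<>(Arrays.asList("cm:owner", "sys:node-uuid"));

        // Print the names of the fields that survive the filter.
        System.out.println(filter(incoming, allowed).keySet());
    }
}
```

Building a new map rather than removing entries in place keeps the sketch free of side effects on the incoming field set; whether a real connector should copy or mutate the document is a separate design question.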
