Hmm... I tried the MetadataAdjuster before and unchecked "Keep all metadata" and it still seemed to send everything through. Probably I just did something wrong... I'll try it again.

Phil

This message optimized for indexing by NSA PRISM

On Tue, Oct 24, 2017 at 3:31 AM, Karl Wright <[email protected]> wrote:
> Hi Phil,
>
> Solr will certainly skip any fields that it doesn't know about and simply
> not save them. There's little cost to having them pass through MCF; the big
> cost is extraction, which you're stuck with because Alfresco does it no
> matter what. So I'm not sure what a white-list transformer does for you.
>
> But in any case, there's already a transformer that allows you to map
> metadata around -- the Metadata Adjuster. See:
>
> http://manifoldcf.apache.org/release/release-2.8.1/en_US/end-user-documentation.html#metadataadjuster
>
> This transformer maps metadata values, allows you to insert new ones, and
> also allows you to ONLY pass through the ones that are explicitly specified
> if you wish.
>
> Thanks,
> Karl
>
>
> On Mon, Oct 23, 2017 at 9:19 PM, Phillip Rhodes <[email protected]>
> wrote:
>>
>> FWIW, I now understand what I was missing that made me think Manifold
>> was running Tika when it wasn't. It turns out that Alfresco uses Tika
>> internally and when you get a document from Alfresco (using the
>> Webscripts connector anyway) the set of fields you get includes all
>> the image metadata and what-not (for image files). I never realized
>> this because I don't typically use Alfresco for images. But when I
>> added extra logging to the Alfresco WebScripts connector code, to spit
>> out the incoming field set, I see things like:
>>
>> Found property exif:yResolution = 72.0
>> Found property cm:owner = admin
>> Found property exif:isoSpeedRatings = 400
>> Found property exif:fNumber = 3.5
>> Found property sys:node-uuid = 0516a5cc-fc04-4512-a4ed-b595b7c3908b
>> Found property exif:pixelYDimension = 2048
>> Found property exif:resolutionUnit = Inch
>> Found property exif:dateTimeOriginal = 2005-01-09T16:00:55Z
>> Found property sys:locale = en_GB
>>
>> which explains why the Solr connector was trying to save fields like
>> exif_fNumber and exif_resolutionUnit. This came up because the
>> Alfresco instance I'm experimenting with has their default sample
>> workspace which includes images and things I don't normally touch.
>> :-)
>>
>> As for managing all this so my history doesn't contain all those
>> failure messages, I thought about creating a "WhitelistFieldTransform"
>> as a transform connection to drop any fields other than the ones that
>> are whitelisted. Two questions:
>>
>> 1. Does this seem like a reasonable approach, or is there a better way?
>>
>> 2. If this is reasonable and I create such a filter, would there be
>> any interest in having it contributed back to MCF?
>>
>>
>> Cheers,
>>
>>
>> Phil
>>
>> This message optimized for indexing by NSA PRISM
>>
>>
>> On Sun, Oct 15, 2017 at 10:11 AM, Karl Wright <[email protected]> wrote:
>> > Hi Phil,
>> >
>> > In most cases you can't modify the fields being output by the various
>> > connectors, but you don't have to use them. If you have an output connector
>> > that *insists* on using all of them in a destructive way, we'd like to know
>> > about that. Usually extra fields are harmless and only the ones you want in
>> > your schema are looked for.
>> >
>> > Karl
>> >
>> >
>> > On Sat, Oct 14, 2017 at 8:12 PM, Phillip Rhodes <[email protected]>
>> > wrote:
>> >>
>> >> On Sat, Oct 14, 2017 at 7:17 PM, Karl Wright <[email protected]>
>> >> wrote:
>> >> > Hi Phil,
>> >> >
>> >> > You are correct in asserting that in MCF it is the sum total of all the
>> >> > connections that the document passes through that determine its attribute
>> >> > set. That includes transformation connections as well as the repository
>> >> > connection.
>> >>
>> >> OK, sounds good.
>> >>
>> >> > Tika is one connection that does add a lot of fields and these depend not
>> >> > only on the configuration of the Tika connection, but also on the kind of
>> >> > document being extracted. If you want to figure out the sum total of what's
>> >> > possible, you will need to consult the Tika documentation. And yes, the
>> >> > field names Tika generates are created based on what Tika finds in the
>> >> > document.
>> >>
>> >> Gotcha. So if I want to limit the fields output to *only* a specific
>> >> set that is determined in advance, is there a way to accomplish that?
>> >>
>> >> > Alternatively, you can configure your job to send output to a null output
>> >> > connection. This connection records all attribute information for each
>> >> > document in the simple history, so you can get an idea what to expect.
>> >>
>> >> Excellent, I'll investigate that.
>> >>
>> >> > I'm a little confused about your statement that Tika runs even when it's not
>> >> > in a job's pipeline. That's not actually true, so I'm wondering what you
>> >> > are seeing.
>> >>
>> >> It's probable that I'm wrong. I just thought maybe there was some
>> >> default behavior, because I pointed MCF at a directory full of PDFs
>> >> without explicitly configuring Tika and I saw fields in the output
>> >> that I thought were probably generated by Tika. Likewise now I am
>> >> running a pipeline with no explicit Tika step and I see output fields
>> >> for EXIF stuff for images and the like, which I assumed came from
>> >> Tika.
>> >>
>> >>
>> >>
>> >> Phil
>> >
>> >
>
>
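
The whitelist idea raised in this thread (drop every metadata field that is not explicitly allowed before it reaches the output connection) can be sketched independently of ManifoldCF. The snippet below is only a minimal illustration in plain Python, not MCF connector code: the function name `filter_fields`, the whitelist contents, and the dict-of-lists representation of a document's metadata are all assumptions made for the example. The property names come from the log excerpt quoted in the thread.

```python
from typing import Dict, Iterable, List


def filter_fields(
    fields: Dict[str, List[str]], whitelist: Iterable[str]
) -> Dict[str, List[str]]:
    """Return a copy of `fields` that keeps only the whitelisted names."""
    allowed = set(whitelist)
    return {
        name: list(values)
        for name, values in fields.items()
        if name in allowed
    }


# Sample metadata using property names from the log excerpt in the thread.
incoming = {
    "exif:yResolution": ["72.0"],
    "cm:owner": ["admin"],
    "exif:fNumber": ["3.5"],
    "sys:node-uuid": ["0516a5cc-fc04-4512-a4ed-b595b7c3908b"],
}

# Hypothetical whitelist: keep only the fields the index schema defines.
kept = filter_fields(incoming, ["cm:owner", "sys:node-uuid"])
print(sorted(kept))  # ['cm:owner', 'sys:node-uuid']
```

The function returns a new mapping and leaves the input untouched, so the full field set is still available for logging or for comparison against what was actually passed on.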
