Re: How to determine the set of all possible fields in MCF output?

Phillip Rhodes Tue, 24 Oct 2017 09:29:49 -0700

Hmm... I tried the MetadataAdjuster before and unchecked "Keep all
metadata" and it still seemed to send everything through.   Probably I
just did something wrong... I'll try it again.



Phil

This message optimized for indexing by NSA PRISM


On Tue, Oct 24, 2017 at 3:31 AM, Karl Wright <[email protected]> wrote:
> Hi Phil,
>
> Solr will certainly skip any fields that it doesn't know about and simply
> not save them.  There's little cost to having them pass through MCF; the big
> cost is extraction, which you're stuck with because Alfresco does it no
> matter what.  So I'm not sure what a white-list transformer does for you.
>
> But in any case, there's already a transformer that allows you to map
> metadata around -- the Metadata Adjuster.  See:
>
> http://manifoldcf.apache.org/release/release-2.8.1/en_US/end-user-documentation.html#metadataadjuster
>
> This transformer maps metadata values, allows you to insert new ones, and
> also allows you to ONLY pass through the ones that are explicitly specified
> if you wish.
>
> Thanks,
> Karl
>
>
> On Mon, Oct 23, 2017 at 9:19 PM, Phillip Rhodes <[email protected]> wrote:
>>
>> FWIW, I now understand what I was missing that made me think Manifold
>> was running Tika when it wasn't.  It turns out that Alfresco uses Tika
>> internally and when you get a document from Alfresco (using the
>> WebScripts connector anyway) the set of fields you get includes all
>> the image metadata and what-not (for image files).  I never realized
>> this because I don't typically use Alfresco for images.  But when I
>> added extra logging to the Alfresco WebScripts connector code, to spit
>> out the incoming field set, I see things like:
>>
>> Found property exif:yResolution = 72.0
>> Found property cm:owner = admin
>> Found property exif:isoSpeedRatings = 400
>> Found property exif:fNumber = 3.5
>> Found property sys:node-uuid = 0516a5cc-fc04-4512-a4ed-b595b7c3908b
>> Found property exif:pixelYDimension = 2048
>> Found property exif:resolutionUnit = Inch
>> Found property exif:dateTimeOriginal = 2005-01-09T16:00:55Z
>> Found property sys:locale = en_GB
>>
>> which explains why the Solr connector was trying to save fields like
>> exif_fNumber and exif_resolutionUnit.   This came up because the
>> Alfresco instance I'm experimenting with has their default sample
>> workspace which includes images and things I don't normally touch.
>> :-)
>>
>> As for managing all this so my history doesn't contain all those
>> failure messages, I thought about creating a "WhitelistFieldTransform"
>> as a transform connection to drop any fields other than the ones that
>> are whitelisted.    Two questions:
>>
>> 1. Does this seem like a reasonable approach, or is there a better way?
>>
>> 2. If this is reasonable and I create such a filter, would there be
>> any interest in having it contributed back to MCF?
>>
>>
>> Cheers,
>>
>>
>> Phil
>>
>> This message optimized for indexing by NSA PRISM
>>
>>
>> On Sun, Oct 15, 2017 at 10:11 AM, Karl Wright <[email protected]> wrote:
>> > Hi Phil,
>> >
>> > In most cases you can't modify the fields being output by the various
>> > connectors, but you don't have to use them.  If you have an output connector
>> > that *insists* on using all of them in a destructive way, we'd like to know
>> > about that.  Usually extra fields are harmless and only the ones you want in
>> > your schema are looked for.
>> >
>> > Karl
>> >
>> >
>> > On Sat, Oct 14, 2017 at 8:12 PM, Phillip Rhodes <[email protected]> wrote:
>> >>
>> >> On Sat, Oct 14, 2017 at 7:17 PM, Karl Wright <[email protected]> wrote:
>> >> > Hi Phil,
>> >> >
>> >> > You are correct in asserting that in MCF it is the sum total of all the
>> >> > connections that the document passes through that determine its attribute
>> >> > set.  That includes transformation connections as well as the repository
>> >> > connection.
>> >>
>> >> OK, sounds good.
>> >>
>> >> > Tika is one connection that does add a lot of fields and these depend not
>> >> > only on the configuration of the Tika connection, but also on the kind of
>> >> > document being extracted.  If you want to figure out the sum total of what's
>> >> > possible, you will need to consult the Tika documentation.  And yes, the
>> >> > field names Tika generates are created based on what Tika finds in the
>> >> > document.
>> >>
>> >> Gotcha.   So if I want to limit the fields output to *only* a specific
>> >> set that is determined in advance, is there a way to accomplish that?
>> >>
>> >> > Alternatively, you can configure your job to send output to a null output
>> >> > connection.  This connection records all attribute information for each
>> >> > document in the simple history, so you can get an idea what to expect.
>> >>
>> >> Excellent, I'll investigate that.
>> >>
>> >> > I'm a little confused about your statement that Tika runs even when it's not
>> >> > in a job's pipeline.  That's not actually true, so I'm wondering what you
>> >> > are seeing.
>> >>
>> >> It's probable that I'm wrong.  I just thought maybe there was some
>> >> default behavior, because I pointed MCF at a directory full of PDFs
>> >> without explicitly configuring Tika and I saw fields in the output
>> >> that I thought were probably generated by Tika.  Likewise now I am
>> >> running a pipeline with no explicit Tika step and I see output fields
>> >> for EXIF stuff for images and the like, which I assumed came from
>> >> Tika.
>> >>
>> >>
>> >>
>> >> Phil
>> >
>> >
>
>

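The two ideas discussed in this thread (finding out which fields actually
show up, and passing through only a whitelisted set) can be sketched
outside of MCF as follows. This is only an illustration under stated
assumptions: it is not ManifoldCF's transformation connector API and not
the Metadata Adjuster's implementation. The function names are
hypothetical, the input is the "Found property ..." logging shown in the
thread, and the colon-to-underscore mapping is inferred from the
exif:fNumber / exif_fNumber pair mentioned above.

```python
import re

# Matches the extra connector logging quoted in the thread, e.g.
# "Found property exif:fNumber = 3.5". Leading quote markers are tolerated
# because the pattern is searched for anywhere in the line.
PROPERTY_LINE = re.compile(r"Found property (\S+) = (.*)$")


def collect_properties(log_lines):
    """Return {property name: set of observed values} from the log lines."""
    seen = {}
    for line in log_lines:
        match = PROPERTY_LINE.search(line)
        if match:
            name, value = match.groups()
            seen.setdefault(name, set()).add(value.strip())
    return seen


def to_solr_field(name):
    """Assumed mapping: the thread shows exif:fNumber arriving as exif_fNumber."""
    return name.replace(":", "_")


def whitelist(metadata, allowed):
    """Keep only the fields whose names are explicitly allowed."""
    return {key: value for key, value in metadata.items() if key in allowed}


log = [
    "Found property exif:fNumber = 3.5",
    "Found property cm:owner = admin",
    "Found property sys:locale = en_GB",
]

# Every field name seen so far, in the form the output connection would see.
fields = sorted(to_solr_field(name) for name in collect_properties(log))

# Drop everything except the fields the Solr schema is expected to define.
kept = whitelist({"cm_owner": "admin", "exif_fNumber": "3.5"}, {"cm_owner"})
```

Running the collection step over the history of a job that writes to a
null output connection, or over connector logging like the lines above,
gives an observed field list rather than a guaranteed complete one, since
the fields depend on the documents crawled.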