As far as I know, the wonkiness in the data I'm seeing actually reflects an underlying problem with digital images. Apparently some or all of the date-typed fields mandated by EXIF and XMP don't require time-zone information, so an image can legitimately have a date/time field such as "created date" with no time zone at all. But since Solr requires date-typed fields to be in UTC, if you want to store that value in a date field, you have to impute the correct time zone (or a reasonable approximation).
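To make "impute" a bit more concrete, here is a rough sketch in Python. It is only an illustration of the idea, not the MCF transformer itself (that would be Java), and the function name `to_solr_date` and its `midnight` flag are made up for this example. It assumes a value with no zone information can simply be treated as UTC, and it can optionally collapse the time of day to midnight UTC.

```python
from datetime import datetime, timezone


def to_solr_date(value: str, midnight: bool = False) -> str:
    """Return an ISO-8601 timestamp in the UTC 'Z' form that Solr date fields expect.

    Hypothetical helper for illustration only.
    """
    text = value.strip()
    # Older Python versions reject a trailing 'Z' in fromisoformat(),
    # so rewrite it as an explicit UTC offset first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: no zone info means we pretend the value is already UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    else:
        # A real offset is present, so convert it to UTC properly.
        parsed = parsed.astimezone(timezone.utc)
    if midnight:
        # Coarser option: keep only the date and force the time to 00:00:00 UTC.
        parsed = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    print(to_solr_date("2011-03-02T08:44:45"))
    print(to_solr_date("2011-03-02T08:44:45", midnight=True))
```

Treating a zone-less value as UTC can be wrong by up to a day's worth of hours in either direction, which is only acceptable if nobody searches at that granularity.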
In my case, I doubt anybody is ever going to search images in a way where a difference of a few hours matters, so I think I'm just going to force everything to a time value of midnight UTC on the date in question.

Right now I'm exploring writing my own custom transformer to do the data munging. It might be overkill, but I wanted to do it just to learn that side of MCF if nothing else. So far the transformer I threw together seems to be working.

Thanks,

Phil

This message optimized for indexing by NSA PRISM

On Fri, Dec 22, 2017 at 7:27 PM, Karl Wright wrote:
> Hi Phil,
>
> Are these fields extracted by Tika from your document? Just curious,
> because if it's in MCF itself we could do something about it.
>
> Anyhow, what you want is the metadata adjuster:
>
> https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html#metadataadjuster
>
> Karl
>
>
> On Fri, Dec 22, 2017 at 1:47 AM, Phillip Rhodes wrote:
>>
>> On Thu, Dec 21, 2017 at 8:35 PM, Karl Wright wrote:
>> > Well, there are some differences; "Solr Cell" (as they used to call it)
>> > generates a couple of fields that the standard Tika extractor in MCF
>> > won't.
>> > But other than that it should work.
>>
>> By and large I don't think I care about those fields, so that part
>> shouldn't be an issue.
>>
>> > Note that you can still use the extracting update handler in the solr
>> > connector; since the input will always be text/plain Tika shouldn't do
>> > anything to the document on the Solr side. If that doesn't happen to be
>> > true, you can use the standard Solr input handler,
>>
>> FWIW, it appears that even when using the Tika connector in MCF, what
>> gets sent to Solr still triggers some Tika behavior if you have the
>> "use extract handler" option turned on.
>> When I did this I got all sorts of weird Tika parse exceptions and
>> what-not from Solr.
>>
>> Fortunately just sending everything to Solr using the standard handler
>> worked and I'm at a point now where *almost* everything works.
>>
>> The one issue I'm still seeing is this: when using the Tika
>> connector, it seems that some date oriented fields are being
>> generated with a value that does not have the trailing 'Z' timezone
>> flag. This causes a Solr error if the corresponding field is date
>> typed, as Solr requires dates to be in that UTC timezone.
>>
>> Ex:
>>
>> dcterms:created: 2011-03-02T08:44:45
>> found field: dcterms:modified: 2011-03-02T08:44:45
>> Last-Save-Date: 2011-03-02T08:44:45
>> meta:save-date: 2011-03-02T08:44:45
>>
>> Solr wants all of these to look like
>>
>> 2011-03-02T08:44:45Z
>>
>> Is there any way, using any built in MCF functionality, to forcibly
>> munge the field values to correct this? If not, could I accomplish
>> that by writing a custom Transform connector?
>>
>>
>> Thanks,
>>
>>
>> Phil
