On Thu, Dec 21, 2017 at 8:35 PM, Karl Wright <[email protected]> wrote:
> Well, there are some differences; "Solr Cell" (as they used to call it)
> generates a couple of fields that the standard Tika extractor in MCF won't.
> But other than that it should work.

By and large I don't think I care about those fields, so that part
shouldn't be an issue.

> Note that you can still use the extracting update handler in the solr
> connector; since the input will always be text/plain Tika shouldn't do
> anything to the document on the Solr side. If that doesn't happen to be
> true, you can use the standard Solr input handler,

FWIW, it appears that even when using the Tika connector in MCF, what gets
sent to Solr still triggers some Tika behavior if you have the "use extract
handler" option turned on. When I did this I got all sorts of weird Tika
parse exceptions and what-not from Solr. Fortunately, just sending
everything to Solr using the standard handler worked, and I'm at a point
now where *almost* everything works.

The one issue I'm still seeing is this: when using the Tika connector, it
seems that some date-oriented fields are being generated with a value that
does not have the trailing 'Z' timezone flag. This causes a Solr error if
the corresponding field is date typed, as Solr requires dates to be in UTC,
marked with that trailing 'Z'.

Ex:

dcterms:created: 2011-03-02T08:44:45
found field: dcterms:modified: 2011-03-02T08:44:45
Last-Save-Date: 2011-03-02T08:44:45
meta:save-date: 2011-03-02T08:44:45

Solr wants all of these to look like

2011-03-02T08:44:45Z

Is there any way, using any built-in MCF functionality, to forcibly munge
the field values to correct this? If not, could I accomplish that by
writing a custom Transform connector?

Thanks,

Phil
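
To make the requested munging concrete, here is a minimal sketch of the string fix itself, written in Python purely for illustration. It is not MCF code and uses no MCF API; the function names and the DATE_FIELDS list are hypothetical, with the field names taken from the examples above. It also assumes the zone-less timestamps are already in UTC, which may not hold for every document, since appending 'Z' to a local time would silently shift it.

```python
import re

# Matches an ISO 8601 date-time with NO timezone designator,
# e.g. "2011-03-02T08:44:45" or "2011-03-02T08:44:45.123".
_NAIVE_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?")

# Hypothetical list of fields to fix, taken from the examples in the message.
DATE_FIELDS = (
    "dcterms:created",
    "dcterms:modified",
    "Last-Save-Date",
    "meta:save-date",
)


def ensure_utc_suffix(value):
    """Append 'Z' when value is a timestamp lacking a timezone designator.

    ASSUMPTION: the zone-less timestamp is already in UTC. Values that
    already carry 'Z' or an offset, or that are not timestamps at all,
    are returned unchanged.
    """
    if _NAIVE_TIMESTAMP.fullmatch(value):
        return value + "Z"
    return value


def normalize_date_fields(metadata):
    """Return a copy of metadata with the listed date fields normalized."""
    fixed = dict(metadata)
    for name in DATE_FIELDS:
        if name in fixed:
            fixed[name] = ensure_utc_suffix(fixed[name])
    return fixed
```

The same check-and-append logic could live wherever field values can be rewritten before they reach Solr. Fields outside DATE_FIELDS are deliberately left alone so that non-date values that merely look like timestamps are not altered.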
