Hi Andrea,

The web crawler connector sends along all HTTP header values as metadata, EXCEPT for certain explicitly excluded ones. The excluded headers are those that are involved in authorization or that would change on every fetch.
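As a rough illustration of that filtering, here is a minimal Python sketch. It is not ManifoldCF code: `headers_to_metadata` is a hypothetical helper, and the header names in `EXCLUDED_HEADERS` are assumptions chosen for the example, not the connector's actual exclusion list.

```python
# Hypothetical exclusion list, for illustration only. The connector's real
# list is not given in this thread; these names are assumed examples of
# headers tied to authorization or likely to change on every fetch.
EXCLUDED_HEADERS = {"authorization", "set-cookie", "date", "expires"}


def headers_to_metadata(headers):
    """Return the header name/value pairs that would be kept as metadata."""
    return {
        name: value
        for name, value in headers.items()
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower() not in EXCLUDED_HEADERS
    }


# Made-up response headers for a single fetched page.
response_headers = {
    "Content-Type": "text/html; charset=UTF-8",
    "Last-Modified": "Mon, 14 Jul 2014 10:00:00 GMT",
    "Set-Cookie": "session=abc123",
    "Date": "Mon, 14 Jul 2014 11:16:00 GMT",
}

metadata = headers_to_metadata(response_headers)
```

With these made-up inputs, only `Content-Type` and `Last-Modified` would survive as metadata candidates for the output connector.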
The kinds of metadata you list above seem to be coming not from the web connector, but rather from Solr Cell (Tika), which is the extracting update handler in Solr. I have no idea what fields Tika can possibly generate. The Tika-generated metadata fields cannot be mapped using the Solr Field Mapping tab because that extraction takes place in Solr, not in ManifoldCF.

MCF 1.7 will have the option of running Tika locally in MCF, as a transformation connector, instead of using Solr's extracting update handler, so you should have better control when 1.7 is released.

Thanks,
Karl

On Mon, Jul 14, 2014 at 7:16 AM, Andrea Piemontese <[email protected]> wrote:

> Hi All,
>
> I'm trying to map which information/metadata will be extracted by the
> WebcrawlerConnector to be imported and indexed by the SolrConnector.
>
> Executing a Job with WebcrawlerConnector as input and SolrConnector as
> output, the metadata I get in Solr are the following:
>
> - links
> - id
> - author
> - authors
> - title
> - content_type
> - resourcename
> - content
> - _version_
>
> Is there a way to know which metadata are extracted by the
> WebcrawlerConnector?
> In other words, which metadata can I use in the "Solr Field Mapping"
> tab of the job configuration?
>
> Thanks a lot in advance.
>
