Re: Mapping Webcrawler metadata

Karl Wright Mon, 14 Jul 2014 06:03:08 -0700

Hi Andrea,

The web crawler connector sends along all HTTP header values as metadata,
EXCEPT for certain explicitly excluded ones.  The excluded headers are
those which are involved in authorization or which would change on every
fetch.
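
To illustrate the idea (not the connector's actual code), here is a minimal
Python sketch of that kind of filtering.  The function name and the excluded
header list are assumptions made up for the example; the real exclusion list
lives in the web connector source and may differ.

```python
# Hypothetical exclusion list: headers tied to authorization, or that
# change on every fetch. NOT the actual list used by ManifoldCF.
EXCLUDED_HEADERS = {
    "authorization",
    "www-authenticate",
    "set-cookie",
    "date",
    "age",
}


def filter_metadata_headers(headers, excluded=EXCLUDED_HEADERS):
    """Return the headers that would be passed along as metadata.

    Header names are compared case-insensitively, since HTTP header
    names are not case-sensitive.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in excluded
    }


response_headers = {
    "Content-Type": "text/html",
    "Last-Modified": "Mon, 14 Jul 2014 06:03:08 GMT",
    "Set-Cookie": "session=abc",
    "Date": "Mon, 14 Jul 2014 13:03:08 GMT",
}
metadata = filter_metadata_headers(response_headers)
```

With these assumed exclusions, only Content-Type and Last-Modified would
survive as metadata candidates for the Solr Field Mapping tab.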

The kinds of metadata you list do not seem to be coming from the web
connector, but rather from Solr Cell (Tika), which is the extracting update
handler in Solr.  I don't know the full set of fields Tika can generate.
The Tika-generated metadata fields cannot be mapped using the Solr Field
Mapping tab because that extraction takes place in Solr, not in ManifoldCF.
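
Since the extraction happens on the Solr side, any renaming of Tika fields
would have to be done there.  As a rough sketch only: Solr's extracting
request handler accepts "fmap.<source>=<target>" and "uprefix" parameters,
and the Python below just builds such a parameter string.  The helper name
and the example field names are assumptions for illustration; in practice
these would more likely be set as handler defaults in solrconfig.xml.

```python
from urllib.parse import urlencode


def build_extract_params(field_map, unknown_prefix="ignored_"):
    """Build a query string for Solr's extracting request handler.

    field_map maps a Tika-generated field name to the Solr field it
    should be stored in. unknown_prefix is applied by Solr to extracted
    fields that are not defined in the schema.
    """
    params = {"uprefix": unknown_prefix}
    for tika_field, solr_field in field_map.items():
        params["fmap." + tika_field] = solr_field
    return urlencode(params)


# Hypothetical mapping: store Tika's "content" in a Solr field named "text".
query = build_extract_params({"content": "text"})
```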

MCF 1.7 will have the option of running Tika locally in MCF, as a
transformation connector, and not using Solr's extracting update handler,
so you should have better control when 1.7 is released.

Thanks,
Karl

On Mon, Jul 14, 2014 at 7:16 AM, Andrea Piemontese <[email protected]>
wrote:

> Hi All,
>
> I'm trying to work out which information/metadata is extracted by the
> WebcrawlerConnector to be imported and indexed by the SolrConnector.
>
> Executing a Job with WebcrawlerConnector as input and SolrConnector as
> output, the metadata I get in Solr are the following:
>
> - links
> - id
> - author
> - authors
> - title
> - content_type
> - resourcename
> - content
> - _version_
>
> Is there a way to know which metadata are extracted by the
> WebcrawlerConnector?
> In other words, which metadata can I use in the "Solr Field Mapping"
> tab of the job configuration?
>
> Thanks a lot in advance.
>
