Hi Arcadius,
So, to be clear, the repository connection you are using is a web
connection type?
The web connector has the following code, which should prevent indexing of
any content that was received with a response code other than 200:
    int responseCode = cache.getResponseCode(documentIdentifier);
    if (responseCode != 200)
    {
      if (Logging.connectors.isDebugEnabled())
        Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not indexing because response code not indexable: "+responseCode);
      errorCode = "RESPONSECODENOTINDEXABLE";
      errorDesc = "HTTP response code not indexable ("+responseCode+")";
      activities.noDocument(documentIdentifier,versionString);
      return;
    }
You should indeed see these cases logged in the simple history and no
document sent to Solr. Is this not what you are seeing?
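
To make the two filters you asked about concrete, here is a minimal
standalone sketch. It is not ManifoldCF code: the class name
IndexabilityCheck, the method isIndexable, and the phrase list are all
hypothetical and exist only for illustration. The status check mirrors the
connector snippet above; the phrase check is only an assumption about how a
content-based exclusion could look, not something the web connector is
known to do in this form.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in ManifoldCF.
public class IndexabilityCheck {

    // Returns true only when the response code is 200 and the body
    // contains none of the excluded phrases (case-insensitive match).
    public static boolean isIndexable(int responseCode, String body,
                                      List<String> excludedPhrases) {
        // Same rule as the connector snippet: anything but 200 is skipped.
        if (responseCode != 200) {
            return false;
        }
        // No body means there is nothing to match against.
        if (body == null) {
            return true;
        }
        String lowerBody = body.toLowerCase(Locale.ROOT);
        for (String phrase : excludedPhrases) {
            if (lowerBody.contains(phrase.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed example phrases, taken from the question in this thread.
        List<String> excluded = Arrays.asList("Not Found", "this page has expired");

        System.out.println(isIndexable(404, "anything", excluded));
        System.out.println(isIndexable(200, "This page has expired and has been removed", excluded));
        System.out.println(isIndexable(200, "Regular article text", excluded));
    }
}
```

The phrase check matters for "soft 404" pages, which return 200 but carry an
error message in the body; the response code test alone cannot catch those.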
Karl
On Tue, Apr 28, 2015 at 7:01 AM, Arcadius Ahouansou <[email protected]>
wrote:
>
> Hello.
>
> I am using MCF 2.0.2 for crawling the web and ingesting data into Solr.
>
> MCF has ingested into Solr documents that returned an HTTP error, let's say
> 401, 403, 404, or that have certain content like "this page has expired and
> has been removed".
>
> The question is:
> is there a way to tell MCF to ingest
> - only documents not containing certain content like "Not Found", or
> - only documents excluding those with status 401, 403, 404, 500, ...
>
> Thank you very much.
>
> Arcadius.
>