Hello Karl,

I have checked the Simple History and I can see deletions.
I have recently migrated my configuration to MCF 2.0.2 without migrating all of the crawled data. That may be the reason why Solr contains documents that lead to a 404. Clearing my Solr index and resetting the crawler may help solve my problem.

On the other hand, some of the pages I am crawling display friendly messages such as "The document you are looking for has expired" with a 200 HTTP status instead of a 404.
How feasible would it be to exclude documents from the index based on their content?

Thank you very much.

Arcadius.

On 28 April 2015 at 12:18, Karl Wright <[email protected]> wrote:

> Hi Arcadius,
>
> So, to be clear, the repository connection you are using is a web
> connection type?
>
> The web connector has the following code which should prevent indexing of
> any content that was received with a response code other than 200:
>
>     int responseCode = cache.getResponseCode(documentIdentifier);
>     if (responseCode != 200)
>     {
>       if (Logging.connectors.isDebugEnabled())
>         Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not indexing because response code not indexable: "+responseCode);
>       errorCode = "RESPONSECODENOTINDEXABLE";
>       errorDesc = "HTTP response code not indexable ("+responseCode+")";
>       activities.noDocument(documentIdentifier,versionString);
>       return;
>     }
>
> You should indeed see these cases logged in the simple history and no
> document sent to Solr. Is this not what you are seeing?
>
> Karl
>
>
> On Tue, Apr 28, 2015 at 7:01 AM, Arcadius Ahouansou <[email protected]>
> wrote:
>
>> Hello.
>>
>> I am using MCF 2.0.2 for crawling the web and ingesting data into Solr.
>>
>> MCF has ingested into Solr documents that returned an HTTP error, let's say
>> 401, 403, 404, or that have certain content like "this page has expired and
>> has been removed".
>>
>> The question is:
>> is there a way to tell MCF to ingest
>> - only documents not containing certain content like "Not Found", or
>> - only documents excluding those with status 401, 403, 404, 500, ...
>>
>> Thank you very much.
>>
>> Arcadius.

--
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
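
P.S. To make the content-based exclusion question concrete, here is a minimal sketch of the kind of check I have in mind. It combines the response-code test from the quoted connector code with a scan of the page body for "soft error" phrases. The class name SoftErrorFilter, the method isIndexable and the phrase list are all hypothetical illustrations and not part of ManifoldCF; where such a check could be hooked into a crawl is exactly the open question in this thread.

```java
import java.util.Locale;

// Hypothetical helper, NOT a ManifoldCF API: decides whether a fetched page
// should be sent to the index, based on its status code and its content.
final class SoftErrorFilter {

    // Example phrases taken from this thread (assumption: real sites will
    // need their own list). Stored in lower case for case-insensitive matching.
    private static final String[] ERROR_PHRASES = {
        "the document you are looking for has expired",
        "this page has expired and has been removed",
        "not found"
    };

    private SoftErrorFilter() {
        // Static utility class; no instances.
    }

    // Returns true only when the response code is 200 and the body contains
    // none of the configured error phrases.
    static boolean isIndexable(int responseCode, String body) {
        // Same rule as the quoted connector code: anything but 200 is skipped.
        if (responseCode != 200) {
            return false;
        }
        // A missing body has nothing to match against, so treat it as empty.
        String normalized = (body == null) ? "" : body.toLowerCase(Locale.ROOT);
        for (String phrase : ERROR_PHRASES) {
            if (normalized.contains(phrase)) {
                return false; // "Friendly" error page served with a 200.
            }
        }
        return true;
    }
}
```

A plain substring match like this is deliberately crude: a short phrase such as "not found" would also exclude legitimate pages that merely mention it, so longer site-specific messages are probably the safer choice.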
