Hi Arcadius,

A feature like this is possible but could be very slow, since there's no definite limit on the size of an HTML page.
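To make the cost concern concrete, here is a minimal sketch of a content check that stays bounded by scanning only the first N bytes of the page for known "soft 404" phrases. Everything in it is an assumption for illustration: the class name SoftErrorFilter, the method looksLikeSoftError, the marker list, the byte limit, and the UTF-8 decoding are all hypothetical, and none of it is an existing ManifoldCF API or configuration option. How such a check would be wired into the web connector is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of ManifoldCF.
final class SoftErrorFilter {
  private SoftErrorFilter() {}

  /**
   * Returns true if any marker phrase occurs (case-insensitively) within
   * the first maxBytes bytes of the body. Reading stops at maxBytes, so the
   * cost is bounded no matter how large the page is. Assumes UTF-8 content.
   */
  static boolean looksLikeSoftError(InputStream body, List<String> markers, int maxBytes) {
    byte[] buffer = new byte[maxBytes];
    int filled = 0;
    try {
      // Fill the buffer until it is full or the stream ends.
      while (filled < maxBytes) {
        int n = body.read(buffer, filled, maxBytes - filled);
        if (n < 0) {
          break;
        }
        filled += n;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    String prefix =
        new String(buffer, 0, filled, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    for (String marker : markers) {
      if (prefix.contains(marker.toLowerCase(Locale.ROOT))) {
        return true;
      }
    }
    return false;
  }
}
```

The trade-off is that a phrase appearing after the byte limit is missed, so the limit would need to be chosen with the target sites in mind.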
Karl

On Wed, Apr 29, 2015 at 5:01 PM, Arcadius Ahouansou <[email protected]> wrote:

>
> Hello Karl.
>
> I have checked the Simple History and I could see deletions.
>
> I have recently migrated my config to MCF 2.0.2 without migrating all
> crawled data. That may be the reason why I have documents in Solr that lead
> to 404.
>
> Clearing my Solr index and resetting the crawler may help solve my problem.
>
> On the other hand, some of the pages I am crawling display friendly
> messages such as "The document you are looking for has expired" with a 200
> HTTP header instead of 404.
> How feasible would it be to exclude documents from the index based on the
> content of the document?
>
> Thank you very much.
>
> Arcadius.
>
>
>
> On 28 April 2015 at 12:18, Karl Wright <[email protected]> wrote:
>
>> Hi Arcadius,
>>
>> So, to be clear, the repository connection you are using is a web
>> connection type?
>>
>> The web connector has the following code which should prevent indexing of
>> any content that was received with a response code other than 200:
>>
>> int responseCode = cache.getResponseCode(documentIdentifier);
>> if (responseCode != 200)
>> {
>>   if (Logging.connectors.isDebugEnabled())
>>     Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not indexing because response code not indexable: "+responseCode);
>>   errorCode = "RESPONSECODENOTINDEXABLE";
>>   errorDesc = "HTTP response code not indexable ("+responseCode+")";
>>   activities.noDocument(documentIdentifier,versionString);
>>   return;
>> }
>>
>>
>> You should indeed see these cases logged in the simple history and no
>> document sent to Solr. Is this not what you are seeing?
>>
>> Karl
>>
>>
>> On Tue, Apr 28, 2015 at 7:01 AM, Arcadius Ahouansou <[email protected]>
>> wrote:
>>
>>>
>>> Hello.
>>>
>>> I am using MCF 2.0.2 for crawling the web and ingesting data into Solr.
>>>
>>> MCF has ingested into Solr documents that returned an HTTP error, let's say
>>> 401, 403, 404, or that have certain content like "this page has expired and has
>>> been removed".
>>>
>>> The question is:
>>> is there a way to tell MCF to ingest
>>> - only documents not containing certain content like "Not Found" or
>>> - only documents excluding those with header 401, 403, 404, 500, ...
>>>
>>> Thank you very much.
>>>
>>> Arcadius.
>>>
>>
>>
>
>
> --
> Arcadius Ahouansou
> Menelic Ltd | Information is Power
> M: 07908761999
> W: www.menelic.com
> ---
>
