I've created a ticket to continue the discussion about whether we want such a feature and, if so, what it should look like: CONNECTORS-1193.

Karl
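[For reference, here is a minimal sketch of the dictionary-based matching Arcadius proposes in the thread below, using the builder-style API documented in the README of the robert-bor/aho-corasick project he links. The class name DictionaryFilter, the method shouldExclude, and the two sample keywords are hypothetical illustrations, not MCF code.]

  import java.util.Collection;
  import org.ahocorasick.trie.Emit;
  import org.ahocorasick.trie.Trie;

  public class DictionaryFilter
  {
    // Build the dictionary once; the resulting trie is reusable across documents.
    private static final Trie DICTIONARY = Trie.builder()
      .ignoreCase()
      .addKeyword("expired")
      .addKeyword("not found")
      .build();

    // Returns true if the text contains any dictionary entry. Note that
    // parseText() scans the entire text; stopping the scan at the first hit
    // is what the pull request cited in the thread proposes to add.
    public static boolean shouldExclude(String pageText)
    {
      Collection<Emit> emits = DICTIONARY.parseText(pageText);
      return !emits.isEmpty();
    }
  }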
On Wed, Apr 29, 2015 at 7:28 PM, Karl Wright <[email protected]> wrote:

> Hi Arcadius,
>
> The key question is: how big do you expect the dictionary to become?
>
> The current algorithm for finding content matches for determining whether
> a page is part of a login sequence uses regexps on a line-by-line basis.
> This is not ideal because there is no guarantee that the text will have
> line breaks, and so it might have to accumulate the entire document in
> memory, which is obviously very bad.
>
> Content matching is currently done within the confines of HTML; the HTML
> is parsed and only the content portions are matched. Tags are not
> checked. If the Aho-Corasick algorithm is used, it would need to be done
> the same way: one line at a time only.
>
> Karl
>
> On Wed, Apr 29, 2015 at 7:02 PM, Arcadius Ahouansou <[email protected]> wrote:
>
>> Hello Karl.
>>
>> I agree, this would be slower than the usual filtering by URL or HTTP
>> header.
>>
>> On the other hand, this would be a very useful feature: it could be used
>> to remove documents containing swear words from the index, remove adult
>> content, discard emails flagged as spam, etc.
>>
>> Regarding the implementation: so far in MCF, regexes have been used for
>> pattern matching. In the case of content filtering, the user will supply
>> a kind of "dictionary" that we will use to determine whether the
>> document will go through or not. The dictionary can grow quite a bit.
>>
>> An alternative to regexes may be the Aho-Corasick string-matching
>> algorithm. A Java implementation can be found at
>> https://github.com/robert-bor/aho-corasick
>> Let's say my dictionary has two entries, "expired" and "not found". The
>> algorithm will return either "expired", "not found", or both, depending
>> on what it finds in the document. This output could be used to decide
>> whether to index the document or not.
>>
>> In this specific case, where we only want to exclude content from the
>> index, we could exit on the first match, i.e. there is no need to match
>> the whole dictionary. There is a pull request dealing with that:
>> https://github.com/robert-bor/aho-corasick/pull/14
>>
>> Thanks.
>>
>> Arcadius.
>>
>> On 29 April 2015 at 22:50, Karl Wright <[email protected]> wrote:
>>
>>> Hi Arcadius,
>>>
>>> A feature like this is possible but could be very slow, since there's
>>> no definite limit on the size of an HTML page.
>>>
>>> Karl
>>>
>>> On Wed, Apr 29, 2015 at 5:01 PM, Arcadius Ahouansou <[email protected]> wrote:
>>>
>>>> Hello Karl.
>>>>
>>>> I have checked the Simple History and I could see deletions.
>>>>
>>>> I have recently migrated my config to MCF 2.0.2 without migrating all
>>>> crawled data. That may be the reason why I have documents in Solr that
>>>> lead to 404s.
>>>>
>>>> Clearing my Solr index and resetting the crawler may help solve my
>>>> problem.
>>>>
>>>> On the other hand, some of the pages I am crawling display friendly
>>>> messages such as "The document you are looking for has expired" with a
>>>> 200 HTTP status instead of a 404.
>>>> How feasible would it be to exclude a document from the index based on
>>>> the content of the document?
>>>>
>>>> Thank you very much.
>>>>
>>>> Arcadius.
>>>>
>>>> On 28 April 2015 at 12:18, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Arcadius,
>>>>>
>>>>> So, to be clear, the repository connection you are using is a web
>>>>> connection type?
>>>>>
>>>>> The web connector has the following code, which should prevent
>>>>> indexing of any content that was received with a response code other
>>>>> than 200:
>>>>>
>>>>> int responseCode = cache.getResponseCode(documentIdentifier);
>>>>> if (responseCode != 200)
>>>>> {
>>>>>   if (Logging.connectors.isDebugEnabled())
>>>>>     Logging.connectors.debug("Web: For document '"+documentIdentifier+
>>>>>       "', not indexing because response code not indexable: "+responseCode);
>>>>>   errorCode = "RESPONSECODENOTINDEXABLE";
>>>>>   errorDesc = "HTTP response code not indexable ("+responseCode+")";
>>>>>   activities.noDocument(documentIdentifier,versionString);
>>>>>   return;
>>>>> }
>>>>>
>>>>> You should indeed see these cases logged in the simple history and no
>>>>> document sent to Solr. Is this not what you are seeing?
>>>>>
>>>>> Karl
>>>>>
>>>>> On Tue, Apr 28, 2015 at 7:01 AM, Arcadius Ahouansou <[email protected]> wrote:
>>>>>
>>>>>> Hello.
>>>>>>
>>>>>> I am using MCF 2.0.2 for crawling the web and ingesting data into
>>>>>> Solr.
>>>>>>
>>>>>> MCF has ingested into Solr documents that returned an HTTP error,
>>>>>> let's say 401, 403, or 404, or that have certain content like "this
>>>>>> page has expired and has been removed".
>>>>>>
>>>>>> The question is: is there a way to tell MCF to ingest
>>>>>> - only documents not containing certain content like "Not Found", or
>>>>>> - only documents excluding those with a 401, 403, 404, 500, ... header?
>>>>>>
>>>>>> Thank you very much.
>>>>>>
>>>>>> Arcadius.
>>>>
>>>> --
>>>> Arcadius Ahouansou
>>>> Menelic Ltd | Information is Power
>>>> M: 07908761999
>>>> W: www.menelic.com
>>>> ---
>>
>> --
>> Arcadius Ahouansou
>> Menelic Ltd | Information is Power
>> M: 07908761999
>> W: www.menelic.com
>> ---
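[To make Karl's constraint concrete: a sketch of how the dictionary could be applied one line at a time, so the document never has to be accumulated in memory. This is illustrative only; LineByLineMatcher and containsAnyKeyword are hypothetical names, and, like the current regexp matching, the approach misses any phrase that happens to span a line break.]

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.Reader;
  import org.ahocorasick.trie.Trie;

  public class LineByLineMatcher
  {
    // Streams the extracted page text and tests each line against the
    // dictionary trie, exiting on the first line that contains a match
    // rather than scanning the rest of the document.
    public static boolean containsAnyKeyword(Reader extractedText, Trie dictionary)
      throws IOException
    {
      BufferedReader reader = new BufferedReader(extractedText);
      String line;
      while ((line = reader.readLine()) != null)
      {
        if (!dictionary.parseText(line).isEmpty())
          return true;
      }
      return false;
    }
  }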
