Thanks Karl. I will comment on it ASAP.

On 30 April 2015 at 08:26, Karl Wright <[email protected]> wrote:
I've created a ticket to continue the discussion about whether we want such a feature and, if so, what it should look like: CONNECTORS-1193.

Karl


On Wed, Apr 29, 2015 at 7:28 PM, Karl Wright <[email protected]> wrote:

Hi Arcadius,

The key question is, how big do you expect the dictionary to become?

The current algorithm for finding content matches, used to determine whether a page is part of a login sequence, applies regexps on a line-by-line basis. This is not ideal because there is no guarantee that the text will have line breaks, and so it might have to accumulate the entire document in memory, which is obviously very bad.

Content matching is currently done within the confines of HTML; the HTML is parsed and only the content portions are matched. Tags are not checked. If the Aho-Corasick algorithm is used, it would need to be done the same way: one line at a time only.

Karl


On Wed, Apr 29, 2015 at 7:02 PM, Arcadius Ahouansou <[email protected]> wrote:

Hello Karl.

I agree, this would be slower than the usual filtering by URL or HTTP header.

On the other hand, this would be a very useful feature: it could be used to remove documents containing swear words from the index, remove adult content, discard emails flagged as spam, etc.

Regarding the implementation: so far in MCF, regexes have been used for pattern matching. In the case of content filtering, the user would supply a kind of "dictionary" that we would use to determine whether the document goes through or not. The dictionary can grow quite a bit.

The other alternative to regexes may be the Aho-Corasick string matching algorithm. A Java implementation can be found at https://github.com/robert-bor/aho-corasick

Let's say my dictionary has two entries, "expired" and "not found". The algorithm will return either "expired", "not found", or both, depending on what it found in the document. This output could be used to decide whether to index the document or not.

In this specific case, where we only want to exclude content from the index, we could exit on the first match, i.e. there is no need to match the whole dictionary. There is a pull request dealing with that: https://github.com/robert-bor/aho-corasick/pull/14

Thanks.

Arcadius.


On 29 April 2015 at 22:50, Karl Wright <[email protected]> wrote:

Hi Arcadius,

A feature like this is possible but could be very slow, since there's no definite limit on the size of an HTML page.

Karl


On Wed, Apr 29, 2015 at 5:01 PM, Arcadius Ahouansou <[email protected]> wrote:

Hello Karl.

I have checked the Simple History and I could see deletions.

I have recently migrated my config to MCF 2.0.2 without migrating all crawled data. That may be the reason why I have documents in Solr that lead to 404.

Clearing my Solr index and resetting the crawler may help solve my problem.

On the other hand, some of the pages I am crawling display friendly messages such as "The document you are looking for has expired" with a 200 HTTP status instead of a 404. How feasible would it be to exclude documents from the index based on the content of the document?

Thank you very much.

Arcadius.
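For illustration only, here is a minimal sketch of the dictionary matching proposed earlier in the thread, assuming the builder API of the robert-bor/aho-corasick library linked above; the keyword list and class name are made up for the example and are not MCF code:

    // Hypothetical example: match a small dictionary against document text
    // using the org.ahocorasick library referenced in the thread.
    import org.ahocorasick.trie.Emit;
    import org.ahocorasick.trie.Trie;

    import java.util.Collection;

    public class DictionaryMatchSketch
    {
      public static void main(String[] args)
      {
        // Build the dictionary once; it may hold many entries.
        Trie dictionary = Trie.builder()
            .addKeyword("expired")
            .addKeyword("not found")
            .build();

        String documentText = "The document you are looking for has expired";

        // parseText returns every dictionary entry found in the text.
        Collection<Emit> matches = dictionary.parseText(documentText);

        // For pure exclusion, any hit is enough to decide not to index;
        // the pull request mentioned above would allow stopping at the first hit.
        System.out.println(matches.isEmpty() ? "index document" : "skip document");
      }
    }

For case-insensitive matching of phrases like "Not Found", the document text and the dictionary entries would need to be normalised to the same case first.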
On 28 April 2015 at 12:18, Karl Wright <[email protected]> wrote:

Hi Arcadius,

So, to be clear, the repository connection you are using is a web connection type?

The web connector has the following code, which should prevent indexing of any content that was received with a response code other than 200:

    int responseCode = cache.getResponseCode(documentIdentifier);
    if (responseCode != 200)
    {
      if (Logging.connectors.isDebugEnabled())
        Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not indexing because response code not indexable: "+responseCode);
      errorCode = "RESPONSECODENOTINDEXABLE";
      errorDesc = "HTTP response code not indexable ("+responseCode+")";
      activities.noDocument(documentIdentifier,versionString);
      return;
    }

You should indeed see these cases logged in the Simple History, and no document sent to Solr. Is this not what you are seeing?

Karl


On Tue, Apr 28, 2015 at 7:01 AM, Arcadius Ahouansou <[email protected]> wrote:

Hello.

I am using MCF 2.0.2 for crawling the web and ingesting data into Solr.

MCF has ingested into Solr documents that returned an HTTP error (let's say 401, 403, or 404) or that contain certain content like "this page has expired and has been removed".

The question is: is there a way to tell MCF to ingest
- only documents not containing certain content like "Not Found", or
- only documents excluding those with response codes 401, 403, 404, 500, ...?

Thank you very much.

Arcadius.


--
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---
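As a rough, standalone illustration of the two exclusions asked about in the original question above, here is a sketch that skips selected HTTP status codes and "soft error" phrases served with a 200 status. None of this is existing MCF code; all names are hypothetical:

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical helper, not part of ManifoldCF: decide whether a fetched
    // page should be passed on for indexing.
    public class IndexabilityCheckSketch
    {
      // Status codes from the original question; the web connector quoted
      // above actually skips anything other than 200.
      private static final List<Integer> SKIPPED_CODES = Arrays.asList(401, 403, 404, 500);
      private static final List<String> SKIPPED_PHRASES =
          Arrays.asList("not found", "has expired");

      static boolean shouldIndex(int responseCode, String body)
      {
        if (SKIPPED_CODES.contains(responseCode))
          return false;
        // Content check for error pages that are served with a 200 status.
        String text = body.toLowerCase();
        for (String phrase : SKIPPED_PHRASES)
        {
          if (text.contains(phrase))
            return false;
        }
        return true;
      }

      public static void main(String[] args)
      {
        // An "expired" page returned with a 200 status would be excluded (prints false).
        System.out.println(shouldIndex(200, "The document you are looking for has expired"));
      }
    }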
