I've created a ticket to continue the discussion about whether we want such a feature and, if so, what it should look like: CONNECTORS-1193.

Karl
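[For reference, here is a minimal sketch of the dictionary-based matching Arcadius proposes in the thread below, using the builder-style API documented in the README of the robert-bor/aho-corasick project he links. The class name DictionaryFilter, the method shouldExclude, and the two sample keywords are hypothetical illustrations, not MCF code.]

  import java.util.Collection;
  import org.ahocorasick.trie.Emit;
  import org.ahocorasick.trie.Trie;

  public class DictionaryFilter
  {
    // Build the dictionary once; the resulting trie is reusable across documents.
    private static final Trie DICTIONARY = Trie.builder()
      .ignoreCase()
      .addKeyword("expired")
      .addKeyword("not found")
      .build();

    // Returns true if the text contains any dictionary entry. Note that
    // parseText() scans the entire text; stopping the scan at the first hit
    // is what the pull request cited in the thread proposes to add.
    public static boolean shouldExclude(String pageText)
    {
      Collection<Emit> emits = DICTIONARY.parseText(pageText);
      return !emits.isEmpty();
    }
  }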
On Wed, Apr 29, 2015 at 7:28 PM, Karl Wright <[email protected]> wrote:

> Hi Arcadius,
>
> The key question is: how big do you expect the dictionary to become?
>
> The current algorithm for finding content matches for determining whether
> a page is part of a login sequence uses regexps on a line-by-line basis.
> This is not ideal because there is no guarantee that the text will have
> line breaks, and so it might have to accumulate the entire document in
> memory, which is obviously very bad.
>
> Content matching is currently done within the confines of HTML; the HTML
> is parsed and only the content portions are matched. Tags are not
> checked. If the Aho-Corasick algorithm is used, it would need to be done
> the same way: one line at a time only.
>
> Karl
>
> On Wed, Apr 29, 2015 at 7:02 PM, Arcadius Ahouansou <[email protected]> wrote:
>
>> Hello Karl.
>>
>> I agree, this would be slower than the usual filtering by URL or HTTP
>> header.
>>
>> On the other hand, this would be a very useful feature: it could be used
>> to remove documents containing swear words from the index, remove adult
>> content, discard emails flagged as spam, etc.
>>
>> Regarding the implementation: so far in MCF, regexes have been used for
>> pattern matching. In the case of content filtering, the user will supply
>> a kind of "dictionary" that we will use to determine whether the
>> document will go through or not. The dictionary can grow quite a bit.
>>
>> An alternative to regexes may be the Aho-Corasick string-matching
>> algorithm. A Java implementation can be found at
>> https://github.com/robert-bor/aho-corasick
>> Let's say my dictionary has two entries, "expired" and "not found". The
>> algorithm will return either "expired", "not found", or both, depending
>> on what it finds in the document. This output could be used to decide
>> whether to index the document or not.
>>
>> In this specific case, where we only want to exclude content from the
>> index, we could exit on the first match, i.e. there is no need to match
>> the whole dictionary. There is a pull request dealing with that:
>> https://github.com/robert-bor/aho-corasick/pull/14
>>
>> Thanks.
>>
>> Arcadius.
>>
>> On 29 April 2015 at 22:50, Karl Wright <[email protected]> wrote:
>>
>>> Hi Arcadius,
>>>
>>> A feature like this is possible but could be very slow, since there's
>>> no definite limit on the size of an HTML page.
>>>
>>> Karl
>>>
>>> On Wed, Apr 29, 2015 at 5:01 PM, Arcadius Ahouansou <[email protected]> wrote:
>>>
>>>> Hello Karl.
>>>>
>>>> I have checked the Simple History and I could see deletions.
>>>>
>>>> I have recently migrated my config to MCF 2.0.2 without migrating all
>>>> crawled data. That may be the reason why I have documents in Solr that
>>>> lead to 404s.
>>>>
>>>> Clearing my Solr index and resetting the crawler may help solve my
>>>> problem.
>>>>
>>>> On the other hand, some of the pages I am crawling display friendly
>>>> messages such as "The document you are looking for has expired" with a
>>>> 200 HTTP status instead of a 404.
>>>> How feasible would it be to exclude a document from the index based on
>>>> the content of the document?
>>>>
>>>> Thank you very much.
>>>>
>>>> Arcadius.
>>>>
>>>> On 28 April 2015 at 12:18, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Arcadius,
>>>>>
>>>>> So, to be clear, the repository connection you are using is a web
>>>>> connection type?
>>>>>
>>>>> The web connector has the following code, which should prevent
>>>>> indexing of any content that was received with a response code other
>>>>> than 200:
>>>>>
>>>>> int responseCode = cache.getResponseCode(documentIdentifier);
>>>>> if (responseCode != 200)
>>>>> {
>>>>>   if (Logging.connectors.isDebugEnabled())
>>>>>     Logging.connectors.debug("Web: For document '"+documentIdentifier+
>>>>>       "', not indexing because response code not indexable: "+responseCode);
>>>>>   errorCode = "RESPONSECODENOTINDEXABLE";
>>>>>   errorDesc = "HTTP response code not indexable ("+responseCode+")";
>>>>>   activities.noDocument(documentIdentifier,versionString);
>>>>>   return;
>>>>> }
>>>>>
>>>>> You should indeed see these cases logged in the simple history and no
>>>>> document sent to Solr. Is this not what you are seeing?
>>>>>
>>>>> Karl
>>>>>
>>>>> On Tue, Apr 28, 2015 at 7:01 AM, Arcadius Ahouansou <[email protected]> wrote:
>>>>>
>>>>>> Hello.
>>>>>>
>>>>>> I am using MCF 2.0.2 for crawling the web and ingesting data into
>>>>>> Solr.
>>>>>>
>>>>>> MCF has ingested into Solr documents that returned an HTTP error,
>>>>>> let's say 401, 403, or 404, or that have certain content like "this
>>>>>> page has expired and has been removed".
>>>>>>
>>>>>> The question is: is there a way to tell MCF to ingest
>>>>>> - only documents not containing certain content like "Not Found", or
>>>>>> - only documents excluding those with a 401, 403, 404, 500, ... header?
>>>>>>
>>>>>> Thank you very much.
>>>>>>
>>>>>> Arcadius.
>>>>
>>>> --
>>>> Arcadius Ahouansou
>>>> Menelic Ltd | Information is Power
>>>> M: 07908761999
>>>> W: www.menelic.com
>>>> ---
>>
>> --
>> Arcadius Ahouansou
>> Menelic Ltd | Information is Power
>> M: 07908761999
>> W: www.menelic.com
>> ---
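[To make Karl's constraint concrete: a sketch of how the dictionary could be applied one line at a time, so the document never has to be accumulated in memory. This is illustrative only; LineByLineMatcher and containsAnyKeyword are hypothetical names, and, like the current regexp matching, the approach misses any phrase that happens to span a line break.]

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.Reader;
  import org.ahocorasick.trie.Trie;

  public class LineByLineMatcher
  {
    // Streams the extracted page text and tests each line against the
    // dictionary trie, exiting on the first line that contains a match
    // rather than scanning the rest of the document.
    public static boolean containsAnyKeyword(Reader extractedText, Trie dictionary)
      throws IOException
    {
      BufferedReader reader = new BufferedReader(extractedText);
      String line;
      while ((line = reader.readLine()) != null)
      {
        if (!dictionary.parseText(line).isEmpty())
          return true;
      }
      return false;
    }
  }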
