Thanks Karl. I will comment on it ASAP.

On 30 April 2015 at 08:26, Karl Wright <[email protected]> wrote:
I've created a ticket to continue the discussion about whether we want such a feature and, if so, what it should look like: CONNECTORS-1193.

Karl


On Wed, Apr 29, 2015 at 7:28 PM, Karl Wright <[email protected]> wrote:

Hi Arcadius,

The key question is, how big do you expect the dictionary to become?

The current algorithm for finding content matches, used to determine whether a page is part of a login sequence, applies regexps on a line-by-line basis. This is not ideal because there is no guarantee that the text will have line breaks, and so it might have to accumulate the entire document in memory, which is obviously very bad.

Content matching is currently done within the confines of HTML; the HTML is parsed and only the content portions are matched. Tags are not checked. If the Aho-Corasick algorithm is used, it would need to be done the same way: one line at a time only.

Karl


On Wed, Apr 29, 2015 at 7:02 PM, Arcadius Ahouansou <[email protected]> wrote:

Hello Karl.

I agree, this would be slower than the usual filtering by URL or HTTP header.

On the other hand, this would be a very useful feature: it could be used to remove documents containing swear words from the index, remove adult content, discard emails flagged as spam, etc.

Regarding the implementation: so far in MCF, regexes have been used for pattern matching. In the case of content filtering, the user would supply a kind of "dictionary" that we would use to determine whether the document goes through or not. The dictionary can grow quite a bit.

The other alternative to regexes may be the Aho-Corasick string matching algorithm. A Java implementation can be found at https://github.com/robert-bor/aho-corasick

Let's say my dictionary has two entries, "expired" and "not found". The algorithm will return either "expired", "not found", or both, depending on what it found in the document. This output could be used to decide whether to index the document or not.

In this specific case, where we only want to exclude content from the index, we could exit on the first match, i.e. there is no need to match the whole dictionary. There is a pull request dealing with that: https://github.com/robert-bor/aho-corasick/pull/14

Thanks.

Arcadius.


On 29 April 2015 at 22:50, Karl Wright <[email protected]> wrote:

Hi Arcadius,

A feature like this is possible but could be very slow, since there's no definite limit on the size of an HTML page.

Karl


On Wed, Apr 29, 2015 at 5:01 PM, Arcadius Ahouansou <[email protected]> wrote:

Hello Karl.

I have checked the Simple History and I could see deletions.

I have recently migrated my config to MCF 2.0.2 without migrating all crawled data. That may be the reason why I have documents in Solr that lead to 404.

Clearing my Solr index and resetting the crawler may help solve my problem.

On the other hand, some of the pages I am crawling display friendly messages such as "The document you are looking for has expired" with a 200 HTTP status instead of a 404. How feasible would it be to exclude documents from the index based on the content of the document?

Thank you very much.

Arcadius.
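For illustration only, here is a minimal sketch of the dictionary matching proposed earlier in the thread, assuming the builder API of the robert-bor/aho-corasick library linked above; the keyword list and class name are made up for the example and are not MCF code:

    // Hypothetical example: match a small dictionary against document text
    // using the org.ahocorasick library referenced in the thread.
    import org.ahocorasick.trie.Emit;
    import org.ahocorasick.trie.Trie;

    import java.util.Collection;

    public class DictionaryMatchSketch
    {
      public static void main(String[] args)
      {
        // Build the dictionary once; it may hold many entries.
        Trie dictionary = Trie.builder()
            .addKeyword("expired")
            .addKeyword("not found")
            .build();

        String documentText = "The document you are looking for has expired";

        // parseText returns every dictionary entry found in the text.
        Collection<Emit> matches = dictionary.parseText(documentText);

        // For pure exclusion, any hit is enough to decide not to index;
        // the pull request mentioned above would allow stopping at the first hit.
        System.out.println(matches.isEmpty() ? "index document" : "skip document");
      }
    }

For case-insensitive matching of phrases like "Not Found", the document text and the dictionary entries would need to be normalised to the same case first.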
On 28 April 2015 at 12:18, Karl Wright <[email protected]> wrote:

Hi Arcadius,

So, to be clear, the repository connection you are using is a web connection type?

The web connector has the following code, which should prevent indexing of any content that was received with a response code other than 200:

    int responseCode = cache.getResponseCode(documentIdentifier);
    if (responseCode != 200)
    {
      if (Logging.connectors.isDebugEnabled())
        Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not indexing because response code not indexable: "+responseCode);
      errorCode = "RESPONSECODENOTINDEXABLE";
      errorDesc = "HTTP response code not indexable ("+responseCode+")";
      activities.noDocument(documentIdentifier,versionString);
      return;
    }

You should indeed see these cases logged in the Simple History, and no document sent to Solr. Is this not what you are seeing?

Karl


On Tue, Apr 28, 2015 at 7:01 AM, Arcadius Ahouansou <[email protected]> wrote:

Hello.

I am using MCF 2.0.2 for crawling the web and ingesting data into Solr.

MCF has ingested into Solr documents that returned an HTTP error (let's say 401, 403, or 404) or that contain certain content like "this page has expired and has been removed".

The question is: is there a way to tell MCF to ingest
- only documents not containing certain content like "Not Found", or
- only documents excluding those with response codes 401, 403, 404, 500, ...?

Thank you very much.

Arcadius.


--
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---
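As a rough, standalone illustration of the two exclusions asked about in the original question above, here is a sketch that skips selected HTTP status codes and "soft error" phrases served with a 200 status. None of this is existing MCF code; all names are hypothetical:

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical helper, not part of ManifoldCF: decide whether a fetched
    // page should be passed on for indexing.
    public class IndexabilityCheckSketch
    {
      // Status codes from the original question; the web connector quoted
      // above actually skips anything other than 200.
      private static final List<Integer> SKIPPED_CODES = Arrays.asList(401, 403, 404, 500);
      private static final List<String> SKIPPED_PHRASES =
          Arrays.asList("not found", "has expired");

      static boolean shouldIndex(int responseCode, String body)
      {
        if (SKIPPED_CODES.contains(responseCode))
          return false;
        // Content check for error pages that are served with a 200 status.
        String text = body.toLowerCase();
        for (String phrase : SKIPPED_PHRASES)
        {
          if (text.contains(phrase))
            return false;
        }
        return true;
      }

      public static void main(String[] args)
      {
        // An "expired" page returned with a 200 status would be excluded (prints false).
        System.out.println(shouldIndex(200, "The document you are looking for has expired"));
      }
    }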
