Hi Arcadius,

The key question is, how big do you expect the dictionary to become?

The current algorithm for matching content, which is used to determine
whether a page is part of a login sequence, applies regexps on a
line-by-line basis.  This is not ideal, because there is no guarantee that
the text will contain line breaks, so in the worst case it would have to
accumulate the entire document in memory, which is obviously very bad.
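
Roughly, the existing line-by-line check looks like this (a simplified
sketch only, not the actual connector code; contentReader and
loginPagePattern are placeholder names):

  // Test each line of the document against the configured regexp.  If
  // the document contains no line breaks, readLine() hands back the
  // whole document as one "line", which is where the memory concern
  // comes from.
  BufferedReader br = new BufferedReader(contentReader);
  String line;
  while ((line = br.readLine()) != null)
  {
    if (loginPagePattern.matcher(line).find())
      return true;
  }
  return false;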

Content matching is currently done within the confines of HTML; the HTML is
parsed and only the content portions are matched.  Tags are not checked.
If the Aho-Corasick algorithm is used, it would need to be applied the same
way: one line at a time only.
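
To make the constraint concrete, here is a rough sketch of how the library
Arcadius linked could be driven one line at a time (this assumes the Trie
builder / parseText API shown in that project's README and is not connector
code; the dictionary handling is illustrative only):

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.Reader;

  import org.ahocorasick.trie.Trie;

  // Sketch: build the Aho-Corasick automaton once from the dictionary,
  // then feed it the extracted HTML content one line at a time, exactly
  // as the regexp-based matching is done today.
  public class LineDictionaryMatcher
  {
    protected final Trie trie;

    public LineDictionaryMatcher(Iterable<String> dictionary)
    {
      Trie.TrieBuilder builder = Trie.builder();
      for (String keyword : dictionary)
        builder.addKeyword(keyword.toLowerCase());
      trie = builder.build();
    }

    /** Return true as soon as any dictionary entry appears on any line. */
    public boolean matches(Reader contentReader)
      throws IOException
    {
      BufferedReader br = new BufferedReader(contentReader);
      String line;
      while ((line = br.readLine()) != null)
      {
        if (!trie.parseText(line.toLowerCase()).isEmpty())
          return true;
      }
      return false;
    }
  }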

Karl



On Wed, Apr 29, 2015 at 7:02 PM, Arcadius Ahouansou <[email protected]>
wrote:

>
> Hello Karl.
>
> I agree, this would be slower than the usual filtering by URL or HTTP
> header.
>
> On the other hand, this would be a very useful feature:
> it could be used to remove documents containing swear words from the
> index, remove adult content, discard emails flagged as spam, etc.
>
> Regarding the implementation:
> So far in MCF, regexes have been used for pattern matching.
> In the case of content filtering, the user will supply a kind of
> "dictionary" that we will use to determine whether the document goes
> through or not.
> The dictionary can grow quite a bit.
>
> The other alternative to regexes may be the Aho–Corasick string matching
> algorithm.
> A Java implementation can be found at
> https://github.com/robert-bor/aho-corasick
> Let's say my dictionary has two entries, "expired" and "not found".
> The algorithm will return either "expired", "not found", or both,
> depending on what it finds in the document.
> This output could be used to decide whether to index the document or not.
>
> In this specific case, where we only want to exclude content from the
> index, we could exit on the first match, i.e. there is no need to match
> the whole dictionary.
> There is a pull request dealing with that:
> https://github.com/robert-bor/aho-corasick/pull/14
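>
> Something like this (just a rough sketch, assuming the Trie builder /
> parseText API shown in that project's README; documentContent stands in
> for the extracted text of the page):
>
>   // Trie and Emit come from the org.ahocorasick.trie package of that
>   // library.  Build the automaton once from the dictionary entries.
>   Trie trie = Trie.builder()
>     .addKeyword("expired")
>     .addKeyword("not found")
>     .build();
>
>   // Find every dictionary entry present in the document.  Exiting on
>   // the first hit instead would need something like the pull request
>   // above.
>   Collection<Emit> emits = trie.parseText(documentContent.toLowerCase());
>   boolean excludeFromIndex = !emits.isEmpty();
>   for (Emit emit : emits)
>     System.out.println("Matched dictionary entry: "+emit.getKeyword());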
>
> Thanks.
>
> Arcadius.
>
> On 29 April 2015 at 22:50, Karl Wright <[email protected]> wrote:
>
>> Hi Arcadius,
>>
>> A feature like this is possible but could be very slow, since there's no
>> definite limit on the size of an HTML page.
>>
>> Karl
>>
>>
>> On Wed, Apr 29, 2015 at 5:01 PM, Arcadius Ahouansou <[email protected]
>> > wrote:
>>
>>>
>>> Hello Karl.
>>>
>>> I have checked the Simple History and I could see deletions.
>>>
>>> I have recently migrated my config to MCF 2.0.2 without migrating all
>>> the crawled data. That may be the reason why I have documents in Solr
>>> that lead to 404s.
>>>
>>> Clearing my Solr index and resetting the crawler may help solve my
>>> problem.
>>>
>>> On the other hand, some of the pages I am crawling display friendly
>>> messages such as "The document you are looking for has expired" with a
>>> 200 HTTP status instead of a 404.
>>> How feasible would it be to exclude documents from the index based on
>>> the content of the document?
>>>
>>> Thank you very much.
>>>
>>> Arcadius.
>>>
>>>
>>>
>>> On 28 April 2015 at 12:18, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Arcadius,
>>>>
>>>> So, to be clear, the repository connection you are using is a web
>>>> connection type?
>>>>
>>>> The web connector has the following code, which should prevent indexing
>>>> of any content that was received with a response code other than 200:
>>>>
>>>>       int responseCode = cache.getResponseCode(documentIdentifier);
>>>>       if (responseCode != 200)
>>>>       {
>>>>         if (Logging.connectors.isDebugEnabled())
>>>>           Logging.connectors.debug("Web: For document '"+documentIdentifier+"', not indexing because response code not indexable: "+responseCode);
>>>>         errorCode = "RESPONSECODENOTINDEXABLE";
>>>>         errorDesc = "HTTP response code not indexable ("+responseCode+")";
>>>>         activities.noDocument(documentIdentifier,versionString);
>>>>         return;
>>>>       }
>>>>
>>>>
>>>> You should indeed see these cases logged in the simple history and no
>>>> document sent to Solr.  Is this not what you are seeing?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Apr 28, 2015 at 7:01 AM, Arcadius Ahouansou <
>>>> [email protected]> wrote:
>>>>
>>>>>
>>>>> Hello.
>>>>>
>>>>> I am using MCF 2.0.2 for crawling the web and ingesting data into Solr.
>>>>>
>>>>> MCF has ingested into Solr documents that returned an HTTP error, let's
>>>>> say 401, 403, or 404, or that have certain content like "this page has
>>>>> expired and has been removed".
>>>>>
>>>>> The question is:
>>>>> is there a way to tell MCF to ingest
>>>>> - only documents not containing certain content like "Not Found", or
>>>>> - only documents excluding those with status codes 401, 403, 404, 500, ...
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> Arcadius.
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Arcadius Ahouansou
>>> Menelic Ltd | Information is Power
>>> M: 07908761999
>>> W: www.menelic.com
>>> ---
>>>
>>
>>
>
>
> --
> Arcadius Ahouansou
> Menelic Ltd | Information is Power
> M: 07908761999
> W: www.menelic.com
> ---
>
