Hello, Mr. Karl, Mr. Issei, and community members. I have a similar issue to Mr. Issei's.
Here is a sample website structure that I want to crawl with MCF:

  index.html --link to--> sample1.html --link to--> sample2.html

I made this sample website to explore the behavior of "Hop count mode."
The settings of the 1st run are as follows:

  Repository: Web
  Output: File System
  Job - Hop count filters:
  - Maximum hop count for type 'link' = empty
  - Maximum hop count for type 'redirect' = empty
  - Hop count mode: "Keep unreachable documents, forever"

As far as I can see in the Simple History report, MCF does:

  1. job start
  2. fetch, process, ingest for index.html
  3. fetch, process, ingest for sample1.html
  4. fetch, process, ingest for sample2.html
  5. job end

The actual output directory is below:

  $ ls -la
  total 24
  drwxr-xr-x 5 kaya staff 160 Nov 13 15:51 ./
  drwxr-xr-x 3 kaya staff  96 Nov 13 15:51 ../
  -rw-r--r-- 1 kaya staff 152 Nov 13 15:51 .1
  -rw-r--r-- 1 kaya staff 193 Nov 13 15:51 sample1.html
  -rw-r--r-- 1 kaya staff 144 Nov 13 15:51 sample2.html

The above sequence of activities and its corresponding results make complete sense to me. (I just wanted to provide some background.)

We then decided to delete sample2.html, so our sample website structure is now:

  index.html --> sample1.html

Note:
- index.html is not edited, so I expect it is not re-crawled.
- sample1.html is edited (I deleted a tag) and still exists, so I expect it to be crawled.
- sample2.html is deleted and no longer exists, so I expect it not to be crawled.

I did not change any MCF settings from the 1st run. Again, checking the Simple History report, MCF does:

  1. job start
  2. fetch for index.html
  3. fetch, process, ingest for sample1.html
  4. document deletion for sample2.html

The actual output directory now looks like this:

  $ ls -la
  total 16
  drwxr-xr-x 5 kaya staff 160 Nov 13 15:51 ./
  drwxr-xr-x 3 kaya staff  96 Nov 13 15:51 ../
  -rw-r--r-- 1 kaya staff 152 Nov 13 15:51 .1
  -rw-r--r-- 1 kaya staff 181 Nov 13 16:28 sample1.html
  -rw-r--r-- 1 kaya staff   0 Nov 13 16:28 sample2.html

What I want is to "keep" sample2.html with its contents. And here my two questions come in:

1. How can I keep sample2.html with its contents, i.e. keep sample2.html at 144 bytes?
2. As far as I can see from the result, MCF does not "keep the unreachable document" (in this case, sample2.html). What is the "keep unreachable documents" setting actually preserving?

You may think, "Deleting sample2.html means it is no longer accessible to anyone, so it should not be a problem if MCF also deletes its content." I mostly agree, but I am exploring each hop count mode for now, and I just want to save unreachable documents locally.

I would really appreciate any input on the matter.

Sincerely,
Kaya Ota

On Sat, Nov 9, 2019 at 0:46, Karl Wright <[email protected]> wrote:

> 'The reason I am asking these questions is that a document was deleted that I
> thought was not going to be.'
>
> This would only happen if you had "Delete unreachable documents" as the
> selection. Otherwise it would not happen.
>
> It sounds to me like you just want to disable hop count filters entirely.
> In that case, leave the hop count filter value empty, and if you are sure
> about this being the way you want to run the job forever, choose the "keep
> unreachable documents forever" selection.
>
> I am not in a position (I do not have the time) to work out detailed
> examples. You can explore the behavior on your own if you so desire.
> Karl
>
>
> On Fri, Nov 8, 2019 at 10:10 AM Issei Nishigata <[email protected]>
> wrote:
>
>> Hi Karl,
>>
>> Thank you for a quick response.
>>
>> It seems that I have completely misunderstood the specifications, so it
>> would be helpful if you could show specific examples for each hop count mode.
>>
>> Is my understanding below correct?
>> - "Keep unreachable documents, for now" and "... forever" are settings
>> that do not delete documents from the index that were not crawled.
>> - Hop count dependency information is like a cache of the link structure.
>> This link structure is not recreated in "keep unreachable documents
>> forever" mode, so crawling is faster.
>>
>> The reason I am asking these questions is that a document was deleted that I
>> thought was not going to be.
>> Is there any way to prevent that deletion? What does it "keep" in "keep
>> unreachable documents"?
>>
>>
>> Sincerely,
>> Issei Nishigata
>>
>>
>>
>> On 2019/11/08 2:19, Karl Wright wrote:
>> > Hi Issei,
>> >
>> > The setting of "Keep unreachable documents forever" basically means
>> > that no hop count dependency information is kept around for any crawls done
>> > when that setting is in place. That means that when links change or
>> > documents change, the system does not know how to recompute the hop count
>> > accurately. This setting is appropriate if you want your crawl to be
>> > as fast as possible and do not expect ever to use hop count filtering for
>> > the job in question.
>> >
>> > The "keep unreachable documents for now" setting means that enough information
>> > is kept around that if you decided to put a hop count filter into place
>> > later, it would still work properly.
>> >
>> > Hope that helps.
>> >
>> > Karl
>> >
>> >
>> > On Thu, Nov 7, 2019 at 11:01 AM Issei Nishigata <[email protected]> wrote:
>> >
>> > Hi All,
>> >
>> >
>> > I use MCF 2.12, and I am confused about the specification of the
>> > hop filter setting "Keep unreachable documents".
>> >
>> > I understand that "Keep unreachable documents, for now" and
>> > "Keep unreachable documents, forever" in the hop filter
>> > are effective settings when specifying a hop count.
>> >
>> > For example: crawl all data with an empty hop count value the first
>> > time, and the second time put 0 in the hop count value with "Keep
>> > unreachable documents, for now". Then only the first layer of the
>> > directory will be crawled, and the second and deeper layers, which
>> > are not crawled, will not be deleted from the index.
>> >
>> > However, when I actually run with the above settings, documents on the
>> > second layer are deleted from the index on the second run and afterwards.
>> > It works the same way when using "Keep unreachable documents, forever".
>> >
>> > Is there anything wrong with my understanding? And does anyone know
>> > the difference between these two settings,
>> > "Keep unreachable documents, for now" and "Keep unreachable
>> > documents, forever"?
>> >
>> > If anyone knows the specs of these settings, it would be very
>> > helpful if you could share your advice.
>> > Any clue will be much appreciated.
>> >
>> >
>> > Sincerely,
>> > Issei Nishigata
>> >
>>
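To make the hop count discussion above concrete, here is a small Python sketch. It is not MCF code, and the graph, seed, and function names are illustrative assumptions; it simply models how hop counts can be computed by breadth-first search over a link graph, and which documents a maximum-hop-count filter would treat as "unreachable" even though they still exist on the site:

```python
from collections import deque

def hop_counts(links, seeds):
    """Compute the minimum number of 'link' hops from any seed to
    each reachable document, via breadth-first search."""
    counts = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        doc = queue.popleft()
        for target in links.get(doc, []):
            if target not in counts:
                counts[target] = counts[doc] + 1
                queue.append(target)
    return counts

# The sample site from the first crawl:
#   index.html -> sample1.html -> sample2.html
links = {
    "index.html": ["sample1.html"],
    "sample1.html": ["sample2.html"],
}

counts = hop_counts(links, seeds=["index.html"])
# {"index.html": 0, "sample1.html": 1, "sample2.html": 2}

# With a maximum hop count of 0 for type 'link', only documents at
# hop count 0 pass the filter; the rest are "unreachable" in the
# hop-count sense, even though they still exist on the site.
max_hops = 0
unreachable = {doc for doc, c in counts.items() if c > max_hops}
# {"sample1.html", "sample2.html"}
```

Note that this models only hop-count reachability; a document that is deleted from the site itself, as with sample2.html in the second crawl above, is a different case from a document that merely fails a hop count filter.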
