Thanks Karl,

Maybe some documents became unreachable while I was trying to reproduce some problems I had with this host a few months ago. But our test environment also crawls 50% more documents for other jobs, which might be due to unreachable documents as well.

What is the best approach to tell MCF that all documents should be processed again? Manually delete some tables from the database?

Erlend

On 8/12/13 1:31 PM, Karl Wright wrote:
Hi Erlend,

If any link in the chain from the seed to the document is broken, a
document reachable on a previous crawl can become unreachable and thus
report "Hop count exceeded".  In this case, the document must have been
queued somehow - or must have been present from a previous crawl.

So, for example, suppose you have this chain:

A->B->C

... and then all of a sudden, B cannot be fetched.  Then, C will report
that its hopcount is exceeded.
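
A minimal sketch of this effect, assuming a simplified breadth-first
model of hop counting (this is not MCF's actual implementation; the
names hop_counts, in_scope and fetchable are hypothetical and only
serve the illustration):

```python
from collections import deque


def hop_counts(links, seeds, fetchable):
    """Return the minimum link hop count from any seed to each discovered document.

    links: dict mapping a document to the documents it links to.
    fetchable: set of documents that can actually be fetched. Links are
    only discovered from documents that were fetched successfully.
    """
    counts = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        doc = queue.popleft()
        if doc not in fetchable:
            # The document could not be read, so the chain stops here.
            continue
        for target in links.get(doc, []):
            if target not in counts:
                counts[target] = counts[doc] + 1
                queue.append(target)
    return counts


def in_scope(doc, counts, max_hops):
    # A document with no path from a seed has no finite hop count,
    # so this model treats it as exceeding the limit.
    return doc in counts and counts[doc] <= max_hops


# The chain A->B->C from the example above.
links = {"A": ["B"], "B": ["C"]}

before = hop_counts(links, ["A"], fetchable={"A", "B", "C"})
after = hop_counts(links, ["A"], fetchable={"A", "C"})  # B cannot be fetched

print(in_scope("C", before, max_hops=6))
print(in_scope("C", after, max_hops=6))
```

In this model, C is in scope on the first crawl and out of scope on
the second, even though the hop limit of 6 never changed. The only
difference is that B's links could not be read.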

Based on your report that the test environment works OK, and the
production environment does not, I expect there is something like this
going on.  I know you attempted to fetch the intervening document from
your test environment, but it is conceivable that the production
environment is unable to get it.  You should see evidence of that in the
simple history, if so.

I can try a sample crawl from home tonight if you like, and we can see
whether I get the reduced set or the complete one.  However, bear in
mind that hopcount is one of MCF's most rigorously tested features, so I
personally doubt there is a problem with the hopcount logic per se.

Thanks,
Karl



On Mon, Aug 12, 2013 at 6:39 AM, Erlend Garåsen wrote:


    I have discovered an odd thing regarding hop counts. Our prod
    environment crawls a lot fewer documents compared to our test
    environment even though the configuration is exactly the same. Then
    I figured out that several documents which are expected to be
    fetched are, according to MCF, outside the hop count limit, but
    they're not.

    This can be reproduced by using a small job for one particular host,
    www.ibsen.uio.no. The seed list is as follows:

    http://www.ibsen.uio.no/

    Hop filter settings are:
    link: 6
    redirect: 3

    Only these two documents are fetched:
    http://www.ibsen.uio.no/forside.xhtml
    http://www.ibsen.uio.no/

    Here's what MCF says about one omitted document, i.e.,
    http://www.ibsen.uio.no/skuespill.xhtml:
    State: out of scope
    Status: Hopcount exceeded

    This is odd. If you open up www.ibsen.uio.no, you can see that the
    link "http://www.ibsen.uio.no/skuespill.xhtml" (Skuespill) appears
    on the main page.

    Our test environment fetches this document without problems.

    Erlend


