I finally found the missing documents in the simple history by going further back in time. They were deleted from Solr in May, which seems to indicate that they are being excluded for some reason I haven't found yet.

The scheduled date from "document status" seems odd as well:
01-01-1970 01:00:00.000

This date shows up for all the missing documents. Could this be the source of the problem?
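
For what it's worth, that value looks like an epoch timestamp of 0 (an unset or zero "scheduled" time) rendered in local time. This is only a sketch to check the arithmetic; it assumes the UI shows times in UTC+1 and formats them as day-month-year with milliseconds. The helper name format_epoch_millis is made up for this illustration and is not part of MCF.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the status page renders timestamps in UTC+1 (CET).
CET = timezone(timedelta(hours=1))


def format_epoch_millis(millis, tz=CET):
    """Format milliseconds since the Unix epoch as dd-mm-yyyy HH:MM:SS.mmm."""
    dt = datetime.fromtimestamp(millis / 1000, tz=tz)
    return dt.strftime("%d-%m-%Y %H:%M:%S.") + f"{dt.microsecond // 1000:03d}"


# A timestamp of 0 gives exactly the odd date shown above.
print(format_epoch_millis(0))  # 01-01-1970 01:00:00.000
```

If that reading is right, the date itself carries no real scheduling information; it only says that the stored value is zero.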

I changed the log level for HttpClient to DEBUG just in case. I found no network or other problems. The missing documents are simply not being fetched:
http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log

If MCF were trying to refetch unavailable documents, we would expect to see entries about these hosts in manifoldcf.log. The only entries are for the two documents I mentioned previously:
http://www.ibsen.uio.no/
http://www.ibsen.uio.no/forside.xhtml
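
A rough way to repeat this check is to pull every distinct URL for the host out of the log. This is a minimal sketch that assumes nothing about the log layout beyond URLs appearing as plain text; the function name urls_for_host and the sample lines are invented for illustration and are not taken from the real manifoldcf.log.

```python
import re


def urls_for_host(log_lines, host):
    """Return the sorted, distinct URLs on the given host found in the lines."""
    # Match the host exactly, then everything up to whitespace or a quote.
    pattern = re.compile(
        r"https?://" + re.escape(host) + r"(?![\w.-])[^\s\"'<>]*"
    )
    found = set()
    for line in log_lines:
        found.update(pattern.findall(line))
    return sorted(found)


# Invented sample lines, only to show the call.
sample = [
    "DEBUG fetching http://www.ibsen.uio.no/ now",
    "DEBUG fetching http://www.ibsen.uio.no/forside.xhtml now",
    "DEBUG fetching http://www.ibsen.uio.no/ again",
]
for url in urls_for_host(sample, "www.ibsen.uio.no"):
    print(url)
```

To run it against the real file, pass an open file object, for example urls_for_host(open("manifoldcf.log", encoding="utf-8", errors="replace"), "www.ibsen.uio.no").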

So there is no point in entering one document after another in the seed list. Well, I did, but it did not help. The first links that appear on the main page, which I tried to include, are:
http://www.ibsen.uio.no/skuespill.xhtml
http://www.ibsen.uio.no/dikt.xhtml
http://www.ibsen.uio.no/brev.xhtml
http://www.ibsen.uio.no/sakprosa.xhtml
http://www.ibsen.uio.no/varia.xhtml
http://www.ibsen.uio.no/undervisningsressurser.xhtml

Erlend

On 8/12/13 2:21 PM, Karl Wright wrote:
Hi Erlend,

I suggest you start with the seed document.  Did that get fetched?
Then, chase the path to the missing document.  Did those get fetched?
Stop with the FIRST document that did not get fetched, and see if you
can figure out why.

Thanks,
Karl



On Mon, Aug 12, 2013 at 8:16 AM, Erlend Garåsen wrote:

    On 8/12/13 1:31 PM, Karl Wright wrote:

        Based on your report that the test environment works OK, and the
        production environment does not, I expect there is something
        like this
        going on.  I know you attempted to fetch the intervening
        document from
        your test environment, but it is conceivable that the production
        environment is unable to get it.  You should see evidence of
        that in the
        simple history, if so.


    I have looked through the complete history regarding this host, and
    none of the other documents have ever been fetched. The only thing I
    can see is an illegal robots.txt file:
    robots parse    www.ibsen.uio.no:80    HTML    0    1    Robots file contained HTML, skipped

    I don't think this robots file has stopped MCF from crawling the
    other documents, since I can see the same entry in our test
    environment as well. I even tried to disable robots.txt checks, but
    the problems persist.

    I forgot to mention that the hopcount mode is "Keep unreachable
    documents, forever"

    So, if I understand you correctly, there is no point in hacking the
    database, since MCF will try to refetch unreachable documents anyway.
    I can of course enable HttpClient logging and check whether MCF
    tries to fetch these resources at all.

    Erlend


