I couldn't find more information in the log after the upgrade. Yes, I'm running version 1.3 now since I had to log in after the upgrade:
http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
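
A rough way to pull the link-discovery entries out of a log like this, so they can be compared with what was actually fetched. This is only a sketch: the pattern is modeled on the single DEBUG entry quoted further down in this thread, and the function names are my own, not anything ManifoldCF provides.

```python
import re

# Pattern modeled on the one DEBUG entry quoted in this thread; the wording of
# other ManifoldCF log lines is not assumed here.
LINK_RE = re.compile(
    r"WEB: In html document '(?P<source>[^']+)', found link to '(?P<target>[^']+)'"
)


def discovered_links(lines):
    """Yield (source, target) pairs for every link-discovery entry."""
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            yield match.group("source"), match.group("target")


def links_by_source(lines):
    """Group discovered targets under the document they were found in."""
    grouped = {}
    for source, target in discovered_links(lines):
        grouped.setdefault(source, []).append(target)
    return grouped
```

Both functions accept any iterable of lines, so an open file handle on a local copy of the log can be passed in directly.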

I tried to fetch one of the missing documents using curl from our prod server. The response looks OK to me, even though this is curl and not HttpClient:

-bash-3.2$ curl -vvv -H "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; [email protected])" "http://www.ibsen.uio.no/sakprosa.xhtml";
* About to connect() to www.ibsen.uio.no port 80
*   Trying 129.240.7.27... connected
* Connected to www.ibsen.uio.no (129.240.7.27) port 80
> GET /sakprosa.xhtml HTTP/1.1
> Host: www.ibsen.uio.no
> Accept: */*
> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; [email protected])
>
< HTTP/1.1 200 OK
< Date: Tue, 13 Aug 2013 12:40:02 GMT
< Server: Apache-Coyote/1.1
< X-Cocoon-Version: 2.1.12-dev
< Last-Modified: Fri, 09 Aug 2013 09:57:43 GMT
< Content-Type: text/html
< Content-Length: 11209
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
[...]
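
The same check can be scripted, which makes it easier to repeat for each missing URL. A minimal sketch using only the Python standard library; the agent string is a placeholder (the contact address in the real one is redacted above), and `build_request` and `summarize` are names made up for this example.

```python
import urllib.request

# Placeholder agent string; substitute the one the crawler is configured with.
USER_AGENT = "Mozilla/5.0 (ApacheManifoldCFWebCrawler; contact address here)"


def build_request(url, user_agent=USER_AGENT):
    """Build a GET request that sends the crawler-style User-Agent."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "Accept": "*/*"},
        method="GET",
    )


def summarize(url, timeout=30):
    """Fetch the URL and return the fields compared in the curl output."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        body = resp.read()
        return {
            "status": resp.status,
            "content_type": resp.headers.get("Content-Type"),
            "declared_length": resp.headers.get("Content-Length"),
            "actual_length": len(body),
        }
```

Calling `summarize("http://www.ibsen.uio.no/sakprosa.xhtml")` performs the request. Note that `urlopen` follows redirects on its own, so unlike the curl call above it reports the final response rather than an intermediate 302.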

E

On 8/13/13 12:04 PM, Karl Wright wrote:
If this is still 1.2, then these were the unlogged reasons why a
document could be skipped:

(1) Length too long
(2) Output connector rejects mime type
(3) Output connector rejects url
(4) Document is not considered indexable according to the job
constraints (the "indexable" regular expressions)
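
To make the four cases concrete, here is an illustrative sketch of such a check. It is not ManifoldCF code: every name and parameter is hypothetical, and the order of the tests simply follows the list above rather than the framework's real evaluation order.

```python
import re
from typing import Optional


def skip_reason(url, mime_type, length, *, max_length, accepted_mime_types,
                url_accepted, indexable_patterns) -> Optional[str]:
    """Return the first reason a document would be skipped, or None.

    All parameters are hypothetical stand-ins for job and output
    connector settings.
    """
    if length > max_length:
        return "length too long"
    if mime_type not in accepted_mime_types:
        return "output connector rejects mime type"
    if not url_accepted(url):
        return "output connector rejects url"
    # A document counts as indexable if any "indexable" regex matches its URL.
    if not any(re.search(pattern, url) for pattern in indexable_patterns):
        return "not indexable according to job constraints"
    return None
```

Running a missing URL through a check like this, with the job's real limits filled in, shows which of the four conditions would apply.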

Karl



On Tue, Aug 13, 2013 at 5:56 AM, Karl Wright <[email protected]> wrote:

    What version of ManifoldCF is this?

    I ask because I updated the logging output in 1.3 to capture a
    number of cases that previously did not log a reason why they were
    skipped.

    Karl



    On Tue, Aug 13, 2013 at 5:27 AM, Erlend Garåsen <[email protected]> wrote:


        OK, I have now changed the log level from INFO to DEBUG for
        connectors as well. Here's the log:
        http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log

        The following entry indicates that one of the missing URLs is
        found/extracted from a link:
        DEBUG 2013-08-13 10:58:48,630 (Worker thread '9') - WEB: In html document 'http://www.ibsen.uio.no/forside.xhtml', found link to 'http://www.ibsen.uio.no/skuespill.xhtml'

        Then the job just ends, and none of the extracted links are ever
        fetched.

        Erlend


        On 8/12/13 5:15 PM, Erlend Garåsen wrote:


            Thanks, I will do that tomorrow and report back. I hope we will
            find a simple explanation. :)

            E

            On 8/12/13 5:07 PM, Karl Wright wrote:

                Hi Erlend,

                You have wire logging (httpclient) enabled, which is useful for
                debugging fetch issues, but you do not have connector debugging
                on. To turn it on, add this to properties.xml:

                <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>

                thanks,
                Karl


                On Mon, Aug 12, 2013 at 10:53 AM, Erlend Garåsen <[email protected]> wrote:

                     On 8/12/13 4:29 PM, Karl Wright wrote:

                         Hi Erlend,

                         The Document Status report shows these documents
                         because they are still in the queue. There could be
                         several reasons for this. Documents that exceed the
                         hopcount by 1 level are allowed to remain in the queue
                         for bookkeeping purposes. The "scheduled date" as given
                         is only meaningful if the document is in an active
                         state; my guess is that these documents are not in
                         fact in that state, but rather in the state
                         HOPCOUNT_EXCEEDED. Can you include one complete row
                         from the Document Status report for one of the missing
                         documents?


                     For "http://www.ibsen.uio.no/sakprosa.xhtml":
                     Job: Ibsen

                     State: Out of scope
                     Status: Hopcount exceeded
                     Scheduled: 01-01-1970 01:00:00.000
                     Scheduled action: Process
                     Retry count: N/A
                     Retry limit: N/A


                         When you added documents to the seed list, what did
                         the Simple History say when they were fetched? If they
                         don't appear in the Simple History, they SHOULD
                         nevertheless have appeared in the log, with an
                         explanation of why they were excluded, provided you
                         have connector debugging enabled.


                     OK, here is the seed list:
                     http://www.ibsen.uio.no/
                     http://www.ibsen.uio.no/skuespill.xhtml
                     http://www.ibsen.uio.no/dikt.xhtml
                     http://www.ibsen.uio.no/brev.xhtml
                     http://www.ibsen.uio.no/sakprosa.xhtml
                     http://www.ibsen.uio.no/varia.xhtml
                     http://www.ibsen.uio.no/undervisningsressurser.xhtml

                     Here are the results from Simple History:
                     08-12-2013 16:46:26.536  job end                 1368534065016(Ibsen)                         0      1
                     08-12-2013 16:46:09.927  document ingest (Solr)  http://www.ibsen.uio.no/forside.xhtml  OK    11897  178
                     08-12-2013 16:46:09.751  fetch                   http://www.ibsen.uio.no/forside.xhtml  200   11897  17
                     08-12-2013 16:44:48.829  fetch                   http://www.ibsen.uio.no/               302   0      79484
                     08-12-2013 16:44:48.727  robots parse            www.ibsen.uio.no:80                    HTML  0      2      Robots file contained HTML, skipped
                     08-12-2013 16:44:46.574  job start               1368534065016(Ibsen)                         0      1
                              1

                     HttpClient log:
                     http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log

                     Erlend






