What version of ManifoldCF is this? I ask because I updated the logging output in 1.3 to capture a number of cases that previously did not log a reason why they were skipped.
Karl

On Tue, Aug 13, 2013 at 5:27 AM, Erlend Garåsen <[email protected]> wrote:

> OK, I have now changed the log level from INFO to DEBUG for connectors as
> well. Here's the log:
> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>
> The following entry indicates that one of the missing URLs is
> found/extracted from a link:
> DEBUG 2013-08-13 10:58:48,630 (Worker thread '9') - WEB: In html document
> 'http://www.ibsen.uio.no/forside.xhtml', found link to
> 'http://www.ibsen.uio.no/skuespill.xhtml'
>
> Then the job just ends and all the extracted links were never fetched.
>
> Erlend
>
>
> On 8/12/13 5:15 PM, Erlend Garåsen wrote:
>
>> Thanks, I will tomorrow and report thereafter. I hope we will find a
>> simple explanation. :)
>>
>> E
>>
>> On 8/12/13 5:07 PM, Karl Wright wrote:
>>
>>> Hi Erlend,
>>>
>>> You have wire logging (httpclient) enabled, which is useful for
>>> debugging fetch issues, but you do not have connector debugging on. To
>>> turn it on, add this to properties.xml:
>>>
>>> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>>>
>>> thanks,
>>> Karl
>>>
>>>
>>> On Mon, Aug 12, 2013 at 10:53 AM, Erlend Garåsen <[email protected]> wrote:
>>>
>>>     On 8/12/13 4:29 PM, Karl Wright wrote:
>>>
>>>         Hi Erlend,
>>>
>>>         The Document Status report shows these documents because they are
>>>         still in the queue. The reasons for this could be several. Documents
>>>         that exceed the hopcount by 1 level are allowed to remain in the
>>>         queue for bookkeeping purposes. "scheduled date" as given only
>>>         meaningful if the document is in an active state; my guess is that
>>>         these documents are not in fact in that state, but rather in the
>>>         state HOPCOUNT_EXCEEDED. Can you include one complete row from the
>>>         Document Status report for one of the missing documents?
>>>
>>>
>>>     For "http://www.ibsen.uio.no/sakprosa.xhtml":
>>>     Job: Ibsen
>>>     State: Out of scope
>>>     Status: Hopcount exceeded
>>>     Scheduled: 01-01-1970 01:00:00.000
>>>     Scheduled action: Process
>>>     Retry count: N/A
>>>     Retry limit: N/A
>>>
>>>
>>>         When you added documents to the seed list, what did the Simple
>>>         History say when they were fetched? If they don't appear in the
>>>         simple history, they SHOULD have nevertheless appeared in the log,
>>>         with an explanation of why they were excluded, provided you have
>>>         connector debugging enabled.
>>>
>>>
>>>     OK, here is the seed list:
>>>     http://www.ibsen.uio.no/
>>>     http://www.ibsen.uio.no/skuespill.xhtml
>>>     http://www.ibsen.uio.no/dikt.xhtml
>>>     http://www.ibsen.uio.no/brev.xhtml
>>>     http://www.ibsen.uio.no/sakprosa.xhtml
>>>     http://www.ibsen.uio.no/varia.xhtml
>>>     http://www.ibsen.uio.no/undervisningsressurser.xhtml
>>>
>>>     Here is the results from simple history:
>>>     08-12-2013 16:46:26.536  job end  1368534065016(Ibsen)  0  1
>>>     08-12-2013 16:46:09.927  document ingest (Solr)  http://www.ibsen.uio.no/forside.xhtml  OK  11897  178
>>>     08-12-2013 16:46:09.751  fetch  http://www.ibsen.uio.no/forside.xhtml  200  11897  17
>>>     08-12-2013 16:44:48.829  fetch  http://www.ibsen.uio.no/  302  0  79484
>>>     08-12-2013 16:44:48.727  robots parse  www.ibsen.uio.no:80  HTML  0  2  Robots file contained HTML, skipped
>>>     08-12-2013 16:44:46.574  job start  1368534065016(Ibsen)  0  1
>>>     1
>>>
>>>     HttpClient log:
>>>     http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>
>>>     Erlend
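
The connector DEBUG entry quoted in this thread has a regular shape, so the links the WEB connector extracted can be pulled out of manifoldcf.log and compared by hand against the Simple History. The following is only a rough sketch: the pattern is inferred from the single log entry shown in the thread, and the names LINK_RE and extracted_links are hypothetical helpers, not part of ManifoldCF. Other log messages may be formatted differently.

```python
import re

# Pattern inferred from the one DEBUG entry quoted in the thread (an assumption):
#   ... - WEB: In html document '<parent>', found link to '<child>'
LINK_RE = re.compile(
    r"WEB: In html document\s+'\s*(?P<parent>[^']+?)\s*',"
    r"\s+found link to\s+'\s*(?P<child>[^']+?)\s*'"
)


def extracted_links(log_text):
    """Return (parent_url, child_url) pairs for every matching log entry."""
    return [(m.group("parent"), m.group("child"))
            for m in LINK_RE.finditer(log_text)]


# The entry from the thread, used here as sample input.
sample = (
    "DEBUG 2013-08-13 10:58:48,630 (Worker thread '9') - WEB: In html document "
    "'http://www.ibsen.uio.no/forside.xhtml', found link to "
    "'http://www.ibsen.uio.no/skuespill.xhtml'"
)

for parent, child in extracted_links(sample):
    print(parent, "->", child)
```

To run it against a real log, read the file into a string and pass it to extracted_links; any child URL that never shows up as a fetch in the Simple History is a candidate for the hopcount question raised above.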
