If this is still 1.2, then these were the unlogged reasons why a document could be skipped:
(1) Length too long
(2) Output connector rejects mime type
(3) Output connector rejects url
(4) Document is not considered indexable according to the job constraints (the "indexable" regular expressions)

Karl


On Tue, Aug 13, 2013 at 5:56 AM, Karl Wright <[email protected]> wrote:

> What version of ManifoldCF is this?
>
> I ask because I updated the logging output in 1.3 to capture a number of
> cases that previously did not log a reason why they were skipped.
>
> Karl
>
>
> On Tue, Aug 13, 2013 at 5:27 AM, Erlend Garåsen <[email protected]> wrote:
>
>> OK, I have now changed the log level from INFO to DEBUG for connectors as
>> well. Here's the log:
>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>
>> The following entry indicates that one of the missing URLs is
>> found/extracted from a link:
>> DEBUG 2013-08-13 10:58:48,630 (Worker thread '9') - WEB: In html document
>> 'http://www.ibsen.uio.no/forside.xhtml', found link to
>> 'http://www.ibsen.uio.no/skuespill.xhtml'
>>
>> Then the job just ends and all the extracted links were never fetched.
>>
>> Erlend
>>
>>
>> On 8/12/13 5:15 PM, Erlend Garåsen wrote:
>>
>>> Thanks, I will tomorrow and report thereafter. I hope we will find a
>>> simple explanation. :)
>>>
>>> E
>>>
>>> On 8/12/13 5:07 PM, Karl Wright wrote:
>>>
>>>> Hi Erlend,
>>>>
>>>> You have wire logging (httpclient) enabled, which is useful for
>>>> debugging fetch issues, but you do not have connector debugging on.
>>>> To turn it on, add this to properties.xml:
>>>>
>>>> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>>>>
>>>> thanks,
>>>> Karl
>>>>
>>>>
>>>> On Mon, Aug 12, 2013 at 10:53 AM, Erlend Garåsen <[email protected]> wrote:
>>>>
>>>>     On 8/12/13 4:29 PM, Karl Wright wrote:
>>>>
>>>>         Hi Erlend,
>>>>
>>>>         The Document Status report shows these documents because they are
>>>>         still in the queue. The reasons for this could be several. Documents
>>>>         that exceed the hopcount by 1 level are allowed to remain in the
>>>>         queue for bookkeeping purposes. The "scheduled date" as given is only
>>>>         meaningful if the document is in an active state; my guess is that
>>>>         these documents are not in fact in that state, but rather in the
>>>>         state HOPCOUNT_EXCEEDED. Can you include one complete row from the
>>>>         Document Status report for one of the missing documents?
>>>>
>>>>
>>>>     For "http://www.ibsen.uio.no/sakprosa.xhtml":
>>>>     Job: Ibsen
>>>>     State: Out of scope
>>>>     Status: Hopcount exceeded
>>>>     Scheduled: 01-01-1970 01:00:00.000
>>>>     Scheduled action: Process
>>>>     Retry count: N/A
>>>>     Retry limit: N/A
>>>>
>>>>
>>>>         When you added documents to the seed list, what did the Simple
>>>>         History say when they were fetched? If they don't appear in the
>>>>         simple history, they SHOULD have nevertheless appeared in the log,
>>>>         with an explanation of why they were excluded, provided you have
>>>>         connector debugging enabled.
>>>>
>>>>     OK, here is the seed list:
>>>>     http://www.ibsen.uio.no/
>>>>     http://www.ibsen.uio.no/skuespill.xhtml
>>>>     http://www.ibsen.uio.no/dikt.xhtml
>>>>     http://www.ibsen.uio.no/brev.xhtml
>>>>     http://www.ibsen.uio.no/sakprosa.xhtml
>>>>     http://www.ibsen.uio.no/varia.xhtml
>>>>     http://www.ibsen.uio.no/undervisningsressurser.xhtml
>>>>
>>>>     Here are the results from simple history:
>>>>     08-12-2013 16:46:26.536  job end  1368534065016(Ibsen)  0  1
>>>>     08-12-2013 16:46:09.927  document ingest (Solr)  http://www.ibsen.uio.no/forside.xhtml  OK  11897  178
>>>>     08-12-2013 16:46:09.751  fetch  http://www.ibsen.uio.no/forside.xhtml  200  11897  17
>>>>     08-12-2013 16:44:48.829  fetch  http://www.ibsen.uio.no/  302  0  79484
>>>>     08-12-2013 16:44:48.727  robots parse  www.ibsen.uio.no:80  HTML  0  2  Robots file contained HTML, skipped
>>>>     08-12-2013 16:44:46.574  job start  1368534065016(Ibsen)  0  1
>>>>     1
>>>>
>>>>     HttpClient log:
>>>>     http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>>
>>>>     Erlend
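
For illustration only, the four skip reasons listed at the top of this message can be sketched as one check that returns the first reason that applies, so that it can be logged instead of being dropped silently. This is a hypothetical Python sketch, not ManifoldCF code: Document, JobConstraints, skip_reason and every field name are assumptions made up for the example. ManifoldCF itself is written in Java and its actual checks may differ in order and detail.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple, FrozenSet


@dataclass(frozen=True)
class Document:
    # Hypothetical stand-in for a fetched document.
    url: str
    mime_type: str
    length: int


@dataclass(frozen=True)
class JobConstraints:
    # Hypothetical stand-ins; these are not ManifoldCF configuration names.
    max_length: int                       # longest document that will be indexed
    accepted_mime_types: FrozenSet[str]   # mime types the output connector takes
    rejected_url_patterns: Tuple[str, ...]  # URLs the output connector refuses
    indexable_patterns: Tuple[str, ...]     # the job's "indexable" regexps


def skip_reason(doc: Document, c: JobConstraints) -> Optional[str]:
    """Return the first reason the document would be skipped, or None."""
    if doc.length > c.max_length:
        return "length too long"
    if doc.mime_type not in c.accepted_mime_types:
        return "output connector rejects mime type"
    if any(re.search(p, doc.url) for p in c.rejected_url_patterns):
        return "output connector rejects url"
    if not any(re.search(p, doc.url) for p in c.indexable_patterns):
        return "not indexable according to the job constraints"
    return None
```

A caller would log the returned string at DEBUG level whenever it is not None, which is the kind of message that the 1.3 logging change mentioned above is meant to provide.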
