Re: Hop count problem

Karl Wright Tue, 13 Aug 2013 02:56:55 -0700

What version of ManifoldCF is this?

I ask because I updated the logging output in 1.3 to capture a number of
cases that previously did not log a reason why they were skipped.


Karl



On Tue, Aug 13, 2013 at 5:27 AM, Erlend Garåsen <[email protected]> wrote:

>
> OK, I have now changed the log level from INFO to DEBUG for connectors as
> well. Here's the log:
> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>
> The following entry indicates that one of the missing URLs was
> found/extracted from a link:
> DEBUG 2013-08-13 10:58:48,630 (Worker thread '9') - WEB: In html document
> 'http://www.ibsen.uio.no/forside.xhtml', found link to
> 'http://www.ibsen.uio.no/skuespill.xhtml'
>
> Then the job just ends, and none of the extracted links are ever fetched.
>
> Erlend
>
>
> On 8/12/13 5:15 PM, Erlend Garåsen wrote:
>
>>
>> Thanks, I will do that tomorrow and report back. I hope we will find a
>> simple explanation. :)
>>
>> E
>>
>> On 8/12/13 5:07 PM, Karl Wright wrote:
>>
>>> Hi Erlend,
>>>
>>> You have wire logging (httpclient) enabled, which is useful for
>>> debugging fetch issues, but you do not have connector debugging on.  To
>>> turn it on, add this to properties.xml:
>>>
>>> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>>>
>>> thanks,
>>> Karl
>>>
>>>
>>> On Mon, Aug 12, 2013 at 10:53 AM, Erlend Garåsen <[email protected]>
>>> wrote:
>>>
>>>     On 8/12/13 4:29 PM, Karl Wright wrote:
>>>
>>>         Hi Erlend,
>>>
>>>         The Document Status report shows these documents because they are
>>>         still in the queue.  There could be several reasons for this.
>>>         Documents that exceed the hopcount by 1 level are allowed to remain
>>>         in the queue for bookkeeping purposes.  The "scheduled date" as
>>>         given is only meaningful if the document is in an active state; my
>>>         guess is that these documents are not in fact in that state, but
>>>         rather in the state HOPCOUNT_EXCEEDED.  Can you include one complete
>>>         row from the Document Status report for one of the missing
>>>         documents?
>>>
>>>
>>>     For "http://www.ibsen.uio.no/sakprosa.xhtml":
>>>     Job: Ibsen
>>>
>>>     State: Out of scope
>>>     Status: Hopcount exceeded
>>>     Scheduled: 01-01-1970 01:00:00.000
>>>     Scheduled action: Process
>>>     Retry count: N/A
>>>     Retry limit: N/A
>>>
>>>
>>>         When you added documents to the seed list, what did the Simple
>>>         History say when they were fetched?  If they don't appear in the
>>>         Simple History, they SHOULD nevertheless have appeared in the log,
>>>         with an explanation of why they were excluded, provided you have
>>>         connector debugging enabled.
>>>
>>>
>>>     OK, here is the seed list:
>>>     http://www.ibsen.uio.no/
>>>     http://www.ibsen.uio.no/skuespill.xhtml
>>>     http://www.ibsen.uio.no/dikt.xhtml
>>>     http://www.ibsen.uio.no/brev.xhtml
>>>     http://www.ibsen.uio.no/sakprosa.xhtml
>>>     http://www.ibsen.uio.no/varia.xhtml
>>>     http://www.ibsen.uio.no/undervisningsressurser.xhtml
>>>
>>>     Here are the results from the Simple History:
>>>     08-12-2013 16:46:26.536         job end         1368534065016(Ibsen)
>>>                      0       1
>>>     08-12-2013 16:46:09.927         document ingest (Solr)
>>>     http://www.ibsen.uio.no/forside.xhtml
>>>              OK      11897   178
>>>     08-12-2013 16:46:09.751         fetch
>>>     http://www.ibsen.uio.no/forside.xhtml
>>>              200     11897   17
>>>     08-12-2013 16:44:48.829         fetch http://www.ibsen.uio.no/
>>>              302     0       79484
>>>     08-12-2013 16:44:48.727         robots parse www.ibsen.uio.no:80
>>>              HTML    0       2       Robots file contained HTML, skipped
>>>     08-12-2013 16:44:46.574         job start       1368534065016(Ibsen)
>>>                      0       1
>>>              1
>>>
>>>     HttpClient log:
>>>     http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>
>>>     Erlend
>>>
>>>
>>>
>>
>
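
For readers following the hopcount discussion quoted above, here is a minimal
sketch of how a hop-limited crawl can keep documents that sit one level past
the limit in the queue for bookkeeping without ever fetching them. It is an
illustration only and not ManifoldCF's actual implementation; the function
name crawl, the links dictionary, and the max_hops parameter are all
hypothetical.

```python
from collections import deque


def crawl(seeds, links, max_hops):
    """Breadth-first crawl over an in-memory link graph with a hop limit.

    links maps each URL to the list of URLs it links to (hypothetical data).
    Returns (fetched, exceeded): URLs within the limit, and URLs that were
    discovered one hop past the limit and recorded but never fetched.
    """
    # Seeds are at hop count 0; dict.fromkeys drops duplicate seeds.
    hops = dict.fromkeys(seeds, 0)
    queue = deque(hops)
    fetched, exceeded = [], []
    while queue:
        url = queue.popleft()
        if hops[url] > max_hops:
            # Kept for bookkeeping only; its links are never extracted.
            exceeded.append(url)
            continue
        fetched.append(url)
        for target in links.get(url, []):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return fetched, exceeded
```

Because documents past the limit are never fetched, nothing beyond them is
discovered, so the second list only ever holds documents exactly one level
over the limit.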
