I couldn't find more information in the log after the upgrade. Yes, I'm
running version 1.3 now since I had to log in after the upgrade:
http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
I tried to fetch one of the missing documents by using Curl from our
prod server. Looks like an OK response to me even though this is Curl
and not HttpClient:
-bash-3.2$ curl -vvv \
  -H "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; [email protected])" \
  "http://www.ibsen.uio.no/sakprosa.xhtml"
* About to connect() to www.ibsen.uio.no port 80
* Trying 129.240.7.27... connected
* Connected to www.ibsen.uio.no (129.240.7.27) port 80
> GET /sakprosa.xhtml HTTP/1.1
> Host: www.ibsen.uio.no
> Accept: */*
> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; [email protected])
>
< HTTP/1.1 200 OK
< Date: Tue, 13 Aug 2013 12:40:02 GMT
< Server: Apache-Coyote/1.1
< X-Cocoon-Version: 2.1.12-dev
< Last-Modified: Fri, 09 Aug 2013 09:57:43 GMT
< Content-Type: text/html
< Content-Length: 11209
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
[...]
E
On 8/13/13 12:04 PM, Karl Wright wrote:
If this is still 1.2, then these were the unlogged reasons why a
document could be skipped:
(1) Length too long
(2) Output connector rejects mime type
(3) Output connector rejects url
(4) Document is not considered indexable according to the job
constraints (the "indexable" regular expressions)
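To make the four cases above concrete, here is a minimal sketch of the order in which such checks could be applied. This is not ManifoldCF's code: the function name `skip_reason` and all of its parameters are hypothetical, and the real connectors may test these conditions differently or in a different order.

```python
import re
from typing import Callable, Iterable, Optional


def skip_reason(
    url: str,
    mime_type: str,
    length: int,
    max_length: int,
    accepted_mime_types: Iterable[str],
    output_accepts_url: Callable[[str], bool],
    indexable_patterns: Iterable[str],
) -> Optional[str]:
    """Return why a document would be skipped, or None if it would be indexed.

    All names are hypothetical; this only mirrors the four reasons listed above.
    """
    # (1) Length too long
    if length > max_length:
        return "length too long"
    # (2) Output connector rejects mime type
    if mime_type not in set(accepted_mime_types):
        return "output connector rejects mime type"
    # (3) Output connector rejects url
    if not output_accepts_url(url):
        return "output connector rejects url"
    # (4) Document does not match any of the job's "indexable" regular expressions
    if not any(re.search(pattern, url) for pattern in indexable_patterns):
        return "not indexable according to job constraints"
    return None
```

Running each missing URL through a check like this, with the limits and patterns from the job configuration, would show which of the four reasons (if any) could apply.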
Karl
On Tue, Aug 13, 2013 at 5:56 AM, Karl Wright <[email protected]> wrote:
What version of ManifoldCF is this?
I ask because I updated the logging output in 1.3 to capture a
number of cases that previously did not log a reason why they were
skipped.
Karl
On Tue, Aug 13, 2013 at 5:27 AM, Erlend Garåsen <[email protected]> wrote:
OK, I have now changed the log level from INFO to DEBUG for
connectors as well. Here's the log:
http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
The following entry indicates that one of the missing URLs is
found/extracted from a link:
DEBUG 2013-08-13 10:58:48,630 (Worker thread '9') - WEB: In html
document 'http://www.ibsen.uio.no/forside.xhtml', found link to
'http://www.ibsen.uio.no/skuespill.xhtml'
Then the job just ends, and none of the extracted links are ever
fetched.
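A small script can collect every link the web connector reports in the DEBUG log and compare it with what was actually fetched. The sketch below assumes that each log entry is on a single line and follows the "WEB: In html document '...', found link to '...'" wording quoted above; the function names `extracted_links` and `never_fetched` are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the connector DEBUG entry quoted in this thread (assumed to be on one line).
LINK_RE = re.compile(r"WEB: In html document '([^']+)', found link to '([^']+)'")


def extracted_links(log_lines):
    """Map each parent document URL to the list of links found in it."""
    links = defaultdict(list)
    for line in log_lines:
        match = LINK_RE.search(line)
        if match:
            links[match.group(1)].append(match.group(2))
    return dict(links)


def never_fetched(links, fetched_urls):
    """Return the extracted link targets that are absent from fetched_urls, sorted."""
    fetched = set(fetched_urls)
    targets = {target for found in links.values() for target in found}
    return sorted(targets - fetched)
```

For example, passing the lines of manifoldcf.log to `extracted_links` and the URLs of the "fetch" rows from the Simple History to `never_fetched` would list the links that were discovered but not retrieved.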
Erlend
On 8/12/13 5:15 PM, Erlend Garåsen wrote:
Thanks, I will do that tomorrow and report back. I hope we will find a
simple explanation. :)
E
On 8/12/13 5:07 PM, Karl Wright wrote:
Hi Erlend,
You have wire logging (httpclient) enabled, which is useful for
debugging fetch issues, but you do not have connector debugging on. To
turn it on, add this to properties.xml:
<property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
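To confirm that the setting is actually present in the file being used, a quick check along these lines may help. It is only a sketch: it assumes the property name is `org.apache.manifoldcf.connectors` and that properties.xml holds `<property name="..." value="..."/>` elements under a single root element; the function name `connector_log_level` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumed property name for connector logging.
CONNECTOR_PROPERTY = "org.apache.manifoldcf.connectors"


def connector_log_level(properties_xml: str) -> Optional[str]:
    """Return the configured connector log level, or None if the property is absent."""
    root = ET.fromstring(properties_xml)
    for prop in root.iter("property"):
        if prop.get("name") == CONNECTOR_PROPERTY:
            return prop.get("value")
    return None
```

Calling it with the contents of properties.xml should return "DEBUG" once the line above has been added.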
thanks,
Karl
On Mon, Aug 12, 2013 at 10:53 AM, Erlend Garåsen <[email protected]> wrote:
On 8/12/13 4:29 PM, Karl Wright wrote:
Hi Erlend,
The Document Status report shows these documents because they are still
in the queue. There could be several reasons for this. Documents that
exceed the hopcount by 1 level are allowed to remain in the queue for
bookkeeping purposes. The "scheduled date" as given is only meaningful if
the document is in an active state; my guess is that these documents are
not in fact in that state, but rather in the state HOPCOUNT_EXCEEDED. Can
you include one complete row from the Document Status report for one of
the missing documents?
For "http://www.ibsen.uio.no/sakprosa.xhtml":
Job: Ibsen
State: Out of scope
Status: Hopcount exceeded
Scheduled: 01-01-1970 01:00:00.000
Scheduled action: Process
Retry count: N/A
Retry limit: N/A
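The behaviour described above, where documents one level past the hopcount limit stay in the queue without being fetched, can be illustrated with a small breadth-first walk. This is a simplified model, not ManifoldCF's implementation: the function `crawl_states`, the `links` mapping and the "PROCESSED" label are hypothetical; only the state name HOPCOUNT_EXCEEDED comes from this thread.

```python
from collections import deque


def crawl_states(seeds, links, max_hops):
    """Breadth-first walk over `links` (url -> list of linked urls).

    Documents within max_hops are marked PROCESSED. Documents exactly one
    level beyond the limit are recorded as HOPCOUNT_EXCEEDED and are not
    expanded, so anything further away never enters the queue at all.
    """
    hops = {}
    states = {}
    queue = deque()
    for seed in seeds:
        if seed not in hops:
            hops[seed] = 0  # seeds start at hop count 0
            queue.append(seed)
    while queue:
        url = queue.popleft()
        if hops[url] > max_hops:
            # Kept for bookkeeping only: never fetched, links not followed.
            states[url] = "HOPCOUNT_EXCEEDED"
            continue
        states[url] = "PROCESSED"
        for target in links.get(url, []):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return states
```

In this model a seed always has hop count 0, so a seed that shows up as "Hopcount exceeded", as in the row above, is not something the simple model accounts for.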
When you added documents to the seed list, what did the Simple History
say when they were fetched? If they don't appear in the simple history,
they SHOULD have nevertheless appeared in the log, with an explanation
of why they were excluded, provided you have connector debugging
enabled.
OK, here is the seed list:
http://www.ibsen.uio.no/
http://www.ibsen.uio.no/skuespill.xhtml
http://www.ibsen.uio.no/dikt.xhtml
http://www.ibsen.uio.no/brev.xhtml
http://www.ibsen.uio.no/sakprosa.xhtml
http://www.ibsen.uio.no/varia.xhtml
http://www.ibsen.uio.no/undervisningsressurser.xhtml
Here are the results from the simple history:
08-12-2013 16:46:26.536  job end  1368534065016(Ibsen)  0  1
08-12-2013 16:46:09.927  document ingest (Solr)  http://www.ibsen.uio.no/forside.xhtml  OK  11897  178
08-12-2013 16:46:09.751  fetch  http://www.ibsen.uio.no/forside.xhtml  200  11897  17
08-12-2013 16:44:48.829  fetch  http://www.ibsen.uio.no/  302  0  79484
08-12-2013 16:44:48.727  robots parse  www.ibsen.uio.no:80  HTML  0  2  Robots file contained HTML, skipped
08-12-2013 16:44:46.574  job start  1368534065016(Ibsen)  0  1
1
HttpClient log:
http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
Erlend