I've opened a ticket - CONNECTORS-764.

Karl
On Tue, Aug 13, 2013 at 6:32 PM, Karl Wright <[email protected]> wrote:

> I may have a scenario that could trigger the problem.
>
> (1) Set the max hops for a job relatively low
> (2) Crawl
> (3) Increase the max hops
> (4) Crawl again
>
> I think under these conditions, it may be that we're not properly removing
> the "hop count exceeded" states for documents that were encountered in the
> first crawl that were too far from the seeds.
>
> If this is the problem, it should be easy to confirm. I'm not quite sure
> how to fix it yet though - need to do some research.
>
> Karl
>
>
> On Tue, Aug 13, 2013 at 10:16 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Erlend,
>>
>> I see what must be happening. The intrinsiclink table already has the
>> link to the skuespill document in it, and because of that, nothing in the
>> hopcount world is even getting looked at. So in a nutshell, the problem is
>> that somehow the hopcount table's data was messed up, but now there's no
>> good way to recover.
>>
>> I would really like to know how it got messed up in the first place, but
>> since there's been a lot of activity on that machine it would be a
>> challenge to come up with the exact sequence of events. If you think you
>> remember it, please write it down and maybe try it on your test instance.
>> But for now, the simplest way to get the production instance back up and
>> running is to do the following:
>>
>> - Note all the job settings and configuration
>> - Delete the job
>> - Recreate the job
>> - Run the job
>>
>> Since there are very few documents in the job, it does not sound like
>> much of a problem to do that. Would this work for you?
>>
>> Karl
>>
>>
>> On Tue, Aug 13, 2013 at 9:55 AM, Erlend Garåsen <[email protected]> wrote:
>>
>>> On 8/13/13 3:34 PM, Karl Wright wrote:
>>>
>>>> Can you enable hopcount debugging, and rerun?
>>>> "org.apache.manifoldcf.hopcount" set to the value "DEBUG" in
>>>> properties.xml.
>>>
>>> For some odd reason, MCF does not log anything more with this
>>> configuration entry enabled:
>>> <property name="org.apache.manifoldcf.hopcount" value="DEBUG"/>
>>>
>>> I have double-checked everything - the configuration file is successfully
>>> read after I restart the Agent process and there are no old processes
>>> running (checked with the ps command). MCF has responded to every change
>>> I have made to properties.xml so far, but not this one.
>>>
>>> Here's the log output:
>>>
>>> WARN 2013-08-13 15:50:57,350 (Worker thread '24') - Web: Unknown
>>> robots.txt line from 'www.ibsen.uio.no:80': '<?xml version="1.0"
>>> encoding="UTF-8"?>'
>>> WARN 2013-08-13 15:50:57,351 (Worker thread '24') - Web: Unknown
>>> robots.txt line from 'www.ibsen.uio.no:80': '<!DOCTYPE html'
>>> WARN 2013-08-13 15:50:57,351 (Worker thread '24') - Web: Unknown
>>> robots.txt line from 'www.ibsen.uio.no:80': ' PUBLIC "-//W3C//DTD
>>> XHTML 1.0 Transitional//EN"
>>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
>>> WARN 2013-08-13 15:50:57,352 (Worker thread '24') - Web: Unknown
>>> robots.txt line from 'www.ibsen.uio.no:80': '<html
>>> saxon-error-attribute="http://www.w3.org/1999/xhtml"
>>> xml:lang="no"><head><meta http-equiv="Content-Type" content="text/html;
>>> charset=utf-8"> </meta><title>Henrik Ibsens skrifter:
>>> Feilmelding</title><link type="text/css" rel="stylesheet"
>>> href="rammeverk.css" media="all"/><link type="text/css" rel="stylesheet"
>>> href="vitnemouseover.css"/><link xmlns:tei="http://www.tei-c.org/ns/1.0"
>>> xmlns:HIS="http://www.example.org/ns/HIS"
>>> xmlns:exist="http://exist.sourceforge.net/NS/exist"
>>> rel="icon" type="image/png" href="icons/favicon.ico"/><script
>>> xmlns:tei="http://www.tei-c.org/ns/1.0"
>>> xmlns:HIS="http://www.example.org/ns/HIS"
>>> xmlns:exist="http://exist.sourceforge.net/NS/exist"
>>> src="http://code.jquery.com/jquery-1.6.2.min.js"
>>> type="text/javascript">return void;</script><script
>>> xmlns:tei="http://www.tei-c.org/ns/1.0"
>>> xmlns:HIS="http://www.example.org/ns/HIS"
>>> xmlns:exist="http://exist.sourceforge.net/NS/exist"
>>> src="jquery-ui-1.8.23.custom.min.js" type="text/javascript">return
>>> void;</script><script type="text/javascript">'
>>> INFO 2013-08-13 15:51:02,571 (Worker thread '24') - WEB: FETCH
>>> URL|http://www.ibsen.uio.no/|1376401862447+121|302|0|
>>> INFO 2013-08-13 15:51:05,383 (Worker thread '13') - WEB: FETCH
>>> URL|http://www.ibsen.uio.no/forside.xhtml|1376401865366+16|200|11897|
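
The four-step scenario suspected in the thread (low max hops, crawl, raise max hops, crawl again) can be illustrated with a toy model. This is only a sketch under stated assumptions: `crawl`, `exceeded`, and `clear_stale` are hypothetical names invented for illustration, and nothing here reflects ManifoldCF's actual code or table layout. It shows how a "hop count exceeded" marker that survives between crawls could keep documents hidden after the limit is raised. Clearing the markers appears only to contrast the two outcomes; it is not proposed as the fix for CONNECTORS-764.

```python
from collections import deque


def crawl(links, seeds, max_hops, exceeded, clear_stale=False):
    """Toy breadth-first crawl; returns the set of documents fetched.

    `exceeded` is a set that persists between crawls and holds documents
    previously marked "hop count exceeded". It is a hypothetical stand-in
    for persisted hop-count state, not ManifoldCF's real schema.
    """
    if clear_stale:
        # Hypothetical remedy: forget old markers before crawling.
        exceeded.clear()
    fetched = set()
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        doc, hops = queue.popleft()
        if doc in exceeded:
            # Stale marker: the document is skipped without its distance
            # being re-evaluated against the current limit.
            continue
        if hops > max_hops:
            exceeded.add(doc)
            continue
        fetched.add(doc)
        for target in links.get(doc, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return fetched


# Chain of links: seed -> a -> b -> c
links = {"seed": ["a"], "a": ["b"], "b": ["c"]}
exceeded = set()

first = crawl(links, ["seed"], 1, exceeded)   # steps (1) and (2)
second = crawl(links, ["seed"], 3, exceeded)  # steps (3) and (4)
third = crawl(links, ["seed"], 3, exceeded, clear_stale=True)

print(sorted(first))
print(sorted(second))
print(sorted(third))
```

In this toy model the second crawl fetches the same documents as the first even though the limit was raised, because `b` is still marked from the first crawl and `c` is only reachable through `b`. Only the third crawl, which discards the old markers, reaches both.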
