Karl, looks like the hop filter setting has fixed our problem, though there's still a bit more testing to do. Thanks so much for the help.
Mark

On Mon, Feb 10, 2014 at 1:12 PM, Mark Libucha <[email protected]> wrote:

> Apologies for the RTFM stumble after being pointed to it. Thought I did
> read it. Apparently not very carefully. I understand it now.
>
> Thanks.
>
> On Mon, Feb 10, 2014 at 12:55 PM, Karl Wright <[email protected]> wrote:
>
>> Read the documentation. Unless you select "Keep unreachable documents
>> forever", MCF will keep track of hop count info.
>>
>> Karl
>>
>> On Mon, Feb 10, 2014 at 3:17 PM, Mark Libucha <[email protected]> wrote:
>>
>>> So, I carefully checked all of our jobs, and *none* have hop filters
>>> turned on (the text boxes are blank for all jobs).
>>>
>>> We're still seeing lots of these:
>>>
>>> STATEMENT: INSERT INTO hopdeletedeps
>>> (parentidhash,ownerid,jobid,childidhash,linktype) VALUES ($1,$2,$3,$4,$5)
>>> ERROR: could not serialize access due to read/write dependencies
>>> among transactions
>>>
>>> On Mon, Feb 10, 2014 at 12:01 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> Look here:
>>>> manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html, and
>>>> read the section on hop filters for the web connector.
>>>>
>>>> Karl
>>>>
>>>> On Mon, Feb 10, 2014 at 2:55 PM, Mark Libucha <[email protected]> wrote:
>>>>
>>>>> We restarted ManifoldCF, so we'll have to reproduce the problem before
>>>>> we can get you more details.
>>>>>
>>>>> I don't understand the hopcount thing. How do you know, and where is
>>>>> it set? We're running with default settings, pretty much.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark
>>>>>
>>>>> On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> MCF retries those sorts of errors automatically. It's possible
>>>>>> there's a place we missed, but let's pursue other avenues first.
>>>>>>
>>>>>> One thing worth noting is that you have hop counting enabled, which
>>>>>> is fine for small crawls but slows things down a lot (and can cause
>>>>>> stalls when there are lots of records whose hopcount needs to be
>>>>>> updated). Do you truly need link counting?
>>>>>>
>>>>>> The thread dump will tell us a lot, as will the simple history. When
>>>>>> was the last time something happened in the simple history?
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>>>>>>
>>>>>>> More info... maybe we don't have Postgres configured correctly. There
>>>>>>> are lots of errors in the stdout log. For example:
>>>>>>>
>>>>>>> STATEMENT: INSERT INTO intrinsiclink
>>>>>>> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>>>>>>> ERROR: could not serialize access due to read/write dependencies
>>>>>>> among transactions
>>>>>>> DETAIL: Reason code: Canceled on identification as a pivot, during
>>>>>>> conflict in checking.
>>>>>>> HINT: The transaction might succeed if retried.
>>>>>>>
>>>>>>> and on other tables as well.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Karl, we may take you up on the offer when/if we reproduce
>>>>>>>> the problem with just a single crawl. We were running many at once.
>>>>>>>> Can you describe or point me at instructions for the thread dump
>>>>>>>> you'd like to see?
>>>>>>>>
>>>>>>>> We're using 1.4.1.
>>>>>>>>
>>>>>>>> The simple history looks clean. All 200s and OKs, with a few broken
>>>>>>>> pipes, but those documents all seem to have been successfully fetched
>>>>>>>> later. No rejects.
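[Editor's note: for readers hitting the same errors outside MCF, the `could not serialize access` messages quoted above are PostgreSQL serialization failures (SQLSTATE 40001) under SERIALIZABLE isolation. The HINT is literal: the client is expected to re-run the whole transaction, which is what Karl says MCF does internally. The sketch below is a standalone illustration of that retry discipline, not MCF's code; it uses a stub exception in place of a real database driver error.]

```python
# Minimal sketch of the client-side retry loop PostgreSQL expects under
# SERIALIZABLE isolation. A stub exception stands in for a driver error
# carrying SQLSTATE 40001 ("could not serialize access...").
import random
import time


class SerializationFailure(Exception):
    """Stand-in for a driver exception with SQLSTATE 40001."""


def run_with_retry(txn, max_attempts=5, base_delay=0.01):
    """Run a transaction callable, re-running it on serialization failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise
            # Jittered exponential backoff reduces repeated collisions
            # between the same concurrent transactions.
            time.sleep(base_delay * (2 ** attempt) * random.random())


# Demo: a "transaction" that conflicts twice, then commits.
attempts = {"n": 0}

def flaky_insert():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure("could not serialize access")
    return "committed"

print(run_with_retry(flaky_insert))  # prints "committed" on the third attempt
```

The point of the demo: the error is expected under contention and is not, by itself, a sign of misconfigured Postgres, matching Karl's advice to look elsewhere first.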
>>>>>>>>
>>>>>>>> Thanks again,
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Mark,
>>>>>>>>>
>>>>>>>>> The robots parse error is informational only and does not
>>>>>>>>> otherwise affect crawling. So you will need to look elsewhere for
>>>>>>>>> the issue.
>>>>>>>>>
>>>>>>>>> First question: what version of MCF are you using? For a time,
>>>>>>>>> trunk (and the release 1.5 branch) had exactly this problem whenever
>>>>>>>>> connections were used that included certificates.
>>>>>>>>>
>>>>>>>>> I suggest that you rule out blocked sites by looking at the simple
>>>>>>>>> history. If you see a lot of rejections, then maybe you are being
>>>>>>>>> blocked. If, on the other hand, not much has happened at all for a
>>>>>>>>> while, that's not the answer.
>>>>>>>>>
>>>>>>>>> The fastest way to start diagnosing this problem is to get a
>>>>>>>>> thread dump. I'd be happy to look at it and let you know what I find.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I kicked off a bunch of web crawls on Friday to run over the
>>>>>>>>>> weekend. They all started fine but didn't finish. There are no
>>>>>>>>>> errors in the logs that I can find. All action seemed to stop after
>>>>>>>>>> a couple of hours. Each job is configured as a complete crawl that
>>>>>>>>>> runs every 24 hours.
>>>>>>>>>>
>>>>>>>>>> I don't expect you to have an answer to what went wrong with such
>>>>>>>>>> limited information, but I did see a problem with robots.txt (at
>>>>>>>>>> the bottom of this email).
>>>>>>>>>>
>>>>>>>>>> Does it mean robots.txt was not used at all for the crawl, or
>>>>>>>>>> just that part was ignored? (I kind of expected this kind of error
>>>>>>>>>> to kill the crawl, but maybe I just don't understand it.)
>>>>>>>>>>
>>>>>>>>>> If the crawl were ignoring the robots.txt, or a part of it, and
>>>>>>>>>> the crawled site banned my crawler, what would I see in the MCF logs?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>>>>>>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <
>>>>>>>>>> http://www.somesite.gov/sitemapindex.xml>'
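[Editor's note: Karl's answer above is that an "Unknown robots.txt line" is informational: the unrecognized directive is skipped, and the rules the crawler does understand still apply. As a standalone illustration (not MCF's parser), Python's stdlib `urllib.robotparser` behaves the same way with a file like the one in the log; `example.gov` and the crawler name are made-up stand-ins.]

```python
# A robots.txt resembling the one in the log: a Sitemap directive
# alongside ordinary User-agent/Disallow rules. Parsers that don't
# understand the Sitemap line simply skip it; the rules still apply.
from urllib.robotparser import RobotFileParser

robots_txt = """\
Sitemap: http://www.example.gov/sitemapindex.xml
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The extra directive does not disable the rules around it:
print(rp.can_fetch("mycrawler", "http://www.example.gov/private/x.html"))  # False
print(rp.can_fetch("mycrawler", "http://www.example.gov/public/x.html"))   # True
```

In other words, a site whose robots.txt triggers this message is still being crawled politely; being banned by the site would show up as rejections (403s, connection refusals) in the simple history rather than in this log line.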
