So, I carefully checked all of our jobs, and *none* have hop filters turned on (the text boxes are blank for all jobs).

Still seeing lots of these:

STATEMENT:  INSERT INTO hopdeletedeps (parentidhash,ownerid,jobid,childidhash,linktype) VALUES ($1,$2,$3,$4,$5)
ERROR:  could not serialize access due to read/write dependencies among transactions

On Mon, Feb 10, 2014 at 12:01 PM, Karl Wright <[email protected]> wrote:

> Hi Mark,
>
> Look here:
> manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html,
> and read the section on hop filters for the web connector.
>
> Karl
>
> On Mon, Feb 10, 2014 at 2:55 PM, Mark Libucha <[email protected]> wrote:
>
>> We restarted ManifoldCF, so we'll have to reproduce the problem before
>> we can get you more details.
>>
>> I don't understand the hopcount thing. How do you know it's enabled,
>> and where is it set? We're running with default settings, pretty much.
>>
>> Thanks,
>>
>> Mark
>>
>> On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Mark,
>>>
>>> MCF retries those sorts of errors automatically. It's possible there's
>>> a place we missed, but let's pursue other avenues first.
>>>
>>> One thing worth noting is that you have hop counting enabled, which is
>>> fine for small crawls but slows things down a lot (and can cause
>>> stalls when there are lots of records whose hopcount needs to be
>>> updated). Do you truly need link counting?
>>>
>>> The thread dump will tell us a lot, as will the simple history. When
>>> was the last time something happened in the simple history?
>>>
>>> Karl
>>>
>>> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>>>
>>>> More info... maybe we don't have Postgres configured correctly. Lots
>>>> of errors in the stdout log. For example:
>>>>
>>>> STATEMENT:  INSERT INTO intrinsiclink
>>>> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>>>> ERROR:  could not serialize access due to read/write dependencies
>>>> among transactions
>>>> DETAIL:  Reason code: Canceled on identification as a pivot, during
>>>> conflict in checking.
>>>> HINT:  The transaction might succeed if retried.
>>>>
>>>> and on other tables as well.
>>>>
>>>> Mark
>>>>
>>>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>>>>
>>>>> Thanks Karl, we may take you up on the offer when/if we reproduce
>>>>> with just a single crawl. We were running many at once. Can you
>>>>> describe or point me at instructions for the thread dump you'd like
>>>>> to see?
>>>>>
>>>>> We're using 1.4.1.
>>>>>
>>>>> The simple history looks clean. All 200s and OKs, with a few broken
>>>>> pipes, but those documents all seem to have been successfully
>>>>> fetched later. No rejects.
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> Mark
>>>>>
>>>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> The robots parse error is informational only and does not otherwise
>>>>>> affect crawling, so you will need to look elsewhere for the issue.
>>>>>>
>>>>>> First question: what version of MCF are you using? For a time,
>>>>>> trunk (and the release 1.5 branch) had exactly this problem
>>>>>> whenever connections were used that included certificates.
>>>>>>
>>>>>> I suggest that you rule out blocked sites by looking at the simple
>>>>>> history. If you see a lot of rejections, then maybe you are being
>>>>>> blocked. If, on the other hand, not much has happened at all for a
>>>>>> while, that's not the answer.
>>>>>>
>>>>>> The fastest way to start diagnosing this problem is to get a thread
>>>>>> dump. I'd be happy to look at it and let you know what I find.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>>>>
>>>>>>> I kicked off a bunch of web crawls on Friday to run over the
>>>>>>> weekend. They all started fine but didn't finish. No errors in the
>>>>>>> logs that I can find. All action seemed to stop after a couple of
>>>>>>> hours. It's configured as a complete crawl that runs every 24
>>>>>>> hours.
>>>>>>>
>>>>>>> I don't expect you to have an answer to what went wrong with such
>>>>>>> limited information, but I did see a problem with robots.txt (at
>>>>>>> the bottom of this email).
>>>>>>>
>>>>>>> Does it mean robots.txt was not used at all for the crawl, or just
>>>>>>> that that part was ignored? (I kind of expected this kind of error
>>>>>>> to kill the crawl, but maybe I just don't understand it.)
>>>>>>>
>>>>>>> If the crawl were ignoring robots.txt, or part of it, and the
>>>>>>> crawled site banned my crawler, what would I see in the MCF logs?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> 02-09-2014 09:54:48.679  robots parse  somesite.gov:80  ERRORS  01
>>>>>>> Unknown robots.txt line: 'Sitemap: <http://www.somesite.gov/sitemapindex.xml>'
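
The "Unknown robots.txt line" message at the bottom of the thread is exactly the behavior Karl describes: a tolerant robots.txt reader logs unrecognized directives (such as the "Sitemap:" extension) and keeps parsing, rather than failing the fetch. A minimal sketch of that idea — this is illustrative only, not ManifoldCF's actual parser, and the class and method names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Sketch of a tolerant robots.txt reader: unknown directives such as
 * "Sitemap:" are logged as warnings and skipped, never treated as fatal.
 */
public class TolerantRobots {
    public static List<String> disallowsFor(String agent, String robotsTxt) {
        List<String> disallows = new ArrayList<>();
        boolean inMatchingGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.split("#", 2)[0].trim();   // strip comments
            if (line.isEmpty()) continue;
            int colon = line.indexOf(':');
            if (colon < 0) continue;                     // malformed line: skip
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            switch (key) {
                case "user-agent":
                    inMatchingGroup = value.equals("*") || value.equalsIgnoreCase(agent);
                    break;
                case "disallow":
                    if (inMatchingGroup && !value.isEmpty()) disallows.add(value);
                    break;
                case "allow":
                case "crawl-delay":
                    break;                               // recognized, ignored in this sketch
                default:
                    // e.g. "Sitemap: ..." -- warn and keep going, never abort the parse
                    System.err.println("Unknown robots.txt line: '" + line + "'");
            }
        }
        return disallows;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\n"
                + "Disallow: /private\n"
                + "Sitemap: http://www.somesite.gov/sitemapindex.xml\n";
        System.out.println(disallowsFor("mycrawler", robots));  // prints [/private]
    }
}
```

Under this reading, the logged error means only that the "Sitemap:" line itself was ignored; the User-agent/Disallow rules in the same file still apply to the crawl.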

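The "could not serialize access" errors come from PostgreSQL's serializable isolation level, which aborts one of two conflicting transactions with SQLSTATE 40001 and hints that a retry may succeed — which is why Karl notes that MCF retries them automatically. A minimal sketch of that retry pattern, with a simulated transaction standing in for real JDBC work (the helper name and backoff values are assumptions for the example, not MCF's actual code):

```java
import java.util.concurrent.Callable;

/**
 * Retry wrapper for PostgreSQL serialization failures (SQLSTATE 40001):
 * rerun the whole transaction on that error, rethrow everything else.
 */
public class SerializationRetry {
    static final String SERIALIZATION_FAILURE = "40001";

    public static <T> T withRetries(Callable<T> txn, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return txn.call();               // run the whole transaction
            } catch (java.sql.SQLException e) {
                // Only serialization failures are safely retryable.
                if (!SERIALIZATION_FAILURE.equals(e.getSQLState()) || attempt >= maxAttempts)
                    throw e;
                Thread.sleep(50L * attempt);     // simple linear backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated transaction: fails twice with 40001, then commits.
        String result = withRetries(() -> {
            if (++calls[0] < 3)
                throw new java.sql.SQLException("could not serialize access", "40001");
            return "committed";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The key point of the pattern is that the retry must re-execute the entire transaction from the beginning, not just the failed statement — the HINT in the PostgreSQL log ("The transaction might succeed if retried") refers to the whole transaction.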