Read the documentation. Unless you select "Keep unreachable documents forever", MCF will keep track of hop count info.
Karl

On Mon, Feb 10, 2014 at 3:17 PM, Mark Libucha <[email protected]> wrote:

> So, I carefully checked all of our jobs, and *none* have hop filters
> turned on (the text boxes are blank for all jobs).
>
> Still seeing lots of these:
>
> STATEMENT: INSERT INTO hopdeletedeps
> (parentidhash,ownerid,jobid,childidhash,linktype) VALUES ($1,$2,$3,$4,$5)
> ERROR: could not serialize access due to read/write dependencies among
> transactions
>
> On Mon, Feb 10, 2014 at 12:01 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Mark,
>>
>> Look here:
>> manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html,
>> and read the section on hop filters for the web connector.
>>
>> Karl
>>
>> On Mon, Feb 10, 2014 at 2:55 PM, Mark Libucha <[email protected]> wrote:
>>
>>> We restarted manifold, so we'll have to reproduce before we can get
>>> you more details.
>>>
>>> I don't understand the hopcount thing. How do you know, and where is
>>> it set? We're running with default settings, pretty much.
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>> On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> MCF retries those sorts of errors automatically. It's possible
>>>> there's a place we missed, but let's pursue other avenues first.
>>>>
>>>> One thing worth noting is that you have hop counting enabled, which
>>>> is fine for small crawls but slows things down a lot (and can cause
>>>> stalls when there are lots of records whose hopcount needs to be
>>>> updated). Do you truly need link counting?
>>>>
>>>> The thread dump will tell us a lot, as will the simple history. When
>>>> was the last time something happened in the simple history?
>>>>
>>>> Karl
>>>>
>>>> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>>>>
>>>>> More info... maybe we don't have postgres configured correctly.
>>>>> Lots of errors in the stdout log.
>>>>> For example:
>>>>>
>>>>> STATEMENT: INSERT INTO intrinsiclink
>>>>> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>>>>> ERROR: could not serialize access due to read/write dependencies
>>>>> among transactions
>>>>> DETAIL: Reason code: Canceled on identification as a pivot, during
>>>>> conflict in checking.
>>>>> HINT: The transaction might succeed if retried.
>>>>>
>>>>> and on other tables as well.
>>>>>
>>>>> Mark
>>>>>
>>>>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>>>>>
>>>>>> Thanks Karl, we may take you up on the offer when/if we reproduce
>>>>>> with just a single crawl. We were running many at once. Can you
>>>>>> describe or point me at instructions for the thread dump you'd
>>>>>> like to see?
>>>>>>
>>>>>> We're using 1.4.1.
>>>>>>
>>>>>> The simple history looks clean. All 200s and OKs, with a few
>>>>>> broken pipes, but those documents all seem to have been
>>>>>> successfully fetched later. No rejects.
>>>>>>
>>>>>> Thanks again,
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> The robots parse error is informational only and does not
>>>>>>> otherwise affect crawling, so you will need to look elsewhere for
>>>>>>> the issue.
>>>>>>>
>>>>>>> First question: what version of MCF are you using? For a time,
>>>>>>> trunk (and the release 1.5 branch) had exactly this problem
>>>>>>> whenever connections were used that included certificates.
>>>>>>>
>>>>>>> I suggest that you rule out blocked sites by looking at the
>>>>>>> simple history. If you see a lot of rejections, then maybe you
>>>>>>> are being blocked. If, on the other hand, not much has happened
>>>>>>> at all for a while, that's not the answer.
>>>>>>>
>>>>>>> The fastest way to start diagnosing this problem is to get a
>>>>>>> thread dump. I'd be happy to look at it and let you know what I
>>>>>>> find.
>>>>>>> Karl
>>>>>>>
>>>>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>>>>>
>>>>>>>> I kicked off a bunch of web crawls on Friday to run over the
>>>>>>>> weekend. They all started fine but didn't finish. No errors in
>>>>>>>> the logs that I can find. All action seemed to stop after a
>>>>>>>> couple of hours. It's configured as a complete crawl that runs
>>>>>>>> every 24 hours.
>>>>>>>>
>>>>>>>> I don't expect you to have an answer to what went wrong with
>>>>>>>> such limited information, but I did see a problem with
>>>>>>>> robots.txt (at the bottom of this email).
>>>>>>>>
>>>>>>>> Does it mean robots.txt was not used at all for the crawl, or
>>>>>>>> just that that part was ignored? (I kind of expected this kind
>>>>>>>> of error to kill the crawl, but maybe I just don't understand
>>>>>>>> it.)
>>>>>>>>
>>>>>>>> If the crawl were ignoring robots.txt, or a part of it, and the
>>>>>>>> crawled site banned my crawler, what would I see in the MCF
>>>>>>>> logs?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>>>>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <
>>>>>>>> http://www.somesite.gov/sitemapindex.xml>'
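The PostgreSQL errors quoted in this thread carry SQLSTATE 40001 ("serialization_failure"), which is an expected artifact of running under the serializable isolation level; as the HINT line says, the standard remedy is simply to re-run the transaction, which is what Karl describes MCF doing internally. A minimal generic sketch of such a retry loop (the `SerializationRetry`/`withRetry` names are hypothetical, not MCF's actual code):

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Generic retry wrapper for PostgreSQL serialization failures.
// SQLSTATE 40001 means the transaction was canceled by serializable-
// isolation conflict detection and is safe to re-run from the top.
public class SerializationRetry {
    // Assumes maxAttempts >= 1.
    public static <T> T withRetry(Callable<T> txn, int maxAttempts) throws Exception {
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return txn.call();           // run the whole transaction body
            } catch (SQLException e) {
                if (!"40001".equals(e.getSQLState())) {
                    throw e;                 // a real error, not a retry case
                }
                last = e;                    // serialization failure: try again
            }
        }
        throw last;                          // retries exhausted
    }
}
```

With a real `java.sql.Connection`, the `Callable` body would open the transaction, perform the INSERTs, and commit, so that each retry replays the whole unit of work.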
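Karl's diagnostic of choice is a thread dump. Against a running MCF agents process this is typically taken externally with `jstack <pid>` (or `kill -3 <pid>` on Unix, which writes the dump to stdout). For illustration, the same information can be captured in-process with the JDK's `ThreadMXBean`; this is a generic JVM sketch, not MCF code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Prints every live thread's stack trace, comparable in content to
// what `jstack <pid>` produces externally.
public class ThreadDump {
    public static String capture() {
        StringBuilder sb = new StringBuilder();
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // lockedMonitors/lockedSynchronizers = true includes lock-ownership
        // info, which is what matters when diagnosing a stalled crawl.
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(capture());
    }
}
```

In a stall like the one described here, the interesting threads are the worker threads: if they are all parked waiting on database activity, that points back at the PostgreSQL serialization traffic.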
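The "Unknown robots.txt line" entry at the bottom of the thread flags a `Sitemap:` directive, an extension beyond the core robots.txt rules (User-agent/Disallow/Allow); per Karl, the message is informational and the recognized rules are still honored. A tolerant parser logs unknown directives and moves on, roughly as in this illustrative sketch (not MCF's actual parser; the `RobotsLines` name and directive set are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative robots.txt line handling: known directives are collected,
// unknown ones (like the "Sitemap:" extension) are warned about and
// skipped rather than aborting the parse.
public class RobotsLines {
    public static List<String> parse(String robotsTxt, List<String> warnings) {
        List<String> rules = new ArrayList<>();
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;  // blank line or comment
            }
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("user-agent:") || lower.startsWith("disallow:")
                    || lower.startsWith("allow:") || lower.startsWith("crawl-delay:")) {
                rules.add(line);  // directive this sketch enforces
            } else {
                warnings.add("Unknown robots.txt line: '" + line + "'");  // log and move on
            }
        }
        return rules;
    }
}
```

Under this reading, the logged error answers Mark's question: only the unrecognized `Sitemap:` line was skipped, and the rest of the file still governed the crawl.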
