Hi Mark,

Look here: manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html
and read the section on hop filters for the web connector.

Karl

On Mon, Feb 10, 2014 at 2:55 PM, Mark Libucha <[email protected]> wrote:
> We restarted manifold, so we'll have to reproduce before we can get you
> more details.
>
> I don't understand the hopcount thing. How do you know it's enabled, and
> where is it set? We're running with pretty much default settings.
>
> Thanks,
>
> Mark
>
> On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:
>> Hi Mark,
>>
>> MCF retries those sorts of errors automatically. It's possible there's a
>> place we missed, but let's pursue other avenues first.
>>
>> One thing worth noting is that you have hop counting enabled, which is
>> fine for small crawls but slows things down a lot (and can cause stalls
>> when there are lots of records whose hopcount needs to be updated). Do
>> you truly need link counting?
>>
>> The thread dump will tell us a lot, as will the simple history. When was
>> the last time something happened in the simple history?
>>
>> Karl
>>
>> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>>> More info... maybe we don't have Postgres configured correctly. Lots of
>>> errors in the stdout log. For example:
>>>
>>>   STATEMENT: INSERT INTO intrinsiclink
>>>     (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>>>   ERROR: could not serialize access due to read/write dependencies among
>>>     transactions
>>>   DETAIL: Reason code: Canceled on identification as a pivot, during
>>>     conflict in checking.
>>>   HINT: The transaction might succeed if retried.
>>>
>>> and on other tables as well.
>>>
>>> Mark
>>>
>>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>>>> Thanks Karl, we may take you up on the offer when/if we reproduce with
>>>> just a single crawl. We were running many at once. Can you describe or
>>>> point me at instructions for the thread dump you'd like to see?
>>>>
>>>> We're using 1.4.1.
>>>>
>>>> The simple history looks clean.
>>>> All 200s and OKs, with a few broken pipes, but those documents all
>>>> seem to have been successfully fetched later. No rejects.
>>>>
>>>> Thanks again,
>>>>
>>>> Mark
>>>>
>>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>>>> Hi Mark,
>>>>>
>>>>> The robots parse error is informational only and does not otherwise
>>>>> affect crawling. So you will need to look elsewhere for the issue.
>>>>>
>>>>> First question: what version of MCF are you using? For a time, trunk
>>>>> (and the release 1.5 branch) had exactly this problem whenever
>>>>> connections that included certificates were used.
>>>>>
>>>>> I suggest that you rule out blocked sites by looking at the simple
>>>>> history. If you see a lot of rejections, then maybe you are being
>>>>> blocked. If, on the other hand, not much has happened at all for a
>>>>> while, that's not the answer.
>>>>>
>>>>> The fastest way to start diagnosing this problem is to get a thread
>>>>> dump. I'd be happy to look at it and let you know what I find.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>>>> I kicked off a bunch of web crawls on Friday to run over the
>>>>>> weekend. They all started fine but didn't finish. No errors in the
>>>>>> logs that I can find. All action seemed to stop after a couple of
>>>>>> hours. Each is configured as a complete crawl that runs every 24
>>>>>> hours.
>>>>>>
>>>>>> I don't expect you to have an answer to what went wrong with such
>>>>>> limited information, but I did see a problem with robots.txt (at the
>>>>>> bottom of this email).
>>>>>>
>>>>>> Does it mean robots.txt was not used at all for the crawl, or just
>>>>>> that part was ignored? (I kind of expected this kind of error to
>>>>>> kill the crawl, but maybe I just don't understand it.)
>>>>>> If the crawl were ignoring the robots.txt, or a part of it, and the
>>>>>> crawled site banned my crawler, what would I see in the MCF logs?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>   02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>>>>   ERRORS 01 Unknown robots.txt line: 'Sitemap: <http://www.somesite.gov/sitemapindex.xml>'
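
[Editor's note] On the robots.txt error at the bottom of the thread: the
`Sitemap:` directive itself is valid per the sitemaps.org protocol, so the
message reflects MCF 1.4.1's parser not recognizing that line, and, as Karl
says, it is informational only. A conventional robots.txt carrying the
directive looks like this (hostname taken from the log above; the URL is
written bare, without angle brackets):

```
User-agent: *
Disallow: /private/

# Sitemap is a non-group directive from the sitemaps.org protocol.
Sitemap: http://www.somesite.gov/sitemapindex.xml
```

A parser that skips unknown lines, as MCF's does here, still honors the
User-agent and Disallow groups around them.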
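
[Editor's note] The Postgres HINT in the thread above ("The transaction might
succeed if retried") is the standard signal for a conflict under serializable
isolation, and Karl notes that MCF retries these automatically. A minimal
sketch of that retry pattern (hypothetical helper names, not MCF's actual
code; in Postgres the corresponding SQLSTATE is 40001, serialization_failure):

```python
SERIALIZATION_FAILURE = "40001"  # Postgres SQLSTATE for serialization conflicts


class SerializationError(Exception):
    """Stands in for a driver error carrying SQLSTATE 40001 (hypothetical)."""
    sqlstate = SERIALIZATION_FAILURE


def run_with_retry(txn, max_attempts=5):
    """Re-run a transaction whose failure was a serialization conflict.

    Any other error propagates immediately; only the retriable conflict
    is caught. Real code would also roll back and back off between tries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationError:
            if attempt == max_attempts:
                raise  # give up after max_attempts tries


# Demo: a transaction that conflicts twice before succeeding.
attempts = {"n": 0}


def insert_link():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationError("could not serialize access")
    return "inserted"


result = run_with_retry(insert_link)
print(result, attempts["n"])  # → inserted 3
```

The point of the pattern is that a serialization failure is not a bug in the
statement itself: the same INSERT is expected to succeed on a later attempt
once the conflicting transactions have finished.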
