So, I carefully checked all of our jobs, and *none* have hop filters turned
on (the text boxes are blank for all jobs).

Still seeing lots of these:

STATEMENT:  INSERT INTO hopdeletedeps
(parentidhash,ownerid,jobid,childidhash,linktype) VALUES ($1,$2,$3,$4,$5)
ERROR:  could not serialize access due to read/write dependencies among
transactions
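
For what it's worth, Postgres flags these as SQLState 40001
(serialization_failure), the retriable conflict you get under SERIALIZABLE
isolation, and the HINT in the earlier excerpt (quoted below) says a retry
might succeed. To check my understanding of what "retried automatically"
means, here is a minimal sketch of the retry pattern in plain JDBC (not
MCF's actual code; just the table and parameters from the log line above):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class HopDeleteDepsRetry {
      // PostgreSQL SQLState for "could not serialize access ..."
      private static final String SERIALIZATION_FAILURE = "40001";
      private static final int MAX_ATTEMPTS = 5;

      public static void insertWithRetry(Connection conn, String parentIdHash,
          String ownerId, long jobId, String childIdHash, String linkType)
          throws SQLException, InterruptedException {
        String sql = "INSERT INTO hopdeletedeps "
            + "(parentidhash,ownerid,jobid,childidhash,linktype) VALUES (?,?,?,?,?)";
        conn.setAutoCommit(false);
        for (int attempt = 1; ; attempt++) {
          try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, parentIdHash);
            ps.setString(2, ownerId);
            ps.setLong(3, jobId);
            ps.setString(4, childIdHash);
            ps.setString(5, linkType);
            ps.executeUpdate();
            conn.commit();
            return;
          } catch (SQLException e) {
            conn.rollback();  // the failed transaction must be redone from scratch
            // 40001 is the only error worth retrying; rethrow everything else
            if (!SERIALIZATION_FAILURE.equals(e.getSQLState())
                || attempt >= MAX_ATTEMPTS)
              throw e;
            Thread.sleep(100L * attempt);  // crude linear backoff between attempts
          }
        }
      }
    }

If MCF does the equivalent internally, these Postgres log entries are just
noise from attempts that eventually succeeded; an insert that fails and is
never retried would be the real bug.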



On Mon, Feb 10, 2014 at 12:01 PM, Karl Wright <[email protected]> wrote:

> Hi Mark,
>
> Look here:
> manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html
> and read the section on hop filters for the web connector.
>
> Karl
>
>
>
> On Mon, Feb 10, 2014 at 2:55 PM, Mark Libucha <[email protected]> wrote:
>
>> We restarted ManifoldCF, so we'll have to reproduce the problem before we
>> can get you more details.
>>
>> I don't understand the hopcount thing. How do you know, and where is it
>> set? We're running with pretty much default settings.
>>
>> Thanks,
>>
>> Mark
>>
>>
>> On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Mark,
>>>
>>> MCF retries those sorts of errors automatically.  It's possible there's
>>> a place we missed, but let's pursue other avenues first.
>>>
>>> One thing worth noting is that you have hop counting enabled, which is
>>> fine for small crawls but slows things down a lot (and can cause stalls
>>> when there are lots of records whose hopcount needs to be updated).  Do you
>>> truly need link counting?
>>>
>>> The thread dump will tell us a lot, as will the simple history.  When
>>> was the last time something happened in the simple history?
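>>>
>>> To grab one, assuming a standard JDK: find the pid of the agents process
>>> (jps will list it) and run
>>>
>>>     jstack -l <agents-pid> > mcf-threads.txt
>>>
>>> or send the JVM a SIGQUIT ("kill -3 <agents-pid>"), which prints the dump
>>> on the process's stdout.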
>>>
>>> Karl
>>>
>>>
>>>
>>> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>>>
>>>> More info... maybe we don't have Postgres configured correctly. Lots of
>>>> errors in the stdout log. For example:
>>>>
>>>> STATEMENT:  INSERT INTO intrinsiclink
>>>> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>>>> ERROR:  could not serialize access due to read/write dependencies among
>>>> transactions
>>>> DETAIL:  Reason code: Canceled on identification as a pivot, during
>>>> conflict in checking.
>>>> HINT:  The transaction might succeed if retried.
>>>>
>>>> and on other tables as well.
>>>>
>>>> Mark
>>>>
>>>>
>>>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>>>>
>>>>> Thanks Karl, we may take you up on the offer when/if we reproduce with
>>>>> just a single crawl. We were running many at once. Can you describe or
>>>>> point me at instructions for the thread dump you'd like to see?
>>>>>
>>>>> We're using 1.4.1.
>>>>>
>>>>> The simple history looks clean. All 200s and OKs, with a few broken
>>>>> pipes, but those documents all seem to have been successfully fetched
>>>>> later. No rejects.
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> The robots parse error is informational only and does not otherwise
>>>>>> affect crawling.  So you will need to look elsewhere for the issue.
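>>>>>>
>>>>>> The line it is complaining about is the "Sitemap:" directive, a common
>>>>>> robots.txt extension; the parser skips the line it doesn't understand,
>>>>>> and the User-agent/Disallow rules are still honored.  Going by the
>>>>>> error quoted below, the file presumably contains something like this
>>>>>> (hypothetical contents):
>>>>>>
>>>>>>     User-agent: *
>>>>>>     Disallow: /cgi-bin/
>>>>>>     Sitemap: http://www.somesite.gov/sitemapindex.xml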
>>>>>>
>>>>>> First question: what version of MCF are you using?  For a time, trunk
>>>>>> (and the release 1.5 branch) had exactly this problem whenever
>>>>>> connections were used that included certificates.
>>>>>>
>>>>>> I suggest that you rule out blocked sites by looking at the simple
>>>>>> history.  If you see a lot of rejections then maybe you are being
>>>>>> blocked.  If, on the other hand, not much has happened at all for a
>>>>>> while, that's not the answer.
>>>>>>
>>>>>> The fastest way to start diagnosing this problem is to get a thread
>>>>>> dump.  I'd be happy to look at it and let you know what I find.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>>>>
>>>>>>> I kicked off a bunch of web crawls on Friday to run over the
>>>>>>> weekend. They all started fine but didn't finish. No errors in the
>>>>>>> logs that I can find. All action seemed to stop after a couple of
>>>>>>> hours. It's configured as a complete crawl that runs every 24 hours.
>>>>>>>
>>>>>>> I don't expect you to have an answer to what went wrong with such
>>>>>>> limited information, but I did see a problem with robots.txt (at the
>>>>>>> bottom of this email).
>>>>>>>
>>>>>>> Does it mean robots.txt was not used at all for the crawl, or just
>>>>>>> that part was ignored? (I half expected an error like this to kill
>>>>>>> the crawl, but maybe I just don't understand it.)
>>>>>>>
>>>>>>> If the crawl were ignoring the robots.txt, or a part of it, and the
>>>>>>> crawled site banned my crawler, what would I see in the MCF logs?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> 02-09-2014 09:54:48.679  robots parse  somesite.gov:80  ERRORS  01
>>>>>>> Unknown robots.txt line: 'Sitemap: <http://www.somesite.gov/sitemapindex.xml>'
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
