Read the documentation. Unless you select "Keep unreachable documents forever", MCF will keep track of hop count info.
Karl

On Mon, Feb 10, 2014 at 3:17 PM, Mark Libucha <[email protected]> wrote:

> So, I carefully checked all of our jobs, and *none* have hop filters
> turned on (the text boxes are blank for all jobs).
>
> Still seeing lots of these:
>
> STATEMENT: INSERT INTO hopdeletedeps
> (parentidhash,ownerid,jobid,childidhash,linktype) VALUES ($1,$2,$3,$4,$5)
> ERROR: could not serialize access due to read/write dependencies among
> transactions
>
> On Mon, Feb 10, 2014 at 12:01 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Mark,
>>
>> Look here:
>> manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html,
>> and read the section on hop filters for the web connector.
>>
>> Karl
>>
>> On Mon, Feb 10, 2014 at 2:55 PM, Mark Libucha <[email protected]> wrote:
>>
>>> We restarted manifold, so we'll have to reproduce before we can get
>>> you more details.
>>>
>>> I don't understand the hopcount thing. How do you know, and where is
>>> it set? We're running with default settings, pretty much.
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>> On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> MCF retries those sorts of errors automatically. It's possible
>>>> there's a place we missed, but let's pursue other avenues first.
>>>>
>>>> One thing worth noting is that you have hop counting enabled, which
>>>> is fine for small crawls but slows things down a lot (and can cause
>>>> stalls when there are lots of records whose hopcount needs to be
>>>> updated). Do you truly need link counting?
>>>>
>>>> The thread dump will tell us a lot, as will the simple history. When
>>>> was the last time something happened in the simple history?
>>>>
>>>> Karl
>>>>
>>>> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>>>>
>>>>> More info... maybe we don't have postgres configured correctly.
>>>>> Lots of errors in the stdout log.
>>>>> For example:
>>>>>
>>>>> STATEMENT: INSERT INTO intrinsiclink
>>>>> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>>>>> ERROR: could not serialize access due to read/write dependencies
>>>>> among transactions
>>>>> DETAIL: Reason code: Canceled on identification as a pivot, during
>>>>> conflict in checking.
>>>>> HINT: The transaction might succeed if retried.
>>>>>
>>>>> and on other tables as well.
>>>>>
>>>>> Mark
>>>>>
>>>>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>>>>>
>>>>>> Thanks Karl, we may take you up on the offer when/if we reproduce
>>>>>> with just a single crawl. We were running many at once. Can you
>>>>>> describe or point me at instructions for the thread dump you'd
>>>>>> like to see?
>>>>>>
>>>>>> We're using 1.4.1.
>>>>>>
>>>>>> The simple history looks clean. All 200s and OKs, with a few
>>>>>> broken pipes, but those documents all seem to have been
>>>>>> successfully fetched later. No rejects.
>>>>>>
>>>>>> Thanks again,
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> The robots parse error is informational only and does not
>>>>>>> otherwise affect crawling, so you will need to look elsewhere for
>>>>>>> the issue.
>>>>>>>
>>>>>>> First question: what version of MCF are you using? For a time,
>>>>>>> trunk (and the release 1.5 branch) had exactly this problem
>>>>>>> whenever connections were used that included certificates.
>>>>>>>
>>>>>>> I suggest that you rule out blocked sites by looking at the
>>>>>>> simple history. If you see a lot of rejections, then maybe you
>>>>>>> are being blocked. If, on the other hand, not much has happened
>>>>>>> at all for a while, that's not the answer.
>>>>>>>
>>>>>>> The fastest way to start diagnosing this problem is to get a
>>>>>>> thread dump. I'd be happy to look at it and let you know what I
>>>>>>> find.
>>>>>>> Karl
>>>>>>>
>>>>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>>>>>
>>>>>>>> I kicked off a bunch of web crawls on Friday to run over the
>>>>>>>> weekend. They all started fine but didn't finish. No errors in
>>>>>>>> the logs that I can find. All action seemed to stop after a
>>>>>>>> couple of hours. It's configured as a complete crawl that runs
>>>>>>>> every 24 hours.
>>>>>>>>
>>>>>>>> I don't expect you to have an answer to what went wrong with
>>>>>>>> such limited information, but I did see a problem with
>>>>>>>> robots.txt (at the bottom of this email).
>>>>>>>>
>>>>>>>> Does it mean robots.txt was not used at all for the crawl, or
>>>>>>>> just that that part was ignored? (I kind of expected this kind
>>>>>>>> of error to kill the crawl, but maybe I just don't understand
>>>>>>>> it.)
>>>>>>>>
>>>>>>>> If the crawl were ignoring robots.txt, or a part of it, and the
>>>>>>>> crawled site banned my crawler, what would I see in the MCF
>>>>>>>> logs?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>>>>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <
>>>>>>>> http://www.somesite.gov/sitemapindex.xml>'
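The PostgreSQL errors quoted in this thread carry SQLSTATE 40001 ("serialization_failure"), which is an expected artifact of running under the serializable isolation level; as the HINT line says, the standard remedy is simply to re-run the transaction, which is what Karl describes MCF doing internally. A minimal generic sketch of such a retry loop (the `SerializationRetry`/`withRetry` names are hypothetical, not MCF's actual code):

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Generic retry wrapper for PostgreSQL serialization failures.
// SQLSTATE 40001 means the transaction was canceled by serializable-
// isolation conflict detection and is safe to re-run from the top.
public class SerializationRetry {
    // Assumes maxAttempts >= 1.
    public static <T> T withRetry(Callable<T> txn, int maxAttempts) throws Exception {
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return txn.call();           // run the whole transaction body
            } catch (SQLException e) {
                if (!"40001".equals(e.getSQLState())) {
                    throw e;                 // a real error, not a retry case
                }
                last = e;                    // serialization failure: try again
            }
        }
        throw last;                          // retries exhausted
    }
}
```

With a real `java.sql.Connection`, the `Callable` body would open the transaction, perform the INSERTs, and commit, so that each retry replays the whole unit of work.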
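Karl's diagnostic of choice is a thread dump. Against a running MCF agents process this is typically taken externally with `jstack <pid>` (or `kill -3 <pid>` on Unix, which writes the dump to stdout). For illustration, the same information can be captured in-process with the JDK's `ThreadMXBean`; this is a generic JVM sketch, not MCF code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Prints every live thread's stack trace, comparable in content to
// what `jstack <pid>` produces externally.
public class ThreadDump {
    public static String capture() {
        StringBuilder sb = new StringBuilder();
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // lockedMonitors/lockedSynchronizers = true includes lock-ownership
        // info, which is what matters when diagnosing a stalled crawl.
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(capture());
    }
}
```

In a stall like the one described here, the interesting threads are the worker threads: if they are all parked waiting on database activity, that points back at the PostgreSQL serialization traffic.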
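The "Unknown robots.txt line" entry at the bottom of the thread flags a `Sitemap:` directive, an extension beyond the core robots.txt rules (User-agent/Disallow/Allow); per Karl, the message is informational and the recognized rules are still honored. A tolerant parser logs unknown directives and moves on, roughly as in this illustrative sketch (not MCF's actual parser; the `RobotsLines` name and directive set are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative robots.txt line handling: known directives are collected,
// unknown ones (like the "Sitemap:" extension) are warned about and
// skipped rather than aborting the parse.
public class RobotsLines {
    public static List<String> parse(String robotsTxt, List<String> warnings) {
        List<String> rules = new ArrayList<>();
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;  // blank line or comment
            }
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("user-agent:") || lower.startsWith("disallow:")
                    || lower.startsWith("allow:") || lower.startsWith("crawl-delay:")) {
                rules.add(line);  // directive this sketch enforces
            } else {
                warnings.add("Unknown robots.txt line: '" + line + "'");  // log and move on
            }
        }
        return rules;
    }
}
```

Under this reading, the logged error answers Mark's question: only the unrecognized `Sitemap:` line was skipped, and the rest of the file still governed the crawl.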
