We restarted ManifoldCF, so we'll have to reproduce before we can get you
more details.

I don't understand the hopcount thing. How do you know it's enabled, and
where is it set? We're running with pretty much default settings.

Thanks,

Mark


On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:

> Hi Mark,
>
> MCF retries those sorts of errors automatically.  It's possible there's a
> place we missed, but let's pursue other avenues first.
>
> One thing worth noting is that you have hop counting enabled, which is
> fine for small crawls but slows things down a lot (and can cause stalls
> when there are lots of records whose hopcount needs to be updated).  Do you
> truly need link counting?
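>
> To give a feel for why it gets expensive: with hop filters on, every
> document carries a "minimum hops from a seed" value, and a newly
> discovered link can force that value to be re-propagated through
> everything reachable below it.  MCF does this row by row against the
> database; the in-memory sketch below is only meant to show the shape of
> the work, not MCF's actual code:
>
>   import java.util.ArrayDeque;
>   import java.util.Deque;
>   import java.util.HashMap;
>   import java.util.HashSet;
>   import java.util.Map;
>   import java.util.Set;
>
>   // Toy hopcount bookkeeping: adding one link may ripple through many
>   // documents.
>   class HopCounts {
>     final Map<String, Set<String>> outLinks = new HashMap<>();
>     final Map<String, Integer> hops = new HashMap<>(); // doc -> min hops
>
>     void addSeed(String doc) { hops.put(doc, 0); }
>
>     void addLink(String parent, String child) {
>       outLinks.computeIfAbsent(parent, k -> new HashSet<>()).add(child);
>       Deque<String> queue = new ArrayDeque<>();
>       if (relax(parent, child)) queue.add(child);
>       while (!queue.isEmpty()) {               // can touch many documents
>         String doc = queue.poll();
>         for (String next : outLinks.getOrDefault(doc, Set.of()))
>           if (relax(doc, next)) queue.add(next);
>       }
>     }
>
>     // Lower "to"'s hop count if the path through "from" is shorter.
>     private boolean relax(String from, String to) {
>       Integer fromHops = hops.get(from);
>       if (fromHops == null) return false;      // parent not reachable yet
>       int candidate = fromHops + 1;
>       if (candidate < hops.getOrDefault(to, Integer.MAX_VALUE)) {
>         hops.put(to, candidate);
>         return true;
>       }
>       return false;
>     }
>   }
>
> Removing links is generally even worse, since a hop count can only go
> back up by re-walking the affected part of the graph.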
>
> The thread dump will tell us a lot, as will the simple history.  When was
> the last time something happened in the simple history?
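>
> To answer your earlier question about how to take one: on a standard
> JDK, running "jstack <pid>" against the JVM that's doing the crawling
> prints a thread dump, and "kill -QUIT <pid>" makes the JVM write one to
> its own stdout.  If it's easier to capture from code, here's a minimal
> sketch (nothing MCF-specific about it):
>
>   import java.lang.management.ManagementFactory;
>   import java.lang.management.ThreadInfo;
>
>   // Print a formatted stack trace for every live thread in this JVM.
>   class DumpThreads {
>     public static void main(String[] args) {
>       ThreadInfo[] threads =
>           ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
>       for (ThreadInfo info : threads)
>         System.out.print(info);       // one formatted stack per thread
>     }
>   }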
>
> Karl
>
>
>
> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>
>> More info... maybe we don't have Postgres configured correctly. Lots of
>> errors in the stdout log. For example:
>>
>> STATEMENT:  INSERT INTO intrinsiclink
>> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>> ERROR:  could not serialize access due to read/write dependencies among
>> transactions
>> DETAIL:  Reason code: Canceled on identification as a pivot, during
>> conflict in checking.
>> HINT:  The transaction might succeed if retried.
>>
>> and on other tables as well.
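>>
>> If I'm reading that HINT right, these are PostgreSQL serialization
>> failures (SQLSTATE 40001) from serializable isolation, and the fix is
>> supposed to be simply retrying the transaction.  A rough sketch of that,
>> assuming plain JDBC (I don't know what MCF actually does internally):
>>
>>   import java.sql.Connection;
>>   import java.sql.SQLException;
>>
>>   // Illustrative only: retry a transactional unit of work when
>>   // PostgreSQL reports a serialization failure (SQLSTATE 40001).
>>   class SerializationRetry {
>>     interface SQLWork { void run(Connection conn) throws SQLException; }
>>
>>     static void runWithRetry(Connection conn, SQLWork work)
>>         throws SQLException, InterruptedException {
>>       for (int attempt = 1; ; attempt++) {
>>         try {
>>           conn.setAutoCommit(false);
>>           work.run(conn);              // e.g. the INSERT INTO intrinsiclink
>>           conn.commit();
>>           return;
>>         } catch (SQLException e) {
>>           conn.rollback();
>>           if (!"40001".equals(e.getSQLState()) || attempt >= 5)
>>             throw e;                   // not retryable, or too many attempts
>>           Thread.sleep(100L * attempt); // brief backoff before retrying
>>         }
>>       }
>>     }
>>   }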
>>
>> Mark
>>
>>
>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>>
>>> Thanks Karl, we may take you up on the offer when/if we reproduce with
>>> just a single crawl. We were running many at once. Can you describe or
>>> point me at instructions for the thread dump you'd like to see?
>>>
>>> We're using 1.4.1.
>>>
>>> The simple history looks clean. All 200s and OKs, with a few broken
>>> pipes, but those documents all seem to have been successfully fetched later.
>>> No rejects.
>>>
>>> Thanks again,
>>>
>>> Mark
>>>
>>>
>>>
>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> The robots parse error is informational only and does not otherwise
>>>> affect crawling.  So you will need to look elsewhere for the issue.
>>>>
>>>> First question: what version of MCF are you using?  For a time, trunk
>>>> (and the release 1.5 branch) had exactly this problem whenever connections
>>>> were used that included certificates.
>>>>
>>>> I suggest that you rule out blocked sites by looking at the simple
>>>> history.  If you see a lot of rejections then maybe you are being blocked.
>>>> If, on the other hand, not much has happened at all for a while, that's not
>>>> the answer.
>>>>
>>>> The fastest way to start diagnosing this problem is to get a thread
>>>> dump.  I'd be happy to look at it and let you know what I find.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>>
>>>>> I kicked off a bunch of web crawls on Friday to run over the weekend.
>>>>> They all started fine but didn't finish. No errors in the logs I can find.
>>>>> All action seemed to stop after a couple of hours. It's configured as a
>>>>> complete crawl that runs every 24 hours.
>>>>>
>>>>> I don't expect you to have an answer to what went wrong with such
>>>>> limited information, but I did see a problem with robots.txt (at the
>>>>> bottom of this email).
>>>>>
>>>>> Does it mean robots.txt was not used at all for the crawl, or just
>>>>> that part was ignored? (I kind of expected this kind of error to kill the
>>>>> crawl, but maybe I just don't understand it.)
>>>>>
>>>>> If the crawl were ignoring the robots.txt, or a part of it, and the
>>>>> crawled site banned my crawler, what would I see in the MCF logs?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark
>>>>>
>>>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <
>>>>> http://www.somesite.gov/sitemapindex.xml>'
>>>>>
>>>>
>>>>
>>>
>>
>
