Hi Mark,

All you need to do to get a thread dump is to use top or a process monitor to get the process ID of the agents process (the ONLY process if you are using the single-process example). Then, use your JDK's "jstack" command to generate a thread dump:
$JAVA_HOME/bin/jstack <pid> > capture

Karl

On Mon, Feb 10, 2014 at 2:18 PM, Mark Libucha <[email protected]> wrote:
> Thanks Karl, we may take you up on the offer when/if we reproduce with
> just a single crawl. We were running many at once. Can you describe or
> point me at instructions for the thread dump you'd like to see?
>
> We're using 1.4.1.
>
> The simple history looks clean. All 200s and OKs, with a few broken pipes,
> but those documents all seem to have been successfully fetched later. No
> rejects.
>
> Thanks again,
>
> Mark
>
>
> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Mark,
>>
>> The robots parse error is informational only and does not otherwise
>> affect crawling. So you will need to look elsewhere for the issue.
>>
>> First question: what version of MCF are you using? For a time, trunk
>> (and the release 1.5 branch) had exactly this problem whenever
>> connections were used that included certificates.
>>
>> I suggest that you rule out blocked sites by looking at the simple
>> history. If you see a lot of rejections then maybe you are being blocked.
>> If, on the other hand, not much has happened at all for a while, that's
>> not the answer.
>>
>> The fastest way to start diagnosing this problem is to get a thread
>> dump. I'd be happy to look at it and let you know what I find.
>>
>> Karl
>>
>>
>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>
>>> I kicked off a bunch of web crawls on Friday to run over the weekend.
>>> They all started fine but didn't finish. No errors in the logs I can
>>> find. All action seemed to stop after a couple of hours. It's configured
>>> as a complete crawl that runs every 24 hours.
>>>
>>> I don't expect you to have an answer to what went wrong with such
>>> limited information, but I did see a problem with robots.txt (at the
>>> bottom of this email).
>>>
>>> Does it mean robots.txt was not used at all for the crawl, or just that
>>> part was ignored? (I kind of expected this kind of error to kill the
>>> crawl, but maybe I just don't understand it.)
>>>
>>> If the crawl were ignoring the robots.txt, or a part of it, and the
>>> crawled site banned my crawler, what would I see in the MCF logs?
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <http://www.somesite.gov/sitemapindex.xml>'
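P.S. If you're not sure which PID belongs to the agents process, here is a minimal sketch of the whole capture, assuming a standard JDK where jps and jstack both live under $JAVA_HOME/bin:

    # list running JVMs with their main class names; pick the MCF agents process
    $JAVA_HOME/bin/jps -l

    # write the full thread dump for that PID to a file
    $JAVA_HOME/bin/jstack <pid> > capture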
