Hi Mark,

All you need to do to get a thread dump is to use top or a process monitor to get the process ID of the agents process (the ONLY process if you are using the single-process example). Then, use your JDK's "jstack" command to generate a thread dump:
$JAVA_HOME/bin/jstack <pid> > capture

Karl

On Mon, Feb 10, 2014 at 2:18 PM, Mark Libucha <[email protected]> wrote:
> Thanks Karl, we may take you up on the offer when/if we reproduce with
> just a single crawl. We were running many at once. Can you describe or
> point me at instructions for the thread dump you'd like to see?
>
> We're using 1.4.1.
>
> The simple history looks clean. All 200s and OKs, with a few broken pipes,
> but those documents all seem to have been successfully fetched later. No
> rejects.
>
> Thanks again,
>
> Mark
>
>
> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Mark,
>>
>> The robots parse error is informational only and does not otherwise
>> affect crawling. So you will need to look elsewhere for the issue.
>>
>> First question: what version of MCF are you using? For a time, trunk
>> (and the release 1.5 branch) had exactly this problem whenever
>> connections were used that included certificates.
>>
>> I suggest that you rule out blocked sites by looking at the simple
>> history. If you see a lot of rejections then maybe you are being blocked.
>> If, on the other hand, not much has happened at all for a while, that's
>> not the answer.
>>
>> The fastest way to start diagnosing this problem is to get a thread
>> dump. I'd be happy to look at it and let you know what I find.
>>
>> Karl
>>
>>
>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>
>>> I kicked off a bunch of web crawls on Friday to run over the weekend.
>>> They all started fine but didn't finish. No errors in the logs I can
>>> find. All action seemed to stop after a couple of hours. It's configured
>>> as a complete crawl that runs every 24 hours.
>>>
>>> I don't expect you to have an answer to what went wrong with such
>>> limited information, but I did see a problem with robots.txt (at the
>>> bottom of this email).
>>>
>>> Does it mean robots.txt was not used at all for the crawl, or just that
>>> part was ignored? (I kind of expected this kind of error to kill the
>>> crawl, but maybe I just don't understand it.)
>>>
>>> If the crawl were ignoring the robots.txt, or a part of it, and the
>>> crawled site banned my crawler, what would I see in the MCF logs?
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <http://www.somesite.gov/sitemapindex.xml>'
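P.S. If you're not sure which PID belongs to the agents process, here is a minimal sketch of the whole capture, assuming a standard JDK where jps and jstack both live under $JAVA_HOME/bin:

    # list running JVMs with their main class names; pick the MCF agents process
    $JAVA_HOME/bin/jps -l

    # write the full thread dump for that PID to a file
    $JAVA_HOME/bin/jstack <pid> > capture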
