Hi Mark,

The robots parse error is informational only: the parser simply logs any robots.txt line it does not recognize and moves on, so it does not affect crawling. You will need to look elsewhere for the issue.
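For reference, the line it flagged is the standard Sitemap extension to robots.txt; a minimal robots.txt using it looks something like this (example host and path, not yours):

    User-agent: *
    Disallow: /private/
    Sitemap: http://www.example.com/sitemapindex.xml

Parsers that don't know the directive are supposed to ignore it, which is effectively what MCF is doing here, aside from logging the message.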
First question: what version of MCF are you using? For a time, trunk (and the release 1.5 branch) had exactly this problem whenever connections that included certificates were used.

I suggest that you rule out blocked sites by looking at the Simple History report. If you see a lot of rejections, then maybe you are being blocked. If, on the other hand, not much has happened at all for a while, that's not the answer.

The fastest way to start diagnosing this problem is to get a thread dump (see the P.S. below your quoted message for one way to capture it). I'd be happy to look at it and let you know what I find.

Karl

On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
> I kicked off a bunch of web crawls on Friday to run over the weekend. They
> all started fine but didn't finish. No errors in the logs I can find. All
> action seemed to stop after a couple of hours. It's configured as a
> complete crawl that runs every 24 hours.
>
> I don't expect you to have an answer to what went wrong with such limited
> information, but I did see a problem with robots.txt (at the bottom of this
> email).
>
> Does it mean robots.txt was not used at all for the crawl, or just that
> part was ignored? (I kind of expected this kind of error to kill the crawl,
> but maybe I just don't understand it.)
>
> If the crawl were ignoring the robots.txt, or a part of it, and the
> crawled site banned my crawler, what would I see in the MCF logs?
>
> Thanks,
>
> Mark
>
> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
> ERRORS 01 Unknown robots.txt line: 'Sitemap:
> <http://www.somesite.gov/sitemapindex.xml>'
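P.S. On capturing the thread dump: a minimal sketch, assuming MCF is running in a standard JVM and the JDK tools are on your path (the output file name here is just an example):

    jps -l                          # list running JVMs to find the PID of the MCF agents process
    jstack <pid> > mcf-threads.txt  # dump all thread stacks to a file

On Unix, kill -3 <pid> (SIGQUIT) also works; the JVM prints the dump to its stdout. Either form is enough to see where the worker threads are stuck.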
