parse-js is already turned on... This definitely creates an interesting problem as more and more sites use JavaScript to dynamically build their pages. I was hoping there was a quick and simple answer I had missed, but it sounds like there isn't one.
Thanks
Brad

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, October 21, 2010 11:01 AM
To: Alexander Aristov
Cc: [email protected]; brad
Subject: Re: http.agent and "unsupported browser"

Correct. You might try the parse-js plugin; it _attempts_ to retrieve URLs
from JavaScript, but it may not do the job.

> But this won't turn on JavaScript. If a site relies on it, crawling
> such sites won't give useful content.
>
> Best Regards
> Alexander Aristov
>
> On 21 October 2010 20:42, Markus Jelsma <[email protected]> wrote:
> > Well, you could set a fake user agent.
> >
> > > As I crawl more websites I'm finding I'm encountering more and more
> > > websites that reject the crawl by basically redirecting it to an
> > > HTML page that states something along the lines of:
> > >
> > > HTTP 602 Unsupported Browser
> > > The browser you are using (XYZ Spider/0.1 beta
> > > (xyz.com search engine; http://www.xyz.com))
> > >
> > > or
> > >
> > > Sorry, but you either have JavaScript turned off or a JavaScript
> > > incompatible browser
> > >
> > > Or
> > >
> > > Unsupported Browser
> > > Browser type and version: Generic crawler 0.1
> > > Browser build:
> > > Platform: Unknown
> > > Cookies supported: False
> > > Cookies enabled: Disabled
> > > JavaScript supported: False
> > > JavaScript enabled: False
> > > ActiveX enabled: False
> > > VBScript enabled: False
> > > Java applets supported: False
> > > Etc...
> > >
> > > Lots of different messages come back, but basically it is rejecting a
> > > crawl of the website because of browser incompatibility.
> > >
> > > Do I have Nutch configured incorrectly?
> > > Is there a way to crawl these sites?
> > > Recommendations?
> > >
> > > Thanks
> > > Brad
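For anyone landing on this thread later: both suggestions above (a fake user agent and the parse-js plugin) are set in Nutch's conf/nutch-site.xml, which overrides conf/nutch-default.xml. A minimal sketch — the property names come from nutch-default.xml, but the browser string and exact plugin list below are illustrative, not a recommendation:

```xml
<!-- conf/nutch-site.xml : local overrides of nutch-default.xml -->
<configuration>

  <!-- Present a browser-like agent string instead of a crawler name.
       Illustrative value; check a site's robots.txt and terms before
       spoofing. -->
  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0)</value>
  </property>

  <!-- Include parse-js so outlinks embedded in JavaScript source are
       extracted. Note: it only scans script text for URL-like strings;
       it does NOT execute JavaScript, so dynamically generated pages
       still won't render. Plugin list shown is an example only. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
  </property>

</configuration>
```

As Alexander points out above, this only changes what the server is told, not what Nutch can execute, so sites that genuinely require a JavaScript runtime will still serve the "unsupported browser" page.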

