Correct. You might try the parse-js plugin; it _attempts_ to extract URLs 
from JavaScript, but it may not do the job.
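For reference, both workarounds go in conf/nutch-site.xml. A minimal sketch, assuming a Nutch 1.x-era setup: the agent string is just an example to adapt, and the plugin list mirrors the old defaults with parse-js added, so check it against the nutch-default.xml shipped with your version:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Identity sent in the User-Agent header. Some sites reject
       unknown agents, so a browser-like string may get past the
       check (mind the site's terms when spoofing). -->
  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (compatible; MyCrawler/1.0)</value>
  </property>

  <!-- Add parse-js to the plugin list so link extraction is also
       attempted inside <script> blocks and javascript: hrefs. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Note that parse-js only pattern-matches URLs out of script source; as said below, it does not execute JavaScript.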

> But this won't turn on JavaScript. If a site relies on it, crawling
> won't give useful content.
> Best Regards
> Alexander Aristov
> 
> On 21 October 2010 20:42, Markus Jelsma <[email protected]> wrote:
> > Well, you could set a fake user agent.
> > 
> > > As I crawl more websites, I'm encountering more and more websites
> > > that reject the crawl by basically redirecting it to an HTML page
> > > that states something along the lines of:
> > > 
> > > HTTP 602 Unsupported Browser
> > > The browser you are using (XYZ Spider/0.1 beta
> > > (xyz.com search engine; http://www.xyz.com))
> > > 
> > > or
> > > 
> > > Sorry, but you either have JavaScript turned off or a
> > > JavaScript-incompatible browser
> > > 
> > > Or
> > > 
> > > Unsupported Browser
> > > Browser type and version Generic crawler 0.1
> > > Browser build Platform Unknown
> > > Cookies supported False
> > > Cookies enabled Disabled
> > > JavaScript supported False
> > > JavaScript enabled False
> > > ActiveX enabled False
> > > VBScript enabled False
> > > Java applets supported False
> > > Etc...
> > > 
> > > 
> > > Lots of different messages come back, but basically it is rejecting a
> > > crawl of the website because of browser incompatibility.
> > > 
> > > Do I have Nutch configured incorrectly?
> > > Is there a way to crawl these sites?
> > > Recommendations?
> > > 
> > > 
> > > Thanks
> > > Brad
