parse-js is already turned on...

This definitely creates an interesting problem, since more and more sites are
using JavaScript to build their pages dynamically.  I was hoping there was a
quick and simple answer that I had missed, but it sounds like there is not.

Thanks
Brad 

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, October 21, 2010 11:01 AM
To: Alexander Aristov
Cc: [email protected]; brad
Subject: Re: http.agent and "unsupported browser"

Correct. You might try the parse-js plugin. It _attempts_ to retrieve URLs
from JavaScript, but it may not do the job.
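
To illustrate why this kind of extraction only "attempts" the job, here is a minimal Python sketch of the general idea: scanning JavaScript source for string literals that look like URLs, without executing the script. This is not the parse-js plugin's code. The function name `extract_urls_from_js` and both regular expressions are assumptions made up for this example.

```python
import re

# Hypothetical helper, NOT Nutch's parse-js implementation: it scans
# JavaScript source text for quoted string literals and keeps the ones
# that look like absolute or root-relative URLs. Nothing is executed.
_STRING_LITERAL = re.compile(r"""(["'])((?:\\.|(?!\1).)*?)\1""")
_URL_LIKE = re.compile(r"^(?:https?://|/)[^\s\"'<>]+$")


def extract_urls_from_js(source):
    """Return URL-looking string literals in order of first appearance."""
    urls = []
    for match in _STRING_LITERAL.finditer(source):
        candidate = match.group(2)
        if _URL_LIKE.match(candidate) and candidate not in urls:
            urls.append(candidate)
    return urls


if __name__ == "__main__":
    sample = '''
    window.location = "http://www.example.com/products";
    var next = '/catalog/page2.html';
    var built = "/item/" + id;   // concatenation: only a fragment is visible
    var label = "Hello";
    '''
    for url in extract_urls_from_js(sample):
        print(url)
```

Because the script is never run, a URL assembled at runtime (the `"/item/" + id` line in the sample) is seen only as a fragment, and content that the page generates through JavaScript is not seen at all.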

> But this won't turn on JavaScript. If a site relies on it, crawling
> that site won't give useful content.
> Best Regards
> Alexander Aristov
> 
> On 21 October 2010 20:42, Markus Jelsma <[email protected]> wrote:
> > Well, you could set a fake user agent.
> > 
> > > As I crawl more websites, I'm finding more and more websites
> > > that reject the crawl by basically redirecting it to an HTML
> > > page that states something along the lines of:
> > > 
> > > HTTP 602 Unsupported Browser The browser you are using (XYZ
> > > Spider/0.1 beta (xyz.com search engine; http://www.xyz.com))
> > > 
> > > or
> > > 
> > >  Sorry, but you either have JavaScript turned off or a JavaScript
> > > incompatible browser
> > > 
> > > Or
> > > 
> > > Unsupported Browser
> > > Browser type and version: Generic crawler 0.1
> > > Browser build:
> > > Platform: Unknown
> > > Cookies supported: False
> > > Cookies enabled: Disabled
> > > JavaScript supported: False
> > > JavaScript enabled: False
> > > ActiveX enabled: False
> > > VBScript enabled: False
> > > Java applets supported: False
> > > Etc...
> > > 
> > > 
> > > Lots of different messages come back, but basically it is
> > > rejecting a crawl of the website because of browser
> > > incompatibility.
> > > 
> > > Do I have Nutch configured incorrectly?
> > > Is there a way to crawl these sites?
> > > Recommendations?
> > > 
> > > 
> > > Thanks
> > > Brad
