On Oct 21, 2010, at 4:02pm, brad wrote:

Interesting.  Has anyone implemented this within Nutch?  Performance?

I haven't done this with Nutch, but I have built web mining workflows using Bixo that used HtmlUnit to fetch pages.

It typically took 10x longer to get a page, from what I remember.

And you need to be extra careful when you crawl in this manner. The load you put on their servers is much higher, due to all of the extra fetching of content. And (as Reinhard mentioned) you can get some really angry webmasters if you wind up skewing their analytics - which includes Omniture and friends, not just Google Analytics.
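For reference, a minimal HtmlUnit fetch looks roughly like the sketch below. This is against a recent HtmlUnit 2.x API (class names and option setters may differ slightly across versions), and the URL is just a placeholder; it needs the HtmlUnit jar and its dependencies on the classpath:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsFetch {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        // Execute JavaScript, but don't abort on script errors --
        // real-world pages are full of them.
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setThrowExceptionOnScriptError(false);
        HtmlPage page = client.getPage("http://www.example.com/");
        // Give background scripts (e.g. AJAX calls) up to 5s to finish
        // before reading the DOM.
        client.waitForBackgroundJavaScript(5000);
        System.out.println(page.asXml());
    }
}
```

The extra script execution and the secondary fetches it triggers (scripts, XHR calls) are where the ~10x slowdown comes from.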

-- Ken

-----Original Message-----
From: reinhard schwab [mailto:[email protected]]
Sent: Thursday, October 21, 2010 1:29 PM
To: [email protected]
Subject: Re: http.agent and "unsupported browser"

parse-js uses some heuristics to extract URLs from JavaScript code.
if you want to execute JavaScript, take a look at HtmlUnit.
i have used HtmlUnit for processing HTML pages containing JavaScript.
of course, it needs more resources (memory and CPU).

if you use HtmlUnit, filter out Google Analytics scripts.
some servers may block you if you skew their analytics.
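One way to do that filtering is to intercept requests before they go out and answer analytics hosts with an empty body. A sketch using HtmlUnit's WebConnectionWrapper (the host list is illustrative; google-analytics.com and 2o7.net/Omniture are common trackers, but you'd extend it for your own crawl):

```java
import com.gargoylesoftware.htmlunit.StringWebResponse;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

public class AnalyticsFilter {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        // The wrapper installs itself as the client's connection;
        // requests to tracking hosts never leave the crawler.
        new WebConnectionWrapper(client) {
            @Override
            public WebResponse getResponse(WebRequest request) throws java.io.IOException {
                String host = request.getUrl().getHost();
                if (host.contains("google-analytics.com")
                        || host.contains("2o7.net")) { // Omniture
                    return new StringWebResponse("", request.getUrl());
                }
                return super.getResponse(request);
            }
        };
        // ... fetch pages with `client` as usual ...
    }
}
```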

brad wrote:
parse-js is already turned on...

This definitely creates an interesting problem as more and more sites
are using JavaScript to dynamically create web pages.  I was hoping
there was a quick and simple answer that I missed, but it sounds like
there isn't one.

Thanks
Brad

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, October 21, 2010 11:01 AM
To: Alexander Aristov
Cc: [email protected]; brad
Subject: Re: http.agent and "unsupported browser"

Correct. You might try the parse-js plugin; it _attempts_ to retrieve
URLs from JavaScript, but it may not do the job.


But this won't turn on JavaScript. If a site relies on it, crawling
such a site won't give useful content.
Best Regards
Alexander Aristov

On 21 October 2010 20:42, Markus Jelsma <[email protected]> wrote:

Well, you could set a fake user agent.
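If you do go the fake-user-agent route (with the etiquette caveats mentioned elsewhere in this thread), the agent string Nutch sends is assembled from the http.agent.* properties, which you can override in conf/nutch-site.xml. A sketch, with placeholder crawler name and URL:

```xml
<!-- conf/nutch-site.xml (overrides nutch-default.xml) -->
<property>
  <name>http.agent.name</name>
  <!-- Placeholder value: a browser-like string instead of a bot name -->
  <value>Mozilla/5.0 (compatible; MyCrawler/1.0; +http://example.com/bot)</value>
</property>
```

Note that Nutch appends http.agent.version, http.agent.description, etc. to this, so you may want to blank those properties too if the goal is a clean browser-looking string.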


As I crawl more websites, I'm finding I'm encountering more and more
websites that reject the crawl by basically redirecting it to an HTML
page that states something along the lines of:

HTTP 602 Unsupported Browser
The browser you are using (XYZ Spider/0.1 beta
(xyz.com search engine; http://www.xyz.com))

or

Sorry, but you either have JavaScript turned off or a
JavaScript-incompatible browser

Or

Unsupported Browser
  Browser type and version: Generic crawler 0.1
  Browser build:
  Platform: Unknown
  Cookies supported: False
  Cookies enabled: Disabled
  JavaScript supported: False
  JavaScript enabled: False
  ActiveX enabled: False
  VBScript enabled: False
  Java applets supported: False
  Etc...


Lots of different messages come back, but basically it is rejecting a
crawl of the website because of browser incompatibility.

Do I have Nutch configured incorrectly?
Is there a way to crawl these sites?
Recommendations?


Thanks
Brad







--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




