On Oct 21, 2010, at 4:02pm, brad wrote:

Interesting.  Has anyone implemented this within Nutch?  Performance?

I haven't done this with Nutch, but I have built web mining workflows using Bixo that used HtmlUnit to fetch pages.

It typically took 10x longer to get a page, from what I remember.

And you need to be extra careful when you crawl in this manner. The load you put on their servers is much higher, due to all of the extra fetching of content. And (as Reinhard mentioned) you can get some really angry webmasters if you wind up skewing their analytics - which includes Omniture and friends, not just Google Analytics.
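For reference, a minimal HtmlUnit fetch looks roughly like the sketch below. This is against a recent HtmlUnit 2.x API (class names and option setters may differ slightly across versions), and the URL is just a placeholder; it needs the HtmlUnit jar and its dependencies on the classpath:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsFetch {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        // Execute JavaScript, but don't abort on script errors --
        // real-world pages are full of them.
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setThrowExceptionOnScriptError(false);
        HtmlPage page = client.getPage("http://www.example.com/");
        // Give background scripts (e.g. AJAX calls) up to 5s to finish
        // before reading the DOM.
        client.waitForBackgroundJavaScript(5000);
        System.out.println(page.asXml());
    }
}
```

The extra script execution and the secondary fetches it triggers (scripts, XHR calls) are where the ~10x slowdown comes from.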

-- Ken

-----Original Message-----
From: reinhard schwab [mailto:[email protected]]
Sent: Thursday, October 21, 2010 1:29 PM
To: [email protected]
Subject: Re: http.agent and "unsupported browser"

parse-js uses some heuristics to extract URLs from JavaScript code.
if you want to execute JavaScript, take a look at HtmlUnit.
i have used HtmlUnit for processing HTML pages containing JavaScript.
of course, it needs more resources (memory and CPU).

if you use HtmlUnit, filter out Google Analytics scripts.
some servers may block you if you skew their analytics.
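One way to do that filtering is to intercept requests before they go out and answer analytics hosts with an empty body. A sketch using HtmlUnit's WebConnectionWrapper (the host list is illustrative; google-analytics.com and 2o7.net/Omniture are common trackers, but you'd extend it for your own crawl):

```java
import com.gargoylesoftware.htmlunit.StringWebResponse;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

public class AnalyticsFilter {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        // The wrapper installs itself as the client's connection;
        // requests to tracking hosts never leave the crawler.
        new WebConnectionWrapper(client) {
            @Override
            public WebResponse getResponse(WebRequest request) throws java.io.IOException {
                String host = request.getUrl().getHost();
                if (host.contains("google-analytics.com")
                        || host.contains("2o7.net")) { // Omniture
                    return new StringWebResponse("", request.getUrl());
                }
                return super.getResponse(request);
            }
        };
        // ... fetch pages with `client` as usual ...
    }
}
```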

brad wrote:
parse-js is already turned on...

This definitely creates an interesting problem as more and more sites
are using JavaScript to dynamically create web pages.  I was hoping
there was a quick and simple answer that I missed, but it sounds like
there isn't one.

Thanks
Brad

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, October 21, 2010 11:01 AM
To: Alexander Aristov
Cc: [email protected]; brad
Subject: Re: http.agent and "unsupported browser"

Correct. You might try the parse-js plugin; it _attempts_ to retrieve
URLs from JavaScript, but it may not do the job.


But this won't turn on JavaScript. If a site relies on it, crawling
such a site won't give useful content.
Best Regards
Alexander Aristov

On 21 October 2010 20:42, Markus Jelsma <[email protected]> wrote:

Well, you could set a fake user agent.
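If you do go the fake-user-agent route (with the etiquette caveats mentioned elsewhere in this thread), the agent string Nutch sends is assembled from the http.agent.* properties, which you can override in conf/nutch-site.xml. A sketch, with placeholder crawler name and URL:

```xml
<!-- conf/nutch-site.xml (overrides nutch-default.xml) -->
<property>
  <name>http.agent.name</name>
  <!-- Placeholder value: a browser-like string instead of a bot name -->
  <value>Mozilla/5.0 (compatible; MyCrawler/1.0; +http://example.com/bot)</value>
</property>
```

Note that Nutch appends http.agent.version, http.agent.description, etc. to this, so you may want to blank those properties too if the goal is a clean browser-looking string.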


As I crawl more websites, I'm finding I'm encountering more and more
websites that reject the crawl by basically redirecting it to an HTML
page that states something along the lines of:

HTTP 602 Unsupported Browser
The browser you are using (XYZ Spider/0.1 beta
(xyz.com search engine; http://www.xyz.com))

or

Sorry, but you either have JavaScript turned off or a
JavaScript-incompatible browser

Or

Unsupported Browser
  Browser type and version: Generic crawler 0.1
  Browser build:
  Platform: Unknown
  Cookies supported: False
  Cookies enabled: Disabled
  JavaScript supported: False
  JavaScript enabled: False
  ActiveX enabled: False
  VBScript enabled: False
  Java applets supported: False
  Etc...


Lots of different messages come back, but basically it is rejecting a
crawl of the website because of browser incompatibility.

Do I have Nutch configured incorrectly?
Is there a way to crawl these sites?
Recommendations?


Thanks
Brad







--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




