Thanks. 

-----Original Message-----
From: Ken Krugler [mailto:[email protected]] 
Sent: Thursday, October 21, 2010 4:23 PM
To: [email protected]
Subject: Re: http.agent and "unsupported browser"


On Oct 21, 2010, at 4:02pm, brad wrote:

> Interesting.  Has anyone implemented this within Nutch?  Performance?

I haven't done this with Nutch, but I have built web mining workflows using
Bixo that used HtmlUnit to fetch pages.

It typically took 10x longer to get a page, from what I remember.
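For reference, the kind of HtmlUnit fetch I mean looks roughly like this (a sketch against the HtmlUnit 2.x API; it depends on the external htmlunit jar, and the class/URL names are illustrative):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsFetch {
    // Sketch: fetch a page with JavaScript executed, then hand the
    // rendered DOM to your parser. Much slower than a plain HTTP fetch.
    public static String fetchRendered(String url) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            // getPage runs the page's JavaScript before returning
            HtmlPage page = webClient.getPage(url);
            return page.asXml();  // the post-JavaScript DOM
        }
    }
}
```

Every script on the page gets executed during that fetch, which is where most of the 10x slowdown comes from.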

And you need to be extra careful when you crawl in this manner. The load you
put on their servers is much higher, due to all of the extra fetching of
content. And (as Reinhard mentioned) you can get some really angry
webmasters if you wind up skewing their analytics - which includes Omniture
and friends, not just Google Analytics.
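If you do go the browser-emulation route, one way to avoid hitting analytics endpoints is to filter their URLs before fetching. A minimal sketch in plain Java (the host list is illustrative, not exhaustive):

```java
import java.util.Set;

// Sketch: decide whether a script/beacon URL looks like a web-analytics
// endpoint that a polite crawler should skip. Host list is illustrative.
public class AnalyticsFilter {
    private static final Set<String> ANALYTICS_HOSTS = Set.of(
        "www.google-analytics.com",
        "ssl.google-analytics.com",
        "stats.g.doubleclick.net",
        "sc.omtrdc.net"            // Adobe/Omniture collection domain
    );

    public static boolean isAnalyticsUrl(String url) {
        try {
            String host = new java.net.URI(url).getHost();
            if (host == null) return false;
            host = host.toLowerCase();
            for (String h : ANALYTICS_HOSTS) {
                if (host.equals(h) || host.endsWith("." + h)) return true;
            }
            return false;
        } catch (java.net.URISyntaxException e) {
            return false;  // unparseable URLs are not treated as analytics
        }
    }
}
```

Hooking something like this in before the fetch keeps your crawler out of the site's visitor counts.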

-- Ken

> -----Original Message-----
> From: reinhard schwab [mailto:[email protected]]
> Sent: Thursday, October 21, 2010 1:29 PM
> To: [email protected]
> Subject: Re: http.agent and "unsupported browser"
>
> parse-js uses some heuristics to extract urls from javascript code.
> if you want to execute javascript, take a look at htmlunit.
> i have used htmlunit for processing html pages containing javascript.
> of course, it needs more resources, such as memory and cpu.
>
> if you use htmlunit, filter out google analytics scripts.
> some servers may block you if you skew their analytics.
>
> brad schrieb:
>> parse-js is already turned on...
>>
>> This definitely creates an interesting problem as more and more sites
>> are using javascript to dynamically create web pages.  I was hoping
>> there was a quick and simple answer that I missed, but it sounds like I
>> did not.
>>
>> Thanks
>> Brad
>>
>> -----Original Message-----
>> From: Markus Jelsma [mailto:[email protected]]
>> Sent: Thursday, October 21, 2010 11:01 AM
>> To: Alexander Aristov
>> Cc: [email protected]; brad
>> Subject: Re: http.agent and "unsupported browser"
>>
>> Correct. You might try the parse-js plugin; it _attempts_ to retrieve
>> URLs from JavaScript, but it may not do the job.
>>
>>
>>> But this won't turn on JavaScript. If a site relies on it, crawling
>>> it won't give useful content.
>>> Best Regards
>>> Alexander Aristov
>>>
>>> On 21 October 2010 20:42, Markus Jelsma <[email protected]>
>>>
>> wrote:
>>
>>>> Well, you could set a fake user agent.
>>>>
>>>>
>>>>> As I crawl more websites I'm finding I'm encountering more and more
>>>>> websites that reject the crawl by basically redirecting it to an HTML
>>>>> page that states something along the lines of:
>>>>>
>>>>> HTTP 602 Unsupported Browser
>>>>> The browser you are using (XYZ Spider/0.1 beta
>>>>> (xyz.com search engine; http://www.xyz.com))
>>>>>
>>>>> or
>>>>>
>>>>> Sorry, but you either have JavaScript turned off or a JavaScript
>>>>>
>>>>> incompatible browser
>>>>>
>>>>> Or
>>>>>
>>>>> Unsupported Browser
>>>>> Browser type and version: Generic crawler 0.1
>>>>> Browser build:
>>>>> Platform: Unknown
>>>>> Cookies supported: False
>>>>> Cookies enabled: Disabled
>>>>> JavaScript supported: False
>>>>> JavaScript enabled: False
>>>>> ActiveX enabled: False
>>>>> VBScript enabled: False
>>>>> Java applets supported: False
>>>>> Etc...
>>>>>
>>>>>
>>>>> Lots of different messages come back, but basically it is rejecting
>>>>> a crawl of the website because of browser incompatibility.
>>>>>
>>>>> Do I have Nutch configured incorrectly?
>>>>> Is there a way to crawl these sites?
>>>>> Recommendations?
>>>>>
>>>>>
>>>>> Thanks
>>>>> Brad
>>>>>
>>
>>
>>
>
>
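As an aside, the fake user agent Markus suggests is controlled by Nutch's http.agent.* properties in conf/nutch-site.xml; a sketch with illustrative values (spoofing a browser UA can violate a site's terms, so use judgment):

```xml
<!-- conf/nutch-site.xml: illustrative values only -->
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (compatible; MyCrawler/1.0)</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://example.com/crawler</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>[email protected]</value>
</property>
```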

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
