Hi Meraj,

Can you provide an example URL and explain exactly what you're after? If the 
page you're trying to fetch relies heavily on JavaScript/AJAX, keep in mind 
that browsers do a lot of work on the downloaded page. When you open a page, 
the browser downloads the HTML, fetches the referenced CSS files and applies 
them (along with any inline styles), and downloads and executes any referenced 
JavaScript (plus inline script tags) on top of the loaded DOM. The same applies 
to fonts and other resources. The browser "knows" how to deal with all of these 
resources, and how the CSS is applied can even vary from browser to browser.

The Nutch crawler only knows about the downloaded HTML (similar to what you see 
when you view the source of a web page). It doesn't know what a CSS style is; 
it is basically only interested in the links and the textual/binary content of 
the page. So when Nutch fetches a page, the HTML is downloaded, but the other 
resources (fonts, styles, JavaScript) are not applied to it.
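
To illustrate the point above, here is a minimal Python sketch (plain standard 
library, not Nutch code) of what a fetch-only crawler can see in a page. The 
sample HTML, the RawHtmlView class and the file names in it are all made up for 
this example; the idea is only that scripts are never executed, so anything 
they would add to the DOM is missing.

```python
from html.parser import HTMLParser

# Hypothetical page: the product list is only filled in by JavaScript.
SAMPLE_HTML = """
<html>
  <head>
    <link rel="stylesheet" href="style.css">
    <script src="app.js"></script>
  </head>
  <body>
    <h1>Products</h1>
    <a href="/about.html">About</a>
    <div id="products"></div>
    <script>
      document.getElementById("products").innerHTML =
        '<a href="/item/1.html">Item 1</a>';
    </script>
  </body>
</html>
"""


class RawHtmlView(HTMLParser):
    """Collects links and visible text from the static markup only."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script and style bodies are never executed or applied, just skipped.
        if self._skip_depth == 0 and data.strip():
            self.text.append(data.strip())


view = RawHtmlView()
view.feed(SAMPLE_HTML)
print(view.links)  # ['/about.html']
print(view.text)   # ['Products', 'About']
```

The link to /item/1.html never shows up, because it only exists after a browser 
runs the inline script.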

Tweaking the http.agent.name property in nutch-site.xml will only help with 
sites that change their response based on the user agent (for example, one 
version for mobile and a different one for desktop browsers). That approach is 
being replaced by responsive design, where the user agent no longer determines 
how the page is rendered. 
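
As a rough illustration (again plain Python, not Nutch code), the user agent is 
just one header on the HTTP request. The sketch below builds a request with the 
Chrome string from your message without sending it; build_request and the 
example.com URL are placeholders made up for this example.

```python
from urllib.request import Request

# User agent string quoted in the original question.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
)


def build_request(url, user_agent):
    """Build (but do not send) a GET request with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder URL, for illustration only.
req = build_request("http://example.com/", CHROME_UA)
print(req.get_method())              # GET
print(req.get_header("User-agent"))  # the Chrome string above
```

Changing this header can change which HTML the server chooses to send, but the 
client still receives raw HTML; no JavaScript is run either way.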

A plugin that could address this has been merged into the current trunk for the 
upcoming 1.10 release. It uses Selenium to render the page and then feeds Nutch 
the resulting HTML, so the effects of AJAX/JavaScript interactions will be 
present in the content that Nutch parses in the next stage. 

We would still need more information about your use case, or what you're 
trying to accomplish.

Hope it helps,

Regards,

----- Original Message -----
From: "Meraj A. Khan" <[email protected]>
To: [email protected]
Sent: Friday, February 27, 2015 12:47:06 AM
Subject: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

In some instances the content that is downloaded in the Fetch phase from an
HTTP URL is not what you would get if you were to access the URL from a
well-known browser such as Google Chrome. That is because the server is
expecting a user agent value that represents a browser.

There is an http.agent.name property in nutch-site.xml. Is it the property
that should be used to set the user agent, so that the server responds to a
Nutch GET request the same way it would to a request from a browser? Or is
there another configurable property?

For example the user agent value for a Chrome browser is below.

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/41.0.2228.0 Safari/537.36


Thanks.
