Thanks Jorge, I appreciate your help.

On Sun, Mar 1, 2015 at 10:05 PM, Jorge Luis Betancourt González <[email protected]> wrote:
> The general answer is: it depends. It is usually "polite" to present your
> robot to the website so the webmaster knows what is accessing the site; this
> is why Google and a lot of other search engines (big and small) use a
> distinctive name for their crawlers/bots. That being said, the first site
> that you mention works fine in a quick parsechecker run that I executed:
>
> ➜ local bin/nutch parsechecker http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
> fetching: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
> parsing: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
> contentType: text/html
> signature: 8e90c6d581f27c36828d433f746e4d7a
> ---------
> Url
> ---------------
>
> http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
> ---------
> ParseData
> ---------
>
> Version: 5
> Status: success(1,0)
> Title: "Dressing for the Dark"
> Outlinks: 151
> outlink: toUrl: http://www.neimanmarcus.com/cssbundle/1468949595/bundles/product_rwd.css anchor:
> outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rBrand.css anchor:
> outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rProduct.css anchor:
> outlink: toUrl: http://www.neimanmarcus.com/jsbundle/2144966094/bundles/general_rwd.js anchor:
> ...
>
> (trimmed due to length)
>
> As for the second one, I wasn't able to run a test; the provided URL blocks
> access from my IP/country:
>
> This request is blocked by the SonicWALL Gateway Geo IP Service.
> Country Name:Cuba.
>
> Reading your experience with this website, it looks like an error in the
> website's programming. I'm assuming the logic is: if your User-Agent is not
> X, Y or Z, then serve the mobile version. This could be worth reporting.
>
> Trying to fool the website into thinking that your bot is a regular user by
> tweaking the user agent could work for now, but it could draw the
> webmaster's attention and could be a cause for blocking your access; this
> depends a lot on the webmaster :). But for your particular case it could be
> your only solution, if the webmaster doesn't have a problem with the
> increase in traffic.
>
> Regards,
>
> ----- Original Message -----
> From: "Meraj A. Khan" <[email protected]>
> To: [email protected]
> Sent: Saturday, February 28, 2015 12:09:47 AM
> Subject: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?
>
> Hi Jorge,
>
> Yes, I was exploring changing the http.agent.name property value in
> cases where the sites either serve the mobile version or outright deny
> the request if no agent is specified.
>
> For example, the following URL will give a "Request Rejected" response if
> the User-Agent is not specified.
>
> http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
>
> And the following URL will serve a mobile version.
>
> http://www.techforless.com/cgi-bin/tech4less/60PN5000.
>
> So is it good practice to set http.agent.name to something
> like the below, to mimic a Chrome browser?
>
> Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36
>
> On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González <[email protected]> wrote:
>> Hi Meraj,
>>
>> Can you provide an example URL and explain exactly what you're after? If
>> the page you're trying to fetch has a lot of javascript/ajax, keep in mind
>> that browsers do a lot of work on the downloaded page. For instance, when
>> you enter a page, the HTML is downloaded, and the referenced CSS files are
>> also fetched and applied to the HTML (also inline styles, etc.). If any
>> javascript is referenced, it is also downloaded and executed on top of the
>> loaded DOM (also inline script tags). The same applies to fonts, etc. The
>> browser "knows" how to deal with all these resources, and the CSS is
>> applied depending on which browser you're using.
>>
>> The Nutch crawler only knows about the downloaded HTML (similar to what
>> you see when you view the source code of an HTML webpage); it doesn't know
>> what a CSS style is. Basically, the crawler is only interested in the
>> links and the textual/binary content of the webpage, so when a page is
>> fetched by Nutch, the HTML is downloaded but the other resources (fonts,
>> styles, javascript) are not applied to the fetched page.
>>
>> Tweaking the http.agent.name property in nutch-site.xml will only help
>> with those sites that change their response based on the user agent (one
>> version for mobile and a different one for desktop browsers). This
>> approach is being replaced by responsive design, meaning that the user
>> agent is not important for how the page is rendered.
>>
>> In the current trunk of the upcoming 1.10 version, a plugin has been
>> merged that could address this. Basically, this plugin uses selenium to
>> render the page and then feeds Nutch with the resulting HTML, meaning that
>> ajax/javascript interactions will be present in the content that Nutch
>> will parse in the next stage.
>>
>> Also, we need more information about your use case or what you're trying
>> to accomplish.
>>
>> Hope it helps,
>>
>> Regards,
>>
>> ----- Original Message -----
>> From: "Meraj A. Khan" <[email protected]>
>> To: [email protected]
>> Sent: Friday, February 27, 2015 12:47:06 AM
>> Subject: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?
>>
>> In some instances the content that is downloaded in the Fetch phase from
>> an HTTP URL is not what you would get if you were to make the request
>> from a well-known browser like Google Chrome, because the server is
>> expecting a user agent value that represents a browser.
>>
>> There is an http.agent.name property in nutch-site.xml. Is it the same
>> property that should be used to set the user agent, to make the server
>> respond to a Nutch GET request the same way as it would to a request
>> from a browser? Or is there another configurable property?
>>
>> For example, the user agent value for a Chrome browser is below.
>>
>> Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36
>>
>>
>> Thanks.
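
One way to check, outside of Nutch, whether a site really varies its response by User-Agent is to request the same URL with and without a browser-like header and compare the results. The following is only a minimal sketch using the Python standard library; it does not involve Nutch at all, and the helper names `build_request` and `fetch_summary` are hypothetical, introduced here for illustration. The agent string is the Chrome value quoted in the thread.

```python
import urllib.request

# Chrome user-agent string quoted in the thread.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/41.0.2228.0 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a GET request, optionally carrying an explicit User-Agent header.

    Hypothetical helper. When user_agent is None no header is set on the
    request itself, but urllib still sends its own default agent
    ("Python-urllib/x.y") at open time, so this is not a truly agentless
    request.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_summary(url, user_agent=None, timeout=10):
    """Fetch url and return (HTTP status, body length in bytes).

    Hypothetical helper. urlopen raises urllib.error.HTTPError for 4xx/5xx
    responses, which is itself a useful signal that the request was rejected.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, len(response.read())
```

Calling `fetch_summary(url)` and `fetch_summary(url, CHROME_UA)` on the same URL and comparing status and body size gives a rough indication of whether the server treats the two agents differently. It says nothing about whether the site's webmaster is happy with a crawler presenting itself as a browser.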

