Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?
Jorge, I think I spoke too soon. If I use the protocol-httpclient plugin, I am unable to fetch any page with parsechecker; I get a "[Fatal Error] :1:1: Content is not allowed in prolog." error. Are there any known issues with protocol-httpclient? I am using Nutch 1.7 and have the following settings in my nutch-site.xml:

<!-- Added based on the suggestion from the nutch mailing list -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>http.useHttp11</name>
  <value>true</value>
  <description>NOTE: at the moment this works only for protocol-httpclient. If true, use HTTP 1.1; if false, use HTTP 1.0.</description>
</property>

Thanks.

On Sun, Mar 1, 2015 at 10:05 PM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote:

The general answer is: it depends. It is usually polite to identify your robot to the website so the webmaster knows what is accessing the site; this is why Google and many other search engines (big and small) use a distinctive name for their crawlers/bots.
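A side note on the "[Fatal Error] :1:1: Content is not allowed in prolog." that Meraj reports above: this message comes from the Java XML parser, and it means the parser encountered bytes that are not XML before the <?xml ...?> declaration, commonly a stray character or byte-order mark at the top of a config file such as nutch-site.xml, or non-XML content (e.g. an HTML error page) being handed to an XML parser. A minimal standalone sketch reproducing the message (the class name is invented for illustration; this is not Nutch code):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

public class PrologDemo {
    public static void main(String[] args) throws Exception {
        // Any byte before the XML declaration (here a stray 'x', but a UTF-8
        // BOM or an HTML error page has the same effect) makes the parser
        // fail at line 1 with "Content is not allowed in prolog."
        byte[] bad = "x<?xml version=\"1.0\"?><configuration/>".getBytes("UTF-8");
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bad));
        } catch (SAXParseException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the error appears for every URL, checking the XML files in Nutch's conf/ directory for junk before the first tag is a reasonable first step.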
That being said, the first site that you mention works fine in a quick parsechecker run that I've executed:

➜ local bin/nutch parsechecker http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
fetching: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
parsing: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
contentType: text/html
signature: 8e90c6d581f27c36828d433f746e4d7a
---------
Url
---------
http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Dressing for the Dark
Outlinks: 151
  outlink: toUrl: http://www.neimanmarcus.com/cssbundle/1468949595/bundles/product_rwd.css anchor:
  outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rBrand.css anchor:
  outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rProduct.css anchor:
  outlink: toUrl: http://www.neimanmarcus.com/jsbundle/2144966094/bundles/general_rwd.js anchor:
  ... (trimmed due to length)

As for the second one, I wasn't able to run a test; the provided URL blocks access from my IP/country:

This request is blocked by the SonicWALL Gateway Geo IP Service. Country Name: Cuba.

Reading about your experience with this website, it looks like an error in the website's programming: basically, I'm assuming their logic is "if the User-Agent is not X, Y or Z, then serve the mobile version", which could be worth reporting.

Trying to fool the website into thinking your bot is a regular user by tweaking the user agent could work for now, but it could draw the webmaster's attention and become a reason for blocking your access; this depends a lot on the webmaster :). But in your particular case it could be your only solution, provided the webmaster doesn't have a problem with the increase in traffic.

Regards,

----- Original Message -----
From: Meraj A. Khan mera...@gmail.com
To: user@nutch.apache.org
Sent: Saturday, February 28, 2015 12:09:47 AM
Subject: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

Hi Jorge,

Yes, I was exploring changing the http.agent.name property value for cases where sites either serve the mobile version or outright deny the request if no agent is specified. For example, the following URL will give a "Request Rejected" response if the User-Agent is not specified:

http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod

And the following URL will serve a mobile version:

http://www.techforless.com/cgi-bin/tech4less/60PN5000

So is it a good practice to set http.agent.name to something like the below, to mimic a Chrome browser?

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36

On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote:

Hi Meraj,

Can you provide an example URL and explain exactly what you're after? If the page you're trying to fetch has a lot of JavaScript/AJAX, keep in mind that browsers do a lot of work on the downloaded page: when you visit a page, the HTML is downloaded, the referenced CSS files are also fetched and applied to the HTML (along with inline styles, etc.), and any referenced JavaScript is downloaded and executed on top of the loaded DOM (including inline script tags). The same applies to fonts, etc. The browser knows how to deal with all these resources, and the CSS is applied depending on which browser you're using. The Nutch crawler only knows about the downloaded HTML.
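The behavior discussed in this exchange (a "Request Rejected" response with no User-Agent, a mobile page for unrecognized agents) is server-side user-agent sniffing. A hypothetical sketch of that kind of logic, with names and matching rules invented for illustration (the actual sites' rules are unknown):

```java
public class AgentSniff {
    // Hypothetical server-side dispatch of the kind described above: a
    // missing User-Agent is rejected outright, an agent not on the
    // known-desktop list gets the mobile page.
    static String respond(String userAgent) {
        if (userAgent == null || userAgent.isEmpty()) {
            return "Request Rejected";
        }
        if (userAgent.contains("Chrome") || userAgent.contains("Firefox")) {
            return "desktop page";
        }
        return "mobile page";
    }

    public static void main(String[] args) {
        System.out.println(respond(null));
        System.out.println(respond("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"));
        System.out.println(respond("MyNutchCrawler/1.7"));
    }
}
```

This is why setting http.agent.name to a browser-like string changes what such a site serves to Nutch: the crawler simply falls into a different branch of the server's dispatch.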
Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?
Thanks Jorge, I appreciate your help.

On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote:

[...] The browser knows how to deal with all these resources, and the CSS is applied depending on which browser you're using. The Nutch crawler only knows about the downloaded HTML (similar to what you see when you view the source of a webpage); it doesn't know what a CSS style is. Basically, the crawler is only interested in the links and the textual/binary content of the webpage, so when a page is fetched by Nutch the HTML is downloaded, but the other resources (fonts, styles, JavaScript) are not applied to the fetched page. Tweaking the http.agent.name property in nutch-site.xml will only help with sites that change their response based on the user agent (one response for mobile browsers and a different one for desktop browsers). This approach is being replaced by responsive design, meaning that the user agent no longer determines how the page is rendered. In the current trunk of the upcoming 1.10 version a plugin has been merged that could
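Jorge's point that the crawler sees only the downloaded HTML, never the DOM a browser would build after running scripts, can be illustrated with a toy link extractor. This is a deliberate simplification (Nutch uses a real HTML parser, not a regex; the class and markup below are invented for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawHtmlOutlinks {
    // A crawler works on the raw bytes it fetched: the static href below is
    // found, but the link the script would create at runtime never exists
    // for the crawler, because the JavaScript is not executed.
    static List<String> outlinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href=\"http://example.com/static\">static link</a>"
                + "<script>document.write('<a href=' + '\"/dynamic\">x</a>')</script>"
                + "</body></html>";
        System.out.println(outlinks(html));
    }
}
```

Only the static outlink is reported; the "/dynamic" link would appear only in a browser that executes the script, which is exactly the gap the JavaScript-rendering plugin mentioned above is meant to close.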