Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

2015-03-02 Thread Meraj A. Khan
Jorge,

I think I spoke too soon. If I use the protocol-httpclient plugin, I
am unable to fetch any page using the parsechecker.

I get a "[Fatal Error] :1:1: Content is not allowed in prolog." error.

Are there any known issues with using protocol-httpclient? I am using
Nutch 1.7, and I have the following settings in my nutch-site.xml:

<!-- Added based on the suggestion from nutch mailing list -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>http.useHttp11</name>
  <value>true</value>
  <description>NOTE: at the moment this works only for
  protocol-httpclient.
  If true, use HTTP 1.1, if false use HTTP 1.0 .
  </description>
</property>


Thanks.

On Sun, Mar 1, 2015 at 10:05 PM, Jorge Luis Betancourt González
jlbetanco...@uci.cu wrote:
 The general answer is: it depends. It is usually polite to identify your robot 
 to the website so the webmaster knows what is accessing the site; this is why 
 Google and a lot of other search engines (big and small) use a distinctive 
 name for their crawlers/bots. That being said, the first site that you 
 mention works fine in a quick parsechecker run that I've executed:

 ➜  local  bin/nutch parsechecker http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 fetching: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 parsing: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 contentType: text/html
 signature: 8e90c6d581f27c36828d433f746e4d7a
 -
 Url
 ---

 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
 -
 ParseData
 -

 Version: 5
 Status: success(1,0)
 Title: Dressing for the Dark
 Outlinks: 151
   outlink: toUrl: http://www.neimanmarcus.com/cssbundle/1468949595/bundles/product_rwd.css anchor:
   outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rBrand.css anchor:
   outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rProduct.css anchor:
   outlink: toUrl: http://www.neimanmarcus.com/jsbundle/2144966094/bundles/general_rwd.js anchor:
 ...

 (trimmed due to length)

 As for the second one, I wasn't able to run a test; the site blocks access 
 from my IP/country:

 This request is blocked by the SonicWALL Gateway Geo IP Service.
 Country Name:Cuba.

 Reading about your experience with this website, it looks like an error in the 
 website's programming. Basically, I'm assuming their logic is: if your User-Agent 
 is not X, Y or Z, then serve the mobile version. This could be worth reporting.
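
 The kind of server-side rule assumed above can be sketched as follows. This is only a guess at the logic, not the site's actual code; the marker list and the function name choose_variant are made up for illustration.

```python
# Hypothetical allow-list of browser markers; the real site's rules are unknown.
DESKTOP_MARKERS = ("Chrome", "Firefox", "MSIE")


def choose_variant(user_agent):
    """Return "desktop" only for recognised browsers, otherwise "mobile".

    A crawler with a distinctive name (or no User-Agent at all) falls
    through to the mobile branch, which matches the behaviour described.
    """
    if user_agent and any(marker in user_agent for marker in DESKTOP_MARKERS):
        return "desktop"
    return "mobile"


if __name__ == "__main__":
    chrome = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
    print(choose_variant(chrome))
    print(choose_variant("MyCrawler/0.1"))
```

 A rule written this way treats every unknown agent as a phone, which is why an honestly named crawler would get the mobile page.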

 Trying to fool the website into treating your bot as a regular user by 
 tweaking the user agent could work for now, but it could draw the 
 webmaster's attention and get your access blocked; this 
 depends a lot on the webmaster :). But for your particular case it could be your 
 only solution, if the webmaster doesn't have a problem with the increase in 
 traffic.

 Regards,

 - Original Message -
 From: Meraj A. Khan mera...@gmail.com
 To: user@nutch.apache.org
 Sent: Saturday, February 28, 2015 12:09:47 AM
 Subject: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a 
 browser?

 Hi Jorge,

 Yes, I was exploring changing the http.agent.name property value for
 cases where sites either serve the mobile version or outright deny
 the request if no agent is specified.

 For example, the following URL will give a "Request Rejected" response if
 the User-Agent is not specified.

 http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod

 And the following URL will serve a mobile version.

 http://www.techforless.com/cgi-bin/tech4less/60PN5000.

 So is it good practice to set http.agent.name to something
 like the below, to mimic a Chrome browser?

 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
 Chrome/41.0.2228.0 Safari/537.36
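
 Outside of Nutch, one way to check how a site reacts to different agents is to send the same request with and without the browser string and compare the results. The sketch below uses Python's standard urllib; the helper names build_request and fetch_status are illustrative, and it does not reproduce how Nutch itself assembles its User-Agent header.

```python
import urllib.error
import urllib.request

CHROME_UA = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")


def build_request(url, user_agent=None):
    """Build a GET request, setting User-Agent only when one is given."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_status(url, user_agent=None, timeout=10):
    """Return (HTTP status, body length). Requires network access.

    Note: when no agent is given, urllib still sends its own default
    "Python-urllib/x.y" agent, so this is not a truly agent-less request.
    """
    try:
        with urllib.request.urlopen(build_request(url, user_agent),
                                    timeout=timeout) as resp:
            return resp.status, len(resp.read())
    except urllib.error.HTTPError as err:
        # Rejections (4xx/5xx) arrive as exceptions; report the code.
        return err.code, 0
```

 Calling fetch_status(url) and fetch_status(url, CHROME_UA) for the same URL and comparing status and body length is usually enough to see whether the agent string changes the response.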

 On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González
 jlbetanco...@uci.cu wrote:
 Hi Meraj,

 Can you provide an example URL and explain exactly what you're after? If the 
 page you're trying to fetch has a lot of JavaScript/AJAX, keep in mind that 
 browsers do a lot of work on the downloaded page. For instance, when 
 you open a page, the HTML is downloaded, the referenced CSS files are 
 fetched and applied to the HTML (along with inline styles, etc.), and any 
 referenced JavaScript is also downloaded and executed on top of the 
 loaded DOM (as are inline script tags). The same applies to fonts, etc. The 
 browser knows how to deal with all these resources, and the CSS is 
 applied depending on which browser you're using. The Nutch crawler only 
 knows 

Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

2015-03-02 Thread Meraj A. Khan
Thanks Jorge, I appreciate your help.

 On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González
 jlbetanco...@uci.cu wrote:
 Hi Meraj,

 Can you provide an example URL and explain exactly what you're after? If the 
 page you're trying to fetch has a lot of JavaScript/AJAX, keep in mind that 
 browsers do a lot of work on the downloaded page. For instance, when 
 you open a page, the HTML is downloaded, the referenced CSS files are 
 fetched and applied to the HTML (along with inline styles, etc.), and any 
 referenced JavaScript is also downloaded and executed on top of the 
 loaded DOM (as are inline script tags). The same applies to fonts, etc. The 
 browser knows how to deal with all these resources, and the CSS is 
 applied depending on which browser you're using. The Nutch crawler only 
 knows about the downloaded HTML (similar to what you see when you view the 
 source code of an HTML webpage); it doesn't know what a CSS style is. 
 Basically, the crawler is only interested in the links and the 
 textual/binary content of the webpage, so when a page is fetched by Nutch, 
 the HTML is downloaded but the other resources (fonts, styles, JavaScript) 
 are not applied to the fetched page.
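
 To make the difference concrete, here is a minimal sketch of what "links plus text from the raw HTML" looks like. It uses Python's standard html.parser rather than Nutch's own parse plugins, and the class name OutlinkParser, the sample markup and the example.com base URL are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkParser(HTMLParser):
    """Collect href/src targets and visible text; nothing is executed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []
        self.text = []
        self._skip = 0  # depth inside <script>/<style>, whose body is not text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        for key in ("href", "src"):
            if attrs.get(key):
                # CSS and JS files are recorded as links, never applied.
                self.outlinks.append(urljoin(self.base_url, attrs[key]))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())


SAMPLE = """<html><head><title>Dressing for the Dark</title>
<link rel="stylesheet" href="/category/templates/css/r_rBrand.css">
<script src="/jsbundle/general.js"></script>
<script>document.title = "changed by JS";</script>
</head><body><a href="/p.prod">Product</a></body></html>"""

if __name__ == "__main__":
    parser = OutlinkParser("http://www.example.com/")
    parser.feed(SAMPLE)
    print(parser.outlinks)
    print(parser.text)
```

 The inline script that would change the title in a browser has no effect here: the title stays as written in the source, and the stylesheet and script only show up as outlinks, much like the CSS and JS entries in the parsechecker output.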

 Tweaking the http.agent.name property in nutch-site.xml will only help 
 with those sites that change their response based on the user agent 
 (one for mobile and a different one for desktop browsers). This approach is 
 being replaced by responsive design, meaning that the user agent is no longer 
 important for how the page is rendered.

 In the current trunk of the upcoming 1.10 version a plugin has been merged 
 that could