Re: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

Meraj A. Khan Mon, 02 Mar 2015 13:01:33 -0800

Jorge,

I think I spoke too soon. If I use the protocol-httpclient plugin, I
am unable to fetch any page using the parsechecker.


I get a "[Fatal Error] :1:1: Content is not allowed in prolog." error.

Are there any known issues with using protocol-httpclient? I am using
Nutch 1.7 and have the following settings in my nutch-site.xml:

    <!-- Added based on the suggestion from nutch mailing list -->
    <property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>


    <property>
        <name>http.useHttp11</name>
        <value>true</value>
        <description>NOTE: at the moment this works only for
            protocol-httpclient.
            If true, use HTTP 1.1, if false use HTTP 1.0 .
        </description>
    </property>
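
For context, the user-agent change discussed in the quoted messages below
would be one more property in the same nutch-site.xml. This is only a minimal
sketch: it assumes the Chrome string quoted later in this thread is used as the
value, and it has not been verified against either site. As far as I
understand, Nutch may also combine other http.agent.* settings (such as the
version) into the User-Agent header it actually sends, so the header on the
wire may not match the value exactly.

```xml
<!-- Sketch only: browser-like user agent; the value is the Chrome string
     quoted in this thread, used here as an assumption. -->
<property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36</value>
</property>
```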


Thanks.

On Sun, Mar 1, 2015 at 10:05 PM, Jorge Luis Betancourt González
<[email protected]> wrote:
> The general answer is: it depends. It is usually "polite" to identify your 
> robot to the website so the webmaster knows what is accessing the site; this 
> is why Google and a lot of other search engines (big and small) use a 
> distinctive name for their crawlers/bots. That being said, the first site 
> that you mention works fine in a quick parsechecker run that I executed:
>
> ➜  local  bin/nutch parsechecker http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
> fetching: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
> parsing: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
> contentType: text/html
> signature: 8e90c6d581f27c36828d433f746e4d7a
> ---------
> Url
> ---------------
>
> http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
> ---------
> ParseData
> ---------
>
> Version: 5
> Status: success(1,0)
> Title: "Dressing for the Dark"
> Outlinks: 151
>   outlink: toUrl: http://www.neimanmarcus.com/cssbundle/1468949595/bundles/product_rwd.css anchor:
>   outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rBrand.css anchor:
>   outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rProduct.css anchor:
>   outlink: toUrl: http://www.neimanmarcus.com/jsbundle/2144966094/bundles/general_rwd.js anchor:
> ...
>
> (trimmed due to length)
>
> As for the second one, I wasn't able to run a test; the provided site blocks 
> access from my IP/country:
>
> This request is blocked by the SonicWALL Gateway Geo IP Service.
> Country Name:Cuba.
>
> Reading your experience with this website, it looks like an error in the 
> website's programming. I'm assuming they are saying: if your User-Agent is 
> not X, Y or Z, then serve the mobile version. This could be worth reporting.
>
> Trying to fool the website into thinking that your bot is a regular user by 
> tweaking the user agent could work for now, but it could draw the 
> webmaster's attention and be a cause for blocking your access; this depends 
> a lot on the webmaster :). For your particular case, though, it could be 
> your only solution, provided the webmaster doesn't have a problem with the 
> increase in traffic.
>
> Regards,
>
> ----- Original Message -----
> From: "Meraj A. Khan" <[email protected]>
> To: [email protected]
> Sent: Saturday, February 28, 2015 12:09:47 AM
> Subject: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a 
> browser?
>
> Hi Jorge,
>
> Yes, I was exploring changing the http.agent.name property value in
> cases where the sites either serve the mobile version or outright deny
> the request if no agent is specified.
>
> For example, the following URL will give a Request Rejected response if
> the User-Agent is not specified.
>
> http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
>
> And the following URL will serve a mobile version.
>
> http://www.techforless.com/cgi-bin/tech4less/60PN5000.
>
> So is it good practice to set http.agent.name to something
> like the below, to mimic a Chrome browser?
>
> Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
> Chrome/41.0.2228.0 Safari/537.36
>
> On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González
> <[email protected]> wrote:
>> Hi Meraj,
>>
>> Can you provide an example URL and explain exactly what you're after? If the 
>> page you're trying to fetch has a lot of JavaScript/AJAX, keep in mind that 
>> browsers do a lot of work on the downloaded page. For instance, when you 
>> open a page, the HTML is downloaded, and the referenced CSS files are also 
>> fetched and applied to the HTML (along with inline styles, etc.). Any 
>> referenced JavaScript is also downloaded and executed on top of the loaded 
>> DOM (as are inline script tags). The same applies to fonts, etc. Browsers 
>> "know" how to deal with all these resources, and the CSS is applied 
>> depending on which browser you're using. The Nutch crawler only knows about 
>> the downloaded HTML (similar to what you see when you view the source code 
>> of an HTML page); it doesn't know what a CSS style is. Basically, the 
>> crawler is only interested in the links and the textual/binary content of 
>> the page, so when a page is fetched by Nutch, the HTML is downloaded but 
>> the other resources (fonts, styles, JavaScript) are not applied to the 
>> fetched page.
>>
>> Tweaking the http.agent.name property in nutch-site.xml will only help 
>> with those sites that change their response based on the user agent (one 
>> version for mobile and a different one for desktop browsers). This approach 
>> is being replaced by responsive design, meaning that the user agent is not 
>> important for how the page is rendered.
>>
>> In the current trunk of the upcoming 1.10 version, a plugin has been merged 
>> that could address this. Basically, this plugin uses Selenium to render the 
>> page and then feeds Nutch the resulting HTML, meaning that AJAX/JavaScript 
>> interactions will be present in the content that Nutch will parse in the 
>> next stage.
>>
>> Also we need more information about your use case or what you're trying to 
>> accomplish.
>>
>> Hope it helps,
>>
>> Regards,
>>
>> ----- Original Message -----
>> From: "Meraj A. Khan" <[email protected]>
>> To: [email protected]
>> Sent: Friday, February 27, 2015 12:47:06 AM
>> Subject: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?
>>
>> In some instances the content that is downloaded in the Fetch phase from an
>> HTTP URL is not what you would get if you were to make the request
>> from a well-known browser such as Google Chrome. That is
>> because the server is expecting a user agent value that represents a
>> browser.
>>
>> There is an http.agent.name property in nutch-site.xml. Is it the
>> property that should be used to set the user agent, to make the server
>> respond to a Nutch GET request the same way as it would to a request
>> from a browser? Or is there another configurable property?
>>
>> For example the user agent value for a Chrome browser is below.
>>
>> Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
>> Chrome/41.0.2228.0 Safari/537.36
>>
>>
>> Thanks.
