The html in my previous email was incorrect (I was trying different DNS
servers, thinking the problem was due to a bad internet connection)...
but in short, I am getting an incomplete html response....

Is there any property in Nutch which could make it wait for the complete
html to load?

Thanks,
Tony


On Mon, Jun 17, 2013 at 4:43 PM, Tony Mullins <[email protected]> wrote:

> I have modified these values as
>
> <property>
>   <name>http.timeout</name>
>   <value>20000</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
>
> <property>
>   <name>file.content.limit</name>
>   <value>-1</value>
>
>   <description>The length limit for downloaded content using the file
>    protocol, in bytes. If this value is nonnegative (>=0), content longer
>    than it will be truncated; otherwise, no truncation at all. Do not
>    confuse this setting with the http.content.limit setting.
>   </description>
> </property>
>
> <property>
>   <name>http.max.delays</name>
>   <value>200</value>
>   <description>The number of times a thread will delay when trying to
>   fetch a page.  Each time it finds that a host is busy, it will wait
>   fetcher.server.delay.  After http.max.delays attempts, it will give
>   up on the page for now.</description>
> </property>
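
A side note on the properties above: since these are http:// URLs, the size cap that actually applies is http.content.limit; file.content.limit only affects content fetched over the file protocol. A sketch of the unlimited-size setting for nutch-site.xml, mirroring the file.content.limit entry above (worth double-checking against the nutch-default.xml shipped with your version):

```xml
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all.
  </description>
</property>
```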
>
> And I am getting html for the page
> http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
> like this:
>
> INFO  nutch.selector - page html is <!DOCTYPE HTML>
> <html>
> <head>
>
>   <title>Squarespace - Domain Not Claimed</title>
>   <meta http-equiv="X-UA-Compatible" content="chrome=1">
>
>   <script type="text/javascript" src="//
> static.squarespace.com/universal/scripts-v6/061620131943271011/yui-seed.js
> "></script>
>
>   <script>
>
>     Y = YUI(YUI_CONFIG).use("squarespace-util", "squarespace-ui-base",
> "squarespace-configuration-css",  function(Y) {
>
>       Y.on("domready", function() {
>
>         var lb = new Y.Squarespace.Lightbox({
>           disableNormalClose: true,
>           clickAnywhereToExit: false,
>           content: '<div class="bigtext"><div class="title">Domain Not
> Claimed</div><div class="description">This domain has been mapped to
> Squarespace, but it has not yet been claimed by a website.  If this is your
> domain, claim it in the Domains tab of your website manager.</div></div>',
>           margin: 100,
>           noHeightConstrain: true
>         });
>
>         lb.show();
>
>         lb.getContentEl().on("click", function(e) {
>           if (e.target.ancestor(".login-button", true)) {
>             document.location.href = '/config/';
>           }
>         });
>
>       });
>
>     });
>
>   </script>
>
>
> </head>
> <body class="squarespace-config squarespace-system-page">
>
>   <div class="minimal-logo">&nbsp;</div>
>
> </body>
> </html>
>
> So as you can see, it's not loading the complete page....
>
> Is there any other property that I need to modify ?
>
> Thanks
> Tony.
>
>
>
> On Mon, Jun 17, 2013 at 4:13 PM, H. Coskun Gunduz <
> [email protected]> wrote:
>
>> Hi Tony,
>>
>> You may need to add http.content.limit parameter in nutch-site.xml file.
>>
>> for size-unlimited crawling:
>>
>> <property>
>>         <name>http.content.limit</name>
>>         <value>-1</value>
>>         <description>The length limit for downloaded content using the
>>             http protocol, in bytes. If this value is nonnegative (>=0),
>>             content longer than it will be truncated; otherwise, no
>>             truncation at all. Do not confuse this setting with the
>>             file.content.limit setting.
>>         </description>
>>     </property>
>>
>>
>> Please refer to:
>> http://wiki.apache.org/nutch/nutch-default.xml
>>
>> Kind regards..
>> coskun...
>>
>>
>> On 06/17/2013 02:05 PM, Tony Mullins wrote:
>>
>>> Hi ,
>>>
>>> I am trying to crawl this url
>>> http://www.amazon.com/Levis-Mens-550-Relaxed-Jean/dp/B0018OKX68
>>> and I am getting the crawled page content in my ParseFilter plugin like this:
>>> String html = new String(webPage.getContent().array());
>>> Then I am using this html to extract my required information....
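
A side note on the line above: `new String(bytes)` without a charset decodes with the platform default encoding, and `ByteBuffer.array()` returns the whole backing array, which can be larger than the valid content. A minimal stand-alone sketch of a safer decode — the `decode` helper and `DecodeSketch` class are hypothetical, standing in for the `webPage.getContent()` call above:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DecodeSketch {

    // Hypothetical helper: decode a fetched-content buffer (as returned by
    // something like webPage.getContent()) using only the valid bytes and
    // an explicit charset instead of the platform default.
    static String decode(ByteBuffer content) {
        byte[] raw = new byte[content.remaining()];
        content.get(raw); // copies position..limit, not the whole backing array
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.wrap(
                "<html>check</html>".getBytes(StandardCharsets.UTF_8));
        String html = decode(page);
        // Logging the decoded length is a quick way to see whether content
        // is being cut off at a fixed limit.
        System.out.println(html.length() + " chars");
    }
}
```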
>>>
>>> But it's not returning the complete html of the page. I have logged the
>>> 'html' and I can see that the log file contains incomplete html for the
>>> above link....
>>>
>>> Is there any size limit on a page's content? Or am I doing something
>>> wrong here?
>>>
>>> Thanks,
>>> Tony.
>>>
>>>
>>
>
