I have modified these values as follows:

<property>
  <name>http.timeout</name>
  <value>20000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file
   protocol, in bytes. If this value is nonnegative (>=0), content longer
   than it will be truncated; otherwise, no truncation at all. Do not
   confuse this setting with the http.content.limit setting.
  </description>
</property>

<property>
  <name>http.max.delays</name>
  <value>200</value>
  <description>The number of times a thread will delay when trying to
  fetch a page. Each time it finds that a host is busy, it will wait
  fetcher.server.delay. After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>
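As a side note, the truncation rule that the content-limit descriptions above spell out can be sketched like this. This is only an illustrative sketch, not Nutch's actual code, and `applyLimit` is a hypothetical helper name:

```java
import java.util.Arrays;

public class ContentLimit {
    // Illustrative sketch (not Nutch's actual code) of the rule the
    // *.content.limit descriptions spell out: a nonnegative limit
    // truncates the fetched bytes, while -1 (or any negative value)
    // disables truncation entirely.
    static byte[] applyLimit(byte[] content, int limit) {
        if (limit >= 0 && content.length > limit) {
            return Arrays.copyOf(content, limit); // truncated copy
        }
        return content; // no truncation
    }

    public static void main(String[] args) {
        byte[] page = new byte[100_000];
        System.out.println(applyLimit(page, 65536).length); // 65536
        System.out.println(applyLimit(page, -1).length);    // 100000
    }
}
```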

And I am getting HTML for the page
http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
like this:

INFO  nutch.selector - page html is <!DOCTYPE HTML>
<html>
<head>

  <title>Squarespace - Domain Not Claimed</title>
  <meta http-equiv="X-UA-Compatible" content="chrome=1">

  <script type="text/javascript" src="//static.squarespace.com/universal/scripts-v6/061620131943271011/yui-seed.js"></script>

  <script>

    Y = YUI(YUI_CONFIG).use("squarespace-util", "squarespace-ui-base",
"squarespace-configuration-css",  function(Y) {

      Y.on("domready", function() {

        var lb = new Y.Squarespace.Lightbox({
          disableNormalClose: true,
          clickAnywhereToExit: false,
          content: '<div class="bigtext"><div class="title">Domain Not Claimed</div><div class="description">This domain has been mapped to Squarespace, but it has not yet been claimed by a website.  If this is your domain, claim it in the Domains tab of your website manager.</div></div>',
          margin: 100,
          noHeightConstrain: true
        });

        lb.show();

        lb.getContentEl().on("click", function(e) {
          if (e.target.ancestor(".login-button", true)) {
            document.location.href = '/config/';
          }
        });

      });

    });

  </script>


</head>
<body class="squarespace-config squarespace-system-page">

  <div class="minimal-logo">&nbsp;</div>

</body>
</html>

So as you can see, it's not loading the complete page.

Is there any other property that I need to modify?

Thanks
Tony.



On Mon, Jun 17, 2013 at 4:13 PM, H. Coskun Gunduz
<[email protected]> wrote:

> Hi Tony,
>
> You may need to add http.content.limit parameter in nutch-site.xml file.
>
> for size-unlimited crawling:
>
> <property>
>         <name>http.content.limit</name>
>         <value>-1</value>
>         <description>The length limit for downloaded content using the http
>             protocol, in bytes. If this value is nonnegative (>=0), content
>             longer than it will be truncated; otherwise, no truncation at
>             all. Do not confuse this setting with the file.content.limit
>             setting.
>         </description>
>     </property>
>
>
> Please refer to: 
> http://wiki.apache.org/nutch/nutch-default.xml
>
> Kind regards..
> coskun...
>
>
> On 06/17/2013 02:05 PM, Tony Mullins wrote:
>
>> Hi ,
>>
>> I am trying to crawl this url
>> http://www.amazon.com/Levis-Mens-550-Relaxed-Jean/dp/B0018OKX68
>> and getting the crawled page content in my ParseFilter plugin like this:
>> String html = new String(webPage.getContent().array());
>> Then I am using this html to extract my required information....
>>
>> But it's not returning the complete HTML of the page. I have logged the
>> 'html' and I can see that the log file contains incomplete HTML for the
>> above link.
>>
>> Is there any size limit on the page's content? Or am I doing something
>> wrong here?
>>
>> Thanks,
>> Tony.
>>
>>
>
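One caveat about the ParseFilter snippet quoted above: in Nutch 2.x, `webPage.getContent()` returns a `java.nio.ByteBuffer`, and `ByteBuffer.array()` exposes the entire backing array, which can be larger than the actual content. A safer pattern copies only the bytes between the buffer's position and limit. Here is a hedged sketch with a plain `ByteBuffer`; `toHtml` is a hypothetical helper, not part of Nutch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferToString {
    // Hypothetical helper (not part of Nutch): ByteBuffer.array() returns
    // the whole backing array, which may be larger than the content, so
    // copy only the bytes between position and limit instead.
    static String toHtml(ByteBuffer content) {
        byte[] bytes = new byte[content.remaining()];
        content.duplicate().get(bytes); // duplicate() keeps the original position intact
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64); // oversized backing array
        buf.put("<html>ok</html>".getBytes(StandardCharsets.UTF_8));
        buf.flip();
        System.out.println(toHtml(buf)); // copies only the 15 content bytes
    }
}
```

Using `duplicate()` leaves the original buffer's position untouched, so the same content can still be read again elsewhere in the plugin.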
