Yes, after running the same code on a different server machine the issue was resolved (or disappeared :) ).
Thanks a lot, guys, for your help & support.
Tony.

On Mon, Jun 17, 2013 at 7:17 PM, Ing. Jorge Luis Betancourt Gonzalez <[email protected]> wrote:

> I've experienced a similar issue on my development station running Mac
> 10.8, but the same code worked perfectly on my server VM running Ubuntu,
> so no JIRA issue was created in the end. Also, in my case I was fetching
> image files and not HTML content, and the files were hosted locally, so
> no connection problem was involved.
>
> ----- Original message -----
> From: "feng lu" <[email protected]>
> To: [email protected]
> Sent: Monday, June 17, 2013 10:10:49
> Subject: Re: Incomplete HTML content of a crawled Page in ParseFilter?
>
> Hi Tony
>
> As Coskun said, you can set http.content.limit to -1 (the default is
> 65536); it is not the file.content.limit property. This is its definition
> in nutch-default.xml:
>
> <property>
>   <name>http.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> On Mon, Jun 17, 2013 at 7:58 PM, Tony Mullins <[email protected]> wrote:
>
> > The html in my previous email was incorrect (I was trying a different
> > DNS, thinking it was due to a bad internet connection)...
> > but in short, I am getting an incomplete html response...
> >
> > Is there any property in Nutch which could make it wait for the
> > complete html of a page to load?
> >
> > Thanks,
> > Tony
> >
> > On Mon, Jun 17, 2013 at 4:43 PM, Tony Mullins <[email protected]> wrote:
> >
> > > I have modified these values as follows:
> > >
> > > <property>
> > >   <name>http.timeout</name>
> > >   <value>20000</value>
> > >   <description>The default network timeout, in
> > >   milliseconds.</description>
> > > </property>
> > >
> > > <property>
> > >   <name>file.content.limit</name>
> > >   <value>-1</value>
> > >   <description>The length limit for downloaded content using the file
> > >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> > >   than it will be truncated; otherwise, no truncation at all. Do not
> > >   confuse this setting with the http.content.limit setting.
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>http.max.delays</name>
> > >   <value>200</value>
> > >   <description>The number of times a thread will delay when trying to
> > >   fetch a page. Each time it finds that a host is busy, it will wait
> > >   fetcher.server.delay. After http.max.delays attempts, it will give
> > >   up on the page for now.</description>
> > > </property>
> > >
> > > And I am getting html for the page
> > > http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
> > > like this:
> > >
> > > INFO nutch.selector - page html is <!DOCTYPE HTML>
> > > <html>
> > > <head>
> > > <title>Squarespace - Domain Not Claimed</title>
> > > <meta http-equiv="X-UA-Compatible" content="chrome=1">
> > > <script type="text/javascript" src="//static.squarespace.com/universal/scripts-v6/061620131943271011/yui-seed.js"></script>
> > > <script>
> > > Y = YUI(YUI_CONFIG).use("squarespace-util", "squarespace-ui-base",
> > >     "squarespace-configuration-css", function(Y) {
> > >   Y.on("domready", function() {
> > >     var lb = new Y.Squarespace.Lightbox({
> > >       disableNormalClose: true,
> > >       clickAnywhereToExit: false,
> > >       content: '<div class="bigtext"><div class="title">Domain Not Claimed</div><div class="description">This domain has been mapped to Squarespace, but it has not yet been claimed by a website. If this is your domain, claim it in the Domains tab of your website manager.</div></div>',
> > >       margin: 100,
> > >       noHeightConstrain: true
> > >     });
> > >     lb.show();
> > >     lb.getContentEl().on("click", function(e) {
> > >       if (e.target.ancestor(".login-button", true)) {
> > >         document.location.href = '/config/';
> > >       }
> > >     });
> > >   });
> > > });
> > > </script>
> > > </head>
> > > <body class="squarespace-config squarespace-system-page">
> > > <div class="minimal-logo"> </div>
> > > </body>
> > > </html>
> > >
> > > So as you can see, it's not loading the complete page...
> > >
> > > Is there any other property that I need to modify?
> > >
> > > Thanks,
> > > Tony.
> > >
> > > On Mon, Jun 17, 2013 at 4:13 PM, H. Coskun Gunduz <[email protected]> wrote:
> > >
> > >> Hi Tony,
> > >>
> > >> You may need to add the http.content.limit parameter in the
> > >> nutch-site.xml file.
> > >>
> > >> For size-unlimited crawling:
> > >>
> > >> <property>
> > >>   <name>http.content.limit</name>
> > >>   <value>-1</value>
> > >>   <description>The length limit for downloaded content using the http://
> > >>   protocol, in bytes. If this value is nonnegative (>=0), content longer
> > >>   than it will be truncated; otherwise, no truncation at all. Do not
> > >>   confuse this setting with the file.content.limit setting.
> > >>   </description>
> > >> </property>
> > >>
> > >> Please refer to: http://wiki.apache.org/nutch/nutch-default.xml
> > >>
> > >> Kind regards,
> > >> coskun
> > >>
> > >> On 06/17/2013 02:05 PM, Tony Mullins wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I am trying to crawl this url
> > >>> http://www.amazon.com/Levis-Mens-550-Relaxed-Jean/dp/B0018OKX68
> > >>> and getting the crawled page content in my ParseFilter plugin like this:
> > >>>
> > >>> String html = new String(webPage.getContent().array());
> > >>>
> > >>> Then I am using this html to extract my required information...
> > >>>
> > >>> But it's not returning me the complete html of the page. I have logged
> > >>> the 'html' and I can see that the log file contains incomplete html for
> > >>> the above link...
> > >>>
> > >>> Is there any size limit on a page's content? Or am I doing something
> > >>> wrong here?
> > >>>
> > >>> Thanks,
> > >>> Tony.
>
> --
> Don't Grow Old, Grow Up... :-)
>
> http://www.uci.cu
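[Editor's summary] The setting the thread converges on is http.content.limit, and overrides belong in conf/nutch-site.xml (nutch-default.xml only documents the defaults). A minimal sketch of the override discussed above, with -1 disabling truncation:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- -1 disables truncation; the default limit is 65536 bytes -->
    <value>-1</value>
  </property>
</configuration>
```

Note that in Tony's case the limit was apparently not the root cause: the "Domain Not Claimed" page is a short but complete document served by a Squarespace placeholder, which fits the thread's resolution that the problem disappeared on a different machine (and Tony's mention of trying a different DNS).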
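[Editor's note] Independent of the truncation issue, the thread's one code line, `new String(webPage.getContent().array())`, has two pitfalls: `ByteBuffer.array()` returns the entire backing array, ignoring the buffer's position and limit (and throws for buffers without an accessible backing array), and the no-charset `String` constructor uses the platform default encoding. A minimal standalone sketch of a safer decode follows; `ContentDecoder` is a hypothetical helper, and UTF-8 is an assumption here — in practice the charset should come from the page's Content-Type header or metadata.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentDecoder {

    // Decode only the buffer's remaining bytes (position..limit), with an
    // explicit charset instead of the platform default that the bare
    // new String(bytes) constructor would silently use.
    public static String decode(ByteBuffer content) {
        byte[] bytes = new byte[content.remaining()];
        // duplicate() reads the bytes without disturbing the caller's
        // position, so the buffer can still be consumed elsewhere.
        content.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(
                "<html>h\u00e9llo</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(buf));
        // The original buffer is still fully readable afterwards.
        System.out.println(buf.remaining() > 0);
    }
}
```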

