Yes, after running the same code on a different server machine the issue was resolved (or disappeared :) ).
Thanks a lot, guys, for your help & support.
Tony.

On Mon, Jun 17, 2013 at 7:17 PM, Ing. Jorge Luis Betancourt Gonzalez <[email protected]> wrote:

> I've experienced a similar issue on my development station running Mac
> 10.8, but the same code worked perfectly on my server VM running Ubuntu,
> so no JIRA issue was created in the end. Also, in my case I was fetching
> image files and not HTML content, and the files were hosted locally, so
> no connection problem was involved.
>
> ----- Original message -----
> From: "feng lu" <[email protected]>
> To: [email protected]
> Sent: Monday, June 17, 2013 10:10:49
> Subject: Re: Incomplete HTML content of a crawled Page in ParseFilter?
>
> Hi Tony
>
> As Coskun said, you can set http.content.limit to -1 (the default is
> 65536); it is not the file.content.limit property. This is its definition
> in nutch-default.xml:
>
> <property>
>   <name>http.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> On Mon, Jun 17, 2013 at 7:58 PM, Tony Mullins <[email protected]> wrote:
>
> > The html in my previous email was incorrect (I was trying a different
> > DNS, thinking it was due to a bad internet connection)...
> > but in short, I am getting an incomplete html response...
> >
> > Is there any property in Nutch which could make it wait for the
> > complete html of a page to load?
> >
> > Thanks,
> > Tony
> >
> > On Mon, Jun 17, 2013 at 4:43 PM, Tony Mullins <[email protected]> wrote:
> >
> > > I have modified these values as follows:
> > >
> > > <property>
> > >   <name>http.timeout</name>
> > >   <value>20000</value>
> > >   <description>The default network timeout, in
> > >   milliseconds.</description>
> > > </property>
> > >
> > > <property>
> > >   <name>file.content.limit</name>
> > >   <value>-1</value>
> > >   <description>The length limit for downloaded content using the file
> > >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> > >   than it will be truncated; otherwise, no truncation at all. Do not
> > >   confuse this setting with the http.content.limit setting.
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>http.max.delays</name>
> > >   <value>200</value>
> > >   <description>The number of times a thread will delay when trying to
> > >   fetch a page. Each time it finds that a host is busy, it will wait
> > >   fetcher.server.delay. After http.max.delays attempts, it will give
> > >   up on the page for now.</description>
> > > </property>
> > >
> > > And I am getting html for the page
> > > http://www.amazon.com/Degree-Antiperspirant-Deodorant-Extreme-Blast/dp/B001ET769Y
> > > like this:
> > >
> > > INFO nutch.selector - page html is <!DOCTYPE HTML>
> > > <html>
> > > <head>
> > > <title>Squarespace - Domain Not Claimed</title>
> > > <meta http-equiv="X-UA-Compatible" content="chrome=1">
> > > <script type="text/javascript" src="//static.squarespace.com/universal/scripts-v6/061620131943271011/yui-seed.js"></script>
> > > <script>
> > > Y = YUI(YUI_CONFIG).use("squarespace-util", "squarespace-ui-base",
> > >     "squarespace-configuration-css", function(Y) {
> > >   Y.on("domready", function() {
> > >     var lb = new Y.Squarespace.Lightbox({
> > >       disableNormalClose: true,
> > >       clickAnywhereToExit: false,
> > >       content: '<div class="bigtext"><div class="title">Domain Not Claimed</div><div class="description">This domain has been mapped to Squarespace, but it has not yet been claimed by a website. If this is your domain, claim it in the Domains tab of your website manager.</div></div>',
> > >       margin: 100,
> > >       noHeightConstrain: true
> > >     });
> > >     lb.show();
> > >     lb.getContentEl().on("click", function(e) {
> > >       if (e.target.ancestor(".login-button", true)) {
> > >         document.location.href = '/config/';
> > >       }
> > >     });
> > >   });
> > > });
> > > </script>
> > > </head>
> > > <body class="squarespace-config squarespace-system-page">
> > > <div class="minimal-logo"> </div>
> > > </body>
> > > </html>
> > >
> > > So as you can see, it's not loading the complete page...
> > >
> > > Is there any other property that I need to modify?
> > >
> > > Thanks,
> > > Tony.
> > >
> > > On Mon, Jun 17, 2013 at 4:13 PM, H. Coskun Gunduz <[email protected]> wrote:
> > >
> > >> Hi Tony,
> > >>
> > >> You may need to add the http.content.limit parameter in the
> > >> nutch-site.xml file.
> > >>
> > >> For size-unlimited crawling:
> > >>
> > >> <property>
> > >>   <name>http.content.limit</name>
> > >>   <value>-1</value>
> > >>   <description>The length limit for downloaded content using the http://
> > >>   protocol, in bytes. If this value is nonnegative (>=0), content longer
> > >>   than it will be truncated; otherwise, no truncation at all. Do not
> > >>   confuse this setting with the file.content.limit setting.
> > >>   </description>
> > >> </property>
> > >>
> > >> Please refer to: http://wiki.apache.org/nutch/nutch-default.xml
> > >>
> > >> Kind regards,
> > >> coskun
> > >>
> > >> On 06/17/2013 02:05 PM, Tony Mullins wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I am trying to crawl this url
> > >>> http://www.amazon.com/Levis-Mens-550-Relaxed-Jean/dp/B0018OKX68
> > >>> and getting the crawled page content in my ParseFilter plugin like this:
> > >>>
> > >>> String html = new String(webPage.getContent().array());
> > >>>
> > >>> Then I am using this html to extract my required information...
> > >>>
> > >>> But it's not returning me the complete html of the page. I have logged
> > >>> the 'html' and I can see that the log file contains incomplete html for
> > >>> the above link...
> > >>>
> > >>> Is there any size limit on a page's content? Or am I doing something
> > >>> wrong here?
> > >>>
> > >>> Thanks,
> > >>> Tony.
>
> --
> Don't Grow Old, Grow Up... :-)
>
> http://www.uci.cu
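[Editor's summary] The setting the thread converges on is http.content.limit, and overrides belong in conf/nutch-site.xml (nutch-default.xml only documents the defaults). A minimal sketch of the override discussed above, with -1 disabling truncation:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- -1 disables truncation; the default limit is 65536 bytes -->
    <value>-1</value>
  </property>
</configuration>
```

Note that in Tony's case the limit was apparently not the root cause: the "Domain Not Claimed" page is a short but complete document served by a Squarespace placeholder, which fits the thread's resolution that the problem disappeared on a different machine (and Tony's mention of trying a different DNS).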
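[Editor's note] Independent of the truncation issue, the thread's one code line, `new String(webPage.getContent().array())`, has two pitfalls: `ByteBuffer.array()` returns the entire backing array, ignoring the buffer's position and limit (and throws for buffers without an accessible backing array), and the no-charset `String` constructor uses the platform default encoding. A minimal standalone sketch of a safer decode follows; `ContentDecoder` is a hypothetical helper, and UTF-8 is an assumption here — in practice the charset should come from the page's Content-Type header or metadata.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentDecoder {

    // Decode only the buffer's remaining bytes (position..limit), with an
    // explicit charset instead of the platform default that the bare
    // new String(bytes) constructor would silently use.
    public static String decode(ByteBuffer content) {
        byte[] bytes = new byte[content.remaining()];
        // duplicate() reads the bytes without disturbing the caller's
        // position, so the buffer can still be consumed elsewhere.
        content.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(
                "<html>h\u00e9llo</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(buf));
        // The original buffer is still fully readable afterwards.
        System.out.println(buf.remaining() > 0);
    }
}
```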

